Outlier Detection Among Influencer Blogs Based on Off-Site Web Analytics Data

With the exponential increase in internet use, organizations are continuously striving for visibility on the web. This has opened new avenues in influencer marketing. Several portals encourage influencers to build content for the purpose of digital marketing. However, when content is built in bulk, the process produces a lot of spam within these websites, often through techniques such as article spinning and keyword stuffing that are used to establish a web presence. This study therefore attempts to identify such spam websites in a dataset of 2751 websites using bio-inspired outlier detection approaches. We use publicly available key performance indicators (KPIs) to identify websites that create spam content to boost the amount of text in the domain. Hybrid approaches that integrate the wolf search algorithm (WSA) and the bat algorithm (BA) with K-means are used to classify these websites as spam or authentic. Findings indicate that metrics including Domain Authority, Page Authority, Moz Rank, Links In, External Equity Links, Spam Score, Alexa Rank, Citation Flow, Trust Flow, External Back Links, Referred Domains, SemRush URL Links and SemRush Hostname Links play an important role in identifying spam. The proposed approach may prove beneficial in segregating spam influencer websites for effective influencer marketing.


Background
The exponential increase in internet use across the world in this era of digitization has become an important source of competitive edge for the marketing of products and services [1]. This explosion of digital marketing has completely revamped the way business is done and also affects the brand positioning strategy of organizations [2]. Organizations have realized the importance of web visibility for better customer engagement [3]. They have thus started adopting ways to artificially boost their presence on the web using digital marketing, opening new avenues for influencer marketing in particular. Influencer marketing is an approach to marketing that focuses on individuals who advise decision-making consumers. Such people are referred to as influencers and often play a critical role in the customer engagement process [4]. These influencers often need to build a huge amount of content in order to maximize web visibility.
The use of web analytics for enhancing digital marketing has been in practice for the last few decades. However, organizations are still not able to fully utilize the core potential of these techniques for improving their web visibility. Studies highlight opportunities and practices in web analytics that organizations may adopt for better online marketing [5]. Web analytics falls into two primary categories: on-site analytics (a measure of actual visitors on the website) and off-site analytics (tools that measure the website audience) [6]. One primary reason for failing to achieve the desired promotion from web analytics in online marketing is inexperienced and unskilled influencers. To expedite the process, these influencers use unethical practices such as artificially generating keywords and links to build low quality content. This not only results in ineffective off-site analytics but may even prove detrimental to the customer if detected by search engines [6,7]. Since Google's Panda and subsequent updates, such malpractices for artificially boosting a website's rank on search engine results pages have resulted in penalization and delisting from search engines [8]. This study thus primarily focuses on identifying outlier influencer websites for the purpose of effective off-site web analytics.
There are several freelancing platforms, including Blogmint, Influencer, Upwork and Craigslist, that let freelancers build content on topics that may be utilized for generating back links and keywords for the customer website [9,10]. These techniques attract traffic to the customer website and artificially boost the website rank. However, to expedite the process, influencers generate low quality content that is often not original and use techniques such as article spinning, keyword stuffing, link building and link farming [6], which makes website quality a key driver for successful e-business [11]. The customer is often not aware of the adverse effects of such techniques, and in the long run these may even lead to penalization by search engines. Studies in the literature also discuss website selection for advertising campaigns [13]. To avoid such spam within the website, our study proposes an outlier detection approach that uses website KPIs to identify spam influencer websites that indulge in low quality content building. Metrics such as page rank, page authority, domain authority, Alexa rank, Google index, social shares, trust flow, citation flow, links, external equity links, external back links, referred domains and domain age are used as indicators for identifying spam influencer websites. A spam score is further associated with each of the 2751 websites considered for the analysis. Bio-inspired wolf search and bat algorithms integrated with K-means are subsequently used for segregating the outlier websites.

Research Methodology
This study uses a mixed research methodology in which data is collected on the website KPIs for 2751 influencer blogs on unique domains. A statistical t-test is conducted on the normalized data for the two sets of influencer web domains, with low and high spam score. Further, the significant metrics are used as KPIs for analyzing whether the influencer is spam or not, using bio-inspired optimization approaches integrated with K-means for mining outliers. The subsequent sub-sections discuss the analysis in detail.

Data collection and metric identification
The data is extracted through an API from the SEO Rank website (https://seorank.my-addr.com/), which provides a holistic list of selected metrics supplied by various data providers such as Majestic [13], Ahrefs [14], Moz [15], SemRush and Webmaster tools. These data providers have developed ranking mechanisms that are used worldwide for identifying the position of a page in organic search. A list of metrics considered for the analysis is given in Table 1.
Table 1. Description of website metrics for off-site analytics and digital marketing.

No. | Data Provider | Metric | Description
1 | Moz | Domain Authority | Prediction of the ranking of the domain on search engines. Depends on links, Moz Rank and other metrics.
2 | Moz | Page Authority | Prediction of how a given URL may be ranked on search engines, associated with number of links, Moz Rank, and others.
3 | Moz | Moz Rank | Link popularity score indicative of the importance of the page on the web.
4 | Moz | Moz Trust | Link trust; checks for links from trustworthy sources.
5 | Moz | Links In | Links to the web page, including equity and non-equity links, both internal and external.
6 | Moz | External Equity Links | Number of external equity links to the URL.
7 | Moz | Spam Score | Based on the number of penalized (de-listed) sites containing links to the web page.
8 | Alexa | Alexa Rank | Global Alexa rank of the webpage.
 |  | Live & Fresh Index | List of live and dead links for the website.

A total of 21 metrics are considered for the study; the data providers are mentioned along with the metrics. This study uses a collective list of the metrics as KPIs for detecting spam influencer websites. The spam score is used as the criterion for dividing the data set into two for identifying the statistical significance of the metrics for subsequent analysis.

Statistical Analysis
Two equal samples of 500 influencer websites each, one with a spam score less than 5 and the other with a spam score greater than 5, are drawn from the dataset for a statistical t-test that identifies the metrics that differ significantly between the two sets. Since the range of values varies considerably across metrics, min-max normalization is used to rescale the data to a 0-1 range. Subsequently, the t-test is conducted; metrics with a p-value less than 0.05 are considered significant and are retained, while the remaining metrics are dropped from further analysis. The 12 significant metrics are listed in Table 2. The final dataset for analysis thus comprises 13 attributes for 2751 influencer websites: the Spam Score together with the 12 significant metrics, namely Domain Authority (DA), Page Authority (PA), Moz Rank (MR), Links In (LI), External Equity Links (ELL), Alexa Rank (AR), Citation Flow (CF), Trust Flow (TF), External Back Links (EBL), Referred Domains (RD), SemRush URL Links (UL) and SemRush Hostname Links (HL). Subsequent subsections model the identified metrics to segregate outliers using bio-inspired computing algorithms [16].
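The normalization and significance test can be illustrated with a minimal Python sketch. It is not the study's implementation: the toy values, the function names, the choice of Welch's unequal-variance form of the t-test, and the normal approximation used for the p-value (reasonable only for large samples such as 500 per group) are all assumptions made for illustration. It applies the conventional rule that a p-value below 0.05 marks a significant difference.

```python
import math
from statistics import NormalDist, mean, variance

def min_max_normalize(values):
    """Rescale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: map every value to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def welch_t_test(a, b):
    """Return (t, p) for a two-sided two-sample test with unequal variances.

    Assumption: the p-value uses a normal approximation to the t
    distribution, which is only adequate for large samples.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p

# Hypothetical toy values for one metric (not taken from the study's data).
metric = [10, 12, 11, 13, 12, 40, 42, 39, 41, 43]
is_high_spam = [False] * 5 + [True] * 5

scaled = min_max_normalize(metric)
low = [v for v, s in zip(scaled, is_high_spam) if not s]
high = [v for v, s in zip(scaled, is_high_spam) if s]

t_stat, p_value = welch_t_test(low, high)
significant = p_value < 0.05  # keep the metric only if the groups differ
```

In practice the same two steps would be repeated for each of the 21 metrics, and a statistics library with an exact t distribution would replace the approximation for small samples.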

Outlier Detection
After the statistical t-test that identifies significant metrics, a hybrid bio-inspired approach is used for detecting outlier influencer websites. Outlier detection is a popular approach for identifying data points that do not comply with the majority of the data set based on selected metrics. Several studies in the literature demonstrate various outlier detection approaches. An exhaustive list of outlier detection approaches, with a comparison of their motivation, advantages and disadvantages, is presented with a categorization into statistical models, neural networks, machine learning and hybrid systems [17]. Chandola et al. [18] further provide an exhaustive review of the techniques by grouping the existing studies into six main categories: classification, clustering, nearest neighbor, statistical, information theoretic and spectral. They further highlight the widespread applications of these approaches across domains encompassing cyber-intrusion detection [19], fraud detection [20], medical anomaly detection [21], image data [22], textual anomaly detection and sensor networks [23].
With the huge data influx, there are studies on outlier detection in high dimensional data [24,25,26]. However, these approaches are computationally intensive, often NP-hard, and may even lead to a locally optimum solution [27]. Since the data under consideration is huge and may also include unstructured textual data, there is a need for approaches that do not converge to a local optimum. Meta-heuristic approaches are known to help in reaching a globally optimum solution [28]. Further, bio-inspired algorithms have been among the most popular optimization techniques and mimic swarm behavior for optimization problems [16,29,30]. Tang et al. [31] thus integrate a few popular bio-inspired algorithms with K-means to avoid local convergence. This study therefore utilizes the integrated bio-inspired wolf search algorithm for outlier detection. We use the 2751 influencer websites, each comprising 14 attributes including KPIs, for identifying these outliers.
The wolf search algorithm (WSA) is one such optimization approach that is said to overcome local optima by imitating the preying behavior of wolves [31,32]. A similar hunting approach based on grey wolves, integrated with k-nearest neighbor, is used in the literature for detecting outliers [33]. In the current study, the number of clusters is set to 2, for normal and outlier data points. The wolf population is initialized with a visual distance and an escape probability. The initial centroids are assigned for the two clusters. The fitness of the centroids in each wolf is calculated and the best solution is identified. The preying behavior of the wolf is modeled by selecting the companion with the best solution within the visual distance. If the fitness of the companion is better than the wolf's own fitness, the companion is selected and approached. After the prey is hunted, the wolf randomly selects a position beyond the visual range and the process is repeated from the new location. The centroids with the best fitness are taken as the final solution.
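The steps above can be sketched in Python as a simplified wolf search over cluster centroids. This is an illustrative reading of the description, not the implementation used in the study or in [31]: the fitness function (sum of squared distances to the nearest centroid), the parameter values, the function names and the toy data are all assumptions.

```python
import random

def sse(points, centroids):
    """Fitness: sum of squared distances from each point to its nearest centroid."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

def wolf_distance(w1, w2):
    """Euclidean distance between two wolves (each wolf is a list of centroids)."""
    return sum((a - b) ** 2 for c1, c2 in zip(w1, w2)
               for a, b in zip(c1, c2)) ** 0.5

def wolf_search_centroids(points, k=2, wolves=8, visual=0.5, step=0.3,
                          escape_prob=0.25, iterations=60, seed=0):
    rng = random.Random(seed)
    # Each wolf carries k candidate centroids, initialised from random data points.
    pack = [[list(rng.choice(points)) for _ in range(k)] for _ in range(wolves)]
    fit = [sse(points, w) for w in pack]
    b = fit.index(min(fit))
    best, best_fit = [c[:] for c in pack[b]], fit[b]
    history = []
    for _ in range(iterations):
        for i in range(wolves):
            # Companions inside the visual distance that have a better fitness.
            better = [j for j in range(wolves) if j != i and fit[j] < fit[i]
                      and wolf_distance(pack[i], pack[j]) <= visual]
            if better:
                # Approach the best visible companion.
                j = min(better, key=lambda n: fit[n])
                cand = [[a + step * (t - a) for a, t in zip(c, tc)]
                        for c, tc in zip(pack[i], pack[j])]
            else:
                # No better companion in sight: prey randomly nearby.
                cand = [[a + visual * rng.uniform(-step, step) for a in c]
                        for c in pack[i]]
            cand_fit = sse(points, cand)
            if cand_fit < fit[i]:
                pack[i], fit[i] = cand, cand_fit
            if fit[i] < best_fit:
                best, best_fit = [c[:] for c in pack[i]], fit[i]
            # Escape: occasionally jump to a position beyond the visual range.
            if rng.random() < escape_prob:
                pack[i] = [[a + rng.choice((-1, 1)) * visual * rng.uniform(1, 2)
                            for a in c] for c in pack[i]]
                fit[i] = sse(points, pack[i])
        history.append(best_fit)  # best fitness found so far
    return best, history

# Hypothetical toy data: two groups of 2-D points in the 0-1 range.
data_rng = random.Random(42)
points = ([[data_rng.uniform(0.0, 0.2), data_rng.uniform(0.0, 0.2)] for _ in range(20)]
          + [[data_rng.uniform(0.8, 1.0), data_rng.uniform(0.8, 1.0)] for _ in range(20)])
centroids, history = wolf_search_centroids(points)
```

The best solution is stored separately from the pack, so an escape jump that lands in a worse region never discards the best centroids found so far.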
Further, the results are compared with the integrated bat algorithm (BA), which uses the echolocation behavior of bats to find prey and to differentiate between insects even in the dark [31,34]. The bat algorithm is one of the most popular algorithms used for several engineering, multi-objective and constrained optimization problems [35,36,37]. For the integrated bat approach, along with the two clusters, the bat population, frequency factor and loudness are initialized. The initial clusters are randomly assigned for the bat population. For each bat, the initial centroids are similarly identified. The fitness of the centroids is computed and the best solutions are identified. A new solution is then generated by adjusting the frequency and velocity. If a randomly generated number is greater than the defined pulse rate, a new solution is selected from among the best solutions of the bats. New solutions are accepted by adjusting the pulse rate and loudness for subsequent iterations: the pulse rate is increased and the loudness is decreased for the next iteration.
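A comparable Python sketch of the bat-guided centroid search follows. It uses the commonly published bat algorithm update rules (random frequency, velocity update toward the best solution, local walk around the best solution, loudness decay and pulse-rate growth) and is only an approximation of the procedure described here. All parameter values, the local-walk scale of 0.05, the function names and the toy data are assumptions.

```python
import math
import random

def split(flat, k):
    """Reshape a flat position vector into k centroids."""
    dim = len(flat) // k
    return [flat[j * dim:(j + 1) * dim] for j in range(k)]

def sse(points, centroids):
    """Fitness: sum of squared distances from each point to its nearest centroid."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in points)

def bat_centroids(points, k=2, bats=10, f_min=0.0, f_max=1.0, alpha=0.9,
                  gamma=0.9, r0=0.5, iterations=80, seed=0):
    rng = random.Random(seed)
    size = k * len(points[0])
    # Each bat position is k centroids flattened, drawn from random data points.
    pos = [[x for _ in range(k) for x in rng.choice(points)] for _ in range(bats)]
    vel = [[0.0] * size for _ in range(bats)]
    loud = [1.0] * bats   # loudness, decreased when a solution is accepted
    pulse = [0.0] * bats  # pulse rate, increased when a solution is accepted
    fit = [sse(points, split(p, k)) for p in pos]
    b = fit.index(min(fit))
    best, best_fit = pos[b][:], fit[b]
    history = []
    for t in range(1, iterations + 1):
        mean_loud = sum(loud) / bats
        for i in range(bats):
            # Adjust frequency and velocity, then move the bat.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - g) * freq for v, x, g in zip(vel[i], pos[i], best)]
            cand = [x + v for x, v in zip(pos[i], vel[i])]
            # Random number above the pulse rate: local walk around the best.
            if rng.random() > pulse[i]:
                cand = [g + 0.05 * rng.uniform(-1, 1) * mean_loud for g in best]
            cand_fit = sse(points, split(cand, k))
            # Accept improvements, then lower loudness and raise pulse rate.
            if rng.random() < loud[i] and cand_fit < fit[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                pulse[i] = r0 * (1 - math.exp(-gamma * t))
            if fit[i] < best_fit:
                best, best_fit = pos[i][:], fit[i]
        history.append(best_fit)  # best fitness found so far
    return split(best, k), history

# Hypothetical toy data: two groups of 2-D points in the 0-1 range.
data_rng = random.Random(42)
points = ([[data_rng.uniform(0.0, 0.2), data_rng.uniform(0.0, 0.2)] for _ in range(20)]
          + [[data_rng.uniform(0.8, 1.0), data_rng.uniform(0.8, 1.0)] for _ in range(20)])
centroids, history = bat_centroids(points)
```

Because every bat keeps both a position and a velocity and evaluates a fresh candidate each iteration, a sketch of this form does at least as many fitness evaluations per iteration as a comparable wolf search, though this says nothing definite about the runtimes reported in the study.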
Thus, the bio-inspired algorithms help in identifying the best cluster centroids over iterations. The formulation of centroids is iteratively guided by the search agents in the approaches described above. The dataset considered for this study requires only two clusters and has a total of 14 attributes for which the centroid values need to be computed.
Let $c_{j,v}$ denote the value of the centroid for the $j$-th cluster and the $v$-th attribute. Thus, the centroid is computed as the weighted mean of the data points,
$$c_{j,v} = \frac{\sum_{i} w_{i,j}\, x_{i,v}}{\sum_{i} w_{i,j}},$$
where $x_{i,v}$ is the value of the $v$-th attribute of data point $x_i$. The centroids largely depend on the weight $w_{i,j}$ that tells whether the data point belongs to the cluster or not: $w_{i,j} = 1$ if $x_i \in \text{cluster}_j$, and $w_{i,j} = 0$ otherwise. Once the best cluster centroids are identified for the two clusters of outliers and normal data points, a distance measure is subsequently used to segregate the outliers. The subsequent section presents the findings.
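As a small worked illustration, the Python sketch below computes centroids from 0/1 membership weights and then assigns each point to the nearest centroid. The Euclidean distance, the function names and the four toy points are assumptions, since the text does not specify which distance measure is used.

```python
def centroids_from_weights(points, weights):
    """Centroid value for cluster j and attribute v: the weighted mean of
    attribute v over all points, with weights[i][j] = 1 if point i is in
    cluster j and 0 otherwise."""
    k, dim = len(weights[0]), len(points[0])
    cents = []
    for j in range(k):
        members = sum(w[j] for w in weights)
        cents.append([sum(w[j] * p[v] for p, w in zip(points, weights)) / members
                      for v in range(dim)])
    return cents

def nearest_cluster(point, cents):
    """Index of the centroid closest to the point (Euclidean distance, assumed)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) ** 0.5 for c in cents]
    return dists.index(min(dists))

# Hypothetical toy data: two normal points and two outlier points.
points = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]]
weights = [[1, 0], [1, 0], [0, 1], [0, 1]]

cents = centroids_from_weights(points, weights)
labels = [nearest_cluster(p, cents) for p in points]
```

For these toy points the centroids are approximately (0.1, 0.0) and (0.9, 1.0), and each point is assigned back to the cluster it started in.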

Findings
K-means integrated with the WSA and BA algorithms has been used in this study for detecting outliers. The use of bio-inspired algorithms avoids locally optimum solutions. The study demonstrates the segregation of outlier influencer websites based on KPIs that have been extracted for a set of 2751 influencer websites using APIs. A total of 13 attributes are considered for detecting the outlier influencers for off-site web analytics. The spam score is excluded from the classification and is used for validation. Table 3 lists the cluster centers for the remaining 12 metrics, giving the cluster centroids for the authentic blogs (A) and outlier blogs (O) for both WSA and BA. A comparison of the two approaches shows that the bat algorithm achieves higher accuracy. Out of 2751 influencer websites, 1254 websites were identified as outliers based on their spam score and manual examination. The bat algorithm correctly identified 1218 of these, giving an accuracy of 97.12%, while the wolf search algorithm correctly identified 1203, with an accuracy of 95.93%. However, the time taken to converge to the optimum solution is 22.61 seconds for BA, while it is just 16.18 seconds for WSA. Fig. 1 shows the outlier plots for WSA and BA. Thus, the findings indicate that a large share (45.58%) of influencer websites are actually outliers. So many influencer websites are categorized as outliers because these blogs depend heavily on techniques such as article spinning, link farming and keyword stuffing for content building and subsequent promotion. They often pick up original content and spin or manipulate it by paraphrasing and including keywords related to the consumer domain to gain traction. These practices are often deemed unfit for digital marketing. However, the customers adopting these services are often not aware of such malpractices adopted by the websites. This has adverse effects on the consumer website in the long run and may even result in penalization. The use of KPIs in identifying such outlier influencers thus segregates these websites on the basis of publicly available metrics from several service providers.
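The reported percentages can be checked with a few lines of Python using only the counts stated above. Note that each accuracy figure is the share of the 1254 labelled outliers that the algorithm recovers; the variable names are illustrative.

```python
total_sites = 2751
true_outliers = 1254                    # labelled via spam score and manual examination
detected = {"BA": 1218, "WSA": 1203}    # outliers correctly identified by each algorithm

# Share of all influencer websites that are outliers, in percent.
outlier_share = 100 * true_outliers / total_sites

# Share of the labelled outliers recovered by each algorithm, in percent.
rates = {name: 100 * n / true_outliers for name, n in detected.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% of labelled outliers detected")
print(f"Outlier share: {outlier_share:.2f}%")
```

The computed values agree with the reported 97.12%, 95.93% and 45.58% to within rounding in the second decimal place.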

Conclusion
With increased internet use and online marketing opportunities, organizations have realized the importance of web visibility and have started leveraging the power of the internet to reach a larger audience for their products and services. This has opened new avenues for digital marketing, especially influencer marketing, where several portals have emerged to encourage influencers to build content for customer businesses. However, this process of content building generates a lot of spam content within these websites when done in bulk for a large consumer base, and often involves techniques such as article spinning and keyword stuffing for user traction. Such practices are not considered ethical under search engine guidelines and affect the consumers adversely. This study thus uses publicly available influencer website KPIs, a total of 13 attributes including Domain Authority, Page Authority, Moz Rank, Links In, External Equity Links, Spam Score, Alexa Rank, Citation Flow, Trust Flow, External Back Links, Referred Domains, SemRush URL Links and SemRush Hostname Links, for 2751 influencer websites. Further, K-means integrated bio-inspired computing techniques are used for detecting and segregating outliers from the extracted data. Findings indicate that such approaches overcome local optima problems and give globally optimum solutions for such NP-hard and computationally intensive problems. Further, the integrated bat algorithm gives better accuracy than the wolf search algorithm, as demonstrated in existing literature where the approach is used for clustering [31]. Our study re-establishes the same result for outlier detection on the web analytics data set under consideration by extending the proposed approach.

Implications and Future Scope
This study uses KPIs to segregate outlier influencer websites, which is beneficial for off-site web analytics. This may be useful for preventing consumer investment in spam influencers that may adversely affect the website's position on search engines in the long run. Apart from the KPIs, content-based analytics including keyword density, lexical diversity, meta information and topic modeling may also be incorporated in the analysis.
Future studies can be extended to use social media analytics for further validation of the results, since social media platforms are utilized by consumers for raising concerns regarding the services they use. These platforms, especially the Twitter and Facebook profiles of such influencer websites, provide a lot of information in the form of user generated content that may be integrated with the existing metrics to reinforce the findings. An empirical validation of the results can also be done using a structured questionnaire for consumers opting for such influencer marketing services, covering the short term and long term impact of these services on their visitors and web visibility. An existing study analyzing results suggested by search engines for market share establishment can also be extended to influencer marketing [38].

Table 2. Statistically significant metrics having a p-value less than 0.05

Table 3. Cluster centers for K-means integrated WSA and BA