The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing

David Zeber; Sarah Bird; Camila Oliveira; Walter Rudametkin; Ilana Segall; Fredrik Wollsén; Martin Lopatka

doi:10.1145/3366423.3380104

Conference paper, Year: 2020



David Zeber

Role: Author
PersonId : 1063690

Mozilla

Sarah Bird

Role: Author
PersonId : 1063691

Mozilla

Camila Oliveira

Role: Author

Mozilla

Walter Rudametkin

Role: Author
PersonId : 16377
IdHAL : wrudamet
ORCID : 0000-0003-2903-7600
IdRef : 169898180

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Self-adaptation for distributed services and large software systems

Ilana Segall

Role: Author
PersonId : 1063692

Mozilla

Fredrik Wollsén

Role: Author
PersonId : 1063693

Mozilla

Martin Lopatka

Role: Author
PersonId : 1062933

Mozilla

Abstract

Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and don't require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of user environments, and the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We quantify baseline variation of simultaneous crawls, then isolate the effects of time, cloud IP address vs. residential, and operating system. This provides a foundation to assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users on loading pages from the same domains.

Keywords

Web mining; Traffic analysis; Data extraction and integration; Web Crawling; Tracking; Online Privacy; Browser Fingerprinting; World Wide Web

Domains

Computer Science [cs] / Web

Main file

Jestr_vs_crawl_WWW2020.pdf (5.83 MB)

Origin: Files produced by the author(s)

Contributor: Vikas Mishra

https://inria.hal.science/hal-02456195

Submitted on: Monday, 27 January 2020, 11:05:54

Last modified on: Wednesday, 24 January 2024, 09:54:23

Long-term archiving on: Tuesday, 28 April 2020, 14:26:27

Dates and versions

hal-02456195, version 1 (27-01-2020)

Identifiers

HAL Id: hal-02456195, version 1
DOI : 10.1145/3366423.3380104

Cite

David Zeber, Sarah Bird, Camila Oliveira, Walter Rudametkin, Ilana Segall, et al. The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing. The Web Conference, Apr 2020, Taipei, Taiwan. ⟨10.1145/3366423.3380104⟩. ⟨hal-02456195⟩


Collections

CNRS INRIA CRISTAL INRIA2 CRISTAL-SPIRALS UNIV-LILLE ANR

387 views

405 downloads
