LiveRank: How to Refresh Old Crawls

Abstract: This paper considers the problem of refreshing a crawl. More precisely, given a collection of Web pages (with hyperlinks) gathered at some time, we want to identify a significant fraction of these pages that still exist at the present time. The liveness of an old page can be tested through an online query at the present time. We call LiveRank a ranking of the old pages such that pages that are still alive are more likely to appear first. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the alive pages when using the LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made, to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on PageRank can lead to efficient LiveRanks for Web graphs.
Keywords: PageRank, LiveRank, crawl
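
The static setting and the quality measure described in the abstract can be illustrated with a minimal sketch. It assumes that the LiveRank is simply the PageRank order computed on the old crawl, which is only one possible way of "building on PageRank" and not necessarily the method evaluated in the paper. The names `pagerank`, `static_liverank` and `queries_needed`, the damping factor, and the toy graph are all hypothetical; the liveness oracle stands in for the online query, and the true number of alive pages is assumed to be known for evaluation purposes only.

```python
import math


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for page, targets in graph.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def static_liverank(graph):
    """Assumed static LiveRank: old pages sorted by decreasing PageRank."""
    rank = pagerank(graph)
    return sorted(rank, key=lambda p: (-rank[p], p))


def queries_needed(order, is_alive, fraction):
    """Number of liveness queries, following `order`, needed to find
    the given fraction of all alive pages (evaluation only)."""
    total_alive = sum(1 for p in order if is_alive(p))
    target = math.ceil(fraction * total_alive)
    if target == 0:
        return 0
    found = 0
    for queries, page in enumerate(order, start=1):
        if is_alive(page):
            found += 1
        if found >= target:
            return queries
    return len(order)


# Toy old crawl (hypothetical data) and a stand-in liveness oracle.
old_crawl = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
alive_now = {"b", "c"}
order = static_liverank(old_crawl)
cost = queries_needed(order, alive_now.__contains__, fraction=1.0)
```

A lower `cost` for the same fraction means a better LiveRank. A dynamic variant would recompute the order after each query result instead of fixing it in advance.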
Document type: Conference paper
Algorithms and Models for the Web Graph - 11th International Workshop (WAW 2014), Dec 2014, Beijing, China. pp.148 - 160, 2014, 〈10.1007/978-3-319-13123-8_12〉

Cited literature: 14 references

https://hal.inria.fr/hal-01093188
Contributor: Laurent Viennot
Submitted on: Wednesday, December 10, 2014 - 12:08:39
Last modified on: Thursday, January 11, 2018 - 06:21:34
Document(s) archived on: Saturday, April 15, 2017 - 06:36:18

File

liverank2014waw.pdf
Files produced by the author(s)

Citation

The Dang Huynh, Fabien Mathieu, Laurent Viennot. LiveRank: How to Refresh Old Crawls. Algorithms and Models for the Web Graph - 11th International Workshop (WAW 2014), Dec 2014, Beijing, China. pp.148 - 160, 2014, 〈10.1007/978-3-319-13123-8_12〉. 〈hal-01093188〉

Metrics

Record views

152

File downloads

123