Deriving Dynamics of Web Pages: A Survey

Abstract: The World Wide Web is dynamic by nature: content is continuously added, deleted, or changed, which makes it challenging for Web crawlers to keep up to date with the current version of a Web page, all the more so since not all apparent changes are significant. We review the major approaches to change detection in Web pages and to the extraction of temporal properties (especially timestamps) of Web pages. We focus on techniques and systems proposed in the last ten years and analyze them to gain insight into the practical solutions and best practices available. We aim to provide an analytical view of the range of methods that can be used, distinguishing them along several dimensions: their static or dynamic nature, the modeling of Web pages, and, for dynamic methods that rely on comparing successive versions of a page, the similarity metrics used. We advocate more comprehensive studies of the effectiveness of Web page change detection methods, and finally highlight open issues.
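As an illustration of the "dynamic" family of methods mentioned in the abstract (comparing successive versions of a page with a similarity metric), here is a minimal sketch. It is not taken from the survey: the choice of Jaccard similarity over word shingles, the function names `shingles`, `jaccard`, and `has_changed`, and the threshold of 0.9 are all assumptions made for the example, and the inputs are assumed to be plain text already extracted from the page.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (case- and whitespace-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts: treat the whole word sequence as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def has_changed(old_text, new_text, threshold=0.9, k=3):
    """Flag a change when similarity drops below the (assumed) threshold."""
    return jaccard(shingles(old_text, k), shingles(new_text, k)) < threshold


if __name__ == "__main__":
    v1 = "The World Wide Web is dynamic by nature"
    v2 = "the  world wide web is   dynamic by nature"  # formatting-only difference
    v3 = "Weather forecast for the coming weekend is sunny"
    print(has_changed(v1, v2))  # formatting noise is ignored
    print(has_changed(v1, v3))  # unrelated content is flagged
```

Normalizing case and whitespace before comparison is one simple way to ignore apparent changes that are not significant; the threshold would need tuning on real page versions.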
Document type:
Conference paper
TWAW (Temporal Workshop on Web Archiving), Mar 2011, Hyderabad, India. 2011

https://hal.inria.fr/inria-00588715
Contributor: Marilena Oita
Submitted on: Tuesday, 26 April 2011 - 11:29:18
Last modified on: Tuesday, 26 April 2011 - 11:43:46
Document(s) archived on: Thursday, 8 November 2012 - 17:16:23

File

survey.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: inria-00588715, version 1

Citation

Marilena Oita, Pierre Senellart. Deriving Dynamics of Web Pages: A Survey. TWAW (Temporal Workshop on Web Archiving), Mar 2011, Hyderabad, India. 2011. 〈inria-00588715〉
