Archiving Data Objects using Web Feeds

Abstract: Web feeds, in either RSS or Atom XML-based formats, are evolving descriptive documents that characterize a dynamic hub of a Web site and help subscribers keep up with the most recent Web content of interest. In this paper, we show how Web feeds can serve as useful instruments for information extraction and Web page change detection. Web pages referenced by feed items are usually blog posts or news articles: data that is dynamic (and therefore ephemeral) in nature and that is clustered topically in a feed channel. We monitor Web channels and extract from the associated Web pages the text and references corresponding to Web articles. The result is enriched with the timestamp and additional metadata mined from the feed, and encapsulated in a 'data object'. In particular, the data object is stripped of all template elements and advertisements. These irrelevant elements, generically called boilerplate, not only cost the crawler time and space but also hinder the data analysis process. We first gather statistics on a set of Web feeds by crawling them for a period of time and observing their temporal aspects. We then present the algorithm used for article extraction, which uses the feed semantics (more specifically, the description and title of feed items) to identify the DOM node in the HTML page that contains the article. The data objects constructed in this way can be used as a semantic overlay collection for an archive, or in the context of an incremental crawl, which becomes more efficient when change is detected at the data object level. Experiments on the extraction technique validate our approach, with good results even in cases where other techniques fail. Finally, we discuss useful applications based on the extraction and change detection of Web objects.
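
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes an RSS 2.0 feed, and it substitutes a simple heuristic (pick the smallest DOM node whose text covers the most keywords from the feed item's title and description). All names below (`parse_feed_items`, `find_article_node`, `build_data_object`, and the `digest` field used for change detection) are hypothetical and introduced only for illustration.

```python
import hashlib
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TEXT = {"script", "style"}  # text inside these is never article text


class Node:
    """A very small DOM node: children and text chunks kept in order."""
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.content = []  # mix of str and Node

    def text(self):
        return " ".join(c if isinstance(c, str) else c.text()
                        for c in self.content)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].content.append(node)
        if tag not in VOID_TAGS:  # void elements never get an end tag
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack[-1].tag not in SKIP_TEXT and data.strip():
            self.stack[-1].content.append(data.strip())


def keywords(text):
    """Lowercased words longer than three characters (assumed threshold)."""
    return {w for w in re.findall(r"\w+", text.lower()) if len(w) > 3}


def iter_nodes(node):
    for child in node.content:
        if isinstance(child, Node):
            yield child
            yield from iter_nodes(child)


def parse_feed_items(rss_xml):
    """Read title, link, description and pubDate of each RSS 2.0 item."""
    fields = ("title", "link", "description", "pubDate")
    return [{f: (item.findtext(f) or "") for f in fields}
            for item in ET.fromstring(rss_xml).iter("item")]


def find_article_node(html, title, description):
    """Heuristic: most feed keywords covered, then shortest text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    plain_description = re.sub(r"<[^>]+>", " ", description)  # crude tag strip
    wanted = keywords(title + " " + plain_description)
    best, best_key = None, None
    for node in iter_nodes(builder.root):
        text = node.text()
        hits = len(wanted & keywords(text))
        key = (hits, -len(text))
        if hits and (best_key is None or key > best_key):
            best, best_key = node, key
    return best


def build_data_object(item, html):
    """Bundle the extracted text with feed metadata and a content digest."""
    node = find_article_node(html, item["title"], item["description"])
    text = " ".join(node.text().split()) if node else ""
    return {
        "url": item["link"],
        "title": item["title"],
        "timestamp": item["pubDate"],
        "text": text,
        "node_id": node.attrs.get("id") if node else None,
        "digest": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
```

Under these assumptions, two crawls of the same page can be compared through the `digest` of their data objects: a change confined to navigation or advertisements leaves the digest unchanged, whereas an edit to the article text alters it. A fuller version would also collect the references (links) found inside the selected node.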
Document type:
Conference paper
International Workshop on Web Archiving, Sep 2010, Vienna, Austria. 2010

Cited literature [24 references]

https://hal.inria.fr/inria-00537962
Contributor: Marilena Oita
Submitted on: Friday, 19 November 2010 - 17:35:25
Last modified on: Saturday, 3 March 2018 - 15:12:01
Document(s) archived on: Friday, 26 October 2012 - 16:10:42

File

iwawienna.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: inria-00537962, version 1

Citation

Marilena Oita, Pierre Senellart. Archiving Data Objects using Web Feeds. International Workshop on Web Archiving, Sep 2010, Vienna, Austria. 2010. 〈inria-00537962〉
