Conference papers

Archiving Data Objects using Web Feeds

Abstract : Web feeds, in either of the XML-based RSS or Atom formats, are evolving descriptive documents that characterize a dynamic hub of a Web site and help subscribers keep up with the most recent Web content of interest. In this paper, we show how Web feeds can be useful instruments for information extraction and Web page change detection. Web pages referenced by feed items are usually blog posts or news articles: data that is dynamic (and hence ephemeral) in nature and that is clustered topically in a feed channel. We monitor Web channels and extract from the associated Web pages the text and references corresponding to Web articles. The result is enriched with the timestamp and additional metadata mined from the feed, and encapsulated in a 'data object'. In particular, the data object is free of all template elements and advertisements. These irrelevant elements, generically called boilerplate, not only consume time and space from the crawler's point of view, but also hinder the data analysis process. We first gather statistics on a set of Web feeds by crawling them over a period of time and observing their temporal aspects. We then present the algorithm used for article extraction, which uses the feed semantics (more specifically, the description and title of feed items) to identify the DOM node in the HTML page that contains the article. The data objects constructed in this way can be used as a semantic overlay collection for an archive, or in the context of an incremental crawl, which they make more efficient by detecting change at the data object level. Experiments on the extraction technique validate our approach, with good results even in cases where other techniques fail. We finally discuss useful applications based on the extraction and change detection of Web objects.
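
The abstract does not spell out the extraction algorithm, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes that the feed item's title and description are available as plain text, and it picks the smallest DOM node whose text covers most of their words. It then fingerprints that node's text so that change can be checked at the data object level. The names `find_article_node`, `fingerprint` and `min_coverage`, as well as the sample page, are hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}


class Node:
    """A minimal DOM node; children are either Node instances or strings."""

    def __init__(self, tag, parent=None, attrs=None):
        self.tag, self.parent = tag, parent
        self.attrs, self.children = dict(attrs or []), []

    def text(self):
        if self.tag in SKIPPED_TAGS:
            return ""
        parts = (c if isinstance(c, str) else c.text() for c in self.children)
        return " ".join(" ".join(parts).split())  # normalize whitespace

    def elements(self):
        """Yield all descendant elements in document order."""
        for child in self.children:
            if isinstance(child, Node):
                yield child
                yield from child.elements()


class TreeBuilder(HTMLParser):
    """Builds a rough tree; it does not handle implicitly closed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, attrs)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray end tags
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def find_article_node(html, title, description, min_coverage=0.8):
    """Return the smallest node covering min_coverage of the feed keywords."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    keywords = words(title + " " + description)
    best = None
    for node in builder.root.elements():
        text = node.text()
        coverage = len(keywords & words(text)) / len(keywords) if keywords else 0
        if coverage >= min_coverage and (best is None or len(text) <= len(best.text())):
            best = node
    return best


def fingerprint(text):
    """Hash of the extracted text, used to detect change in the data object."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical page and feed item, for illustration only.
PAGE = """<html><body>
<div id="nav">Home News Archive Contact</div>
<div id="content"><h1>Feeds help archive the Web</h1>
<p>Web feeds point crawlers at fresh articles and carry useful metadata.</p></div>
<div id="ads">Buy cheap widgets now</div>
</body></html>"""
TITLE = "Feeds help archive the Web"
DESCRIPTION = "Web feeds point crawlers at fresh articles"

node = find_article_node(PAGE, TITLE, DESCRIPTION)
print(node.attrs.get("id"), fingerprint(node.text())[:12])
```

Because the fingerprint is computed on the extracted article text only, a change in the navigation or advertisement blocks leaves it untouched, whereas an edit to the article itself produces a new value. A real crawler would also need a more robust HTML parser and a way to handle feed descriptions that contain markup or truncated text.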
Document type :
Conference papers

Cited literature [24 references]

https://hal.inria.fr/inria-00537962
Contributor : Marilena Oita
Submitted on : Friday, November 19, 2010 - 5:35:25 PM
Last modification on : Friday, July 31, 2020 - 10:44:09 AM
Long-term archiving on : Friday, October 26, 2012 - 4:10:42 PM

File

iwawienna.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00537962, version 1

Citation

Marilena Oita, Pierre Senellart. Archiving Data Objects using Web Feeds. International Workshop on Web Archiving, Sep 2010, Vienna, Austria. ⟨inria-00537962⟩

Metrics

Record views

911

File downloads

403