Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages

Abstract : The World Wide Web is a distributed, heterogeneous and semi-structured information space. As the amount of available data grows, retrieving interesting information is becoming quite difficult, and classical search engines often give very poor results. The Web is changing very quickly, yet search engines mainly use old and well-known IR techniques. One of the main problems is the lack of explicit structure in HTML pages and, more generally, in Web sites. We show in this paper that it is possible to extract such a structure, whether explicit or implicit, from hypertext links between pages, implicit relations between pages, HTML tags describing structure, etc. We present some preliminary results of a Web sample analysis that extracts several levels of structure (a hierarchical tree structure, a graph-like structure).
Document type : Conference papers

Cited literature [31 references]

https://hal.inria.fr/hal-00953947
Contributor : Marie-Christine Fauvet
Submitted on : Friday, February 28, 2014 - 4:07:35 PM
Last modification on : Tuesday, December 8, 2020 - 10:42:46 AM
Long-term archiving on : Wednesday, May 28, 2014 - 5:20:20 PM

File

gery01a_WebDyn.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00953947, version 1

Citation

Mathias Géry, Jean-Pierre Chevallet. Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages. International Workshop on Web Dynamics, 2001. ⟨hal-00953947⟩

Metrics

Record views : 295
File downloads : 324