Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages - Inria (French National Institute for Research in Digital Science and Technology)
Conference paper, Year: 2001

Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages

Abstract

The World Wide Web is a distributed, heterogeneous and semi-structured information space. As the volume of available data grows, retrieving relevant information is becoming difficult, and classical search engines often give very poor results. The Web is changing very quickly, yet search engines still rely mainly on old, well-known IR techniques. One of the main problems is the lack of explicit structure in HTML pages and, more generally, in Web sites. We show in this paper that it is possible to extract such a structure, whether explicit or implicit, from several sources: hypertext links between pages, implicit relations between pages, HTML tags describing structure, etc. We present some preliminary results of an analysis of a Web sample, from which we extract several levels of structure (a hierarchical tree structure, a graph-like structure).
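The two kinds of structure named in the abstract can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not detail. It assumes, purely for illustration, that heading tags (h1 to h6) approximate a page's hierarchical tree and that `<a href>` links define the graph between pages. The names `StructureExtractor`, `outline_tree` and `link_graph` are hypothetical; only Python's standard `html.parser` and `urllib.parse` modules are used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StructureExtractor(HTMLParser):
    """Collects the heading outline and outgoing links of one HTML page."""

    # Assumption: heading tags are the only hierarchy signal considered here.
    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []       # absolute URLs of outgoing links
        self.headings = []    # (level, text) pairs in document order
        self._level = None    # level of the heading currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag in self.HEADINGS:
            self._level = self.HEADINGS[tag]
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def outline_tree(headings):
    """Nest (level, text) pairs into a tree of {'title', 'children'} nodes."""
    root = {"title": None, "children": []}
    stack = [(0, root)]
    for level, text in headings:
        # Climb back up until the top of the stack is a shallower heading.
        while stack[-1][0] >= level:
            stack.pop()
        node = {"title": text, "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root


def link_graph(pages):
    """Map each page URL to the sorted list of distinct URLs it links to."""
    graph = {}
    for url, html in pages.items():
        parser = StructureExtractor(url)
        parser.feed(html)
        parser.close()
        graph[url] = sorted(set(parser.links))
    return graph
```

Here `link_graph` takes a dictionary mapping page URLs to their HTML source and returns an adjacency list, while `outline_tree` turns the `headings` collected from a single page into a nested outline. Real pages often lack clean heading markup, which is the gap the abstract points to, so a practical system would need further signals beyond this sketch.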
Main file
gery01a_WebDyn.pdf (53.65 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-00953947, version 1 (28-02-2014)

Identifiers

  • HAL Id: hal-00953947, version 1

Cite

Mathias Géry, Jean-Pierre Chevallet. Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages. International Workshop on Web Dynamics, 2001, International Workshop on Web Dynamics. ⟨hal-00953947⟩
134 views
159 downloads
