Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages

Mathias Géry 1, Jean-Pierre Chevallet 2
2 MRIM - Modélisation et Recherche d’Information Multimédia [Grenoble]
LIG - Laboratoire d'Informatique de Grenoble, Inria - Institut National de Recherche en Informatique et en Automatique
Abstract: The World Wide Web is a distributed, heterogeneous, and semi-structured information space. As the volume of available data grows, retrieving interesting information is becoming quite difficult, and classical search engines often give very poor results. The Web is changing very quickly, yet search engines mainly rely on old and well-known IR techniques. One of the main problems is the lack of explicit HTML page structure and, more generally, the lack of explicit Web site structure. We show in this paper that it is possible to extract such a structure, which can be explicit or implicit: hypertext links between pages, implicit relations between pages, HTML tags describing structure, etc. We present some preliminary results of a Web sample analysis that extracts several levels of structure (a hierarchical tree structure, a graph-like structure).
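As a rough illustration of the two levels of structure mentioned in the abstract, the following minimal sketch collects heading tags (a tree-like outline) and hypertext link targets (edges of a graph-like structure) from a single HTML page. It is not the authors' method: the names `StructureExtractor` and `extract_structure` are hypothetical, and treating `h1` to `h6` as the structural tags is an assumption made only for this example.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Hypothetical helper: gathers headings and link targets from one page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}  # assumed structural tags

    def __init__(self):
        super().__init__()
        self.headings = []   # (level, text) pairs, in document order
        self.links = []      # href values of <a> tags, in document order
        self._level = None   # level of the heading currently open, if any
        self._buffer = []    # text fragments of the heading currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Only keep text that appears inside an open heading.
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            text = "".join(self._buffer).strip()
            self.headings.append((self._level, text))
            self._level = None


def extract_structure(html):
    """Return (headings, links) found in an HTML string."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.links
```

The heading levels could then be nested into an outline tree, and the collected link targets could be accumulated across pages to build a site graph; both steps are left out of this sketch.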
Document type:
Conference paper
International Workshop on Web Dynamics, 2001
Cited literature [31 references]

https://hal.inria.fr/hal-00953947
Contributor: Marie-Christine Fauvet
Submitted on: Friday, February 28, 2014 - 16:07:35
Last modified on: Thursday, January 11, 2018 - 06:22:06
Document(s) archived on: Wednesday, May 28, 2014 - 17:20:20

File

gery01a_WebDyn.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00953947, version 1

Citation

Mathias Géry, Jean-Pierre Chevallet. Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages. International Workshop on Web Dynamics, 2001. 〈hal-00953947〉

Metrics

Record views: 240

File downloads: 250