Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages

Mathias Géry; Jean-Pierre Chevallet

Conference paper, Year: 2001


Mathias Géry

Role: Author

Communication Langagière et Interaction Personne-Système

Jean-Pierre Chevallet

Role: Author
PersonId: 169702
IdHAL: jean-pierre-chevallet
ORCID: 0000-0002-5945-9444
IdRef: 088217116

Modélisation et Recherche d’Information Multimédia [Grenoble]

Abstract

The World Wide Web is a distributed, heterogeneous and semi-structured information space. As the volume of available data grows, retrieving relevant information is becoming increasingly difficult, and classical search engines often give very poor results. The Web is changing very quickly, yet search engines mainly rely on old and well-known IR techniques. One of the main problems is the lack of explicit structure in HTML pages and, more generally, in Web sites. We show in this paper that it is possible to extract such a structure, whether explicit or implicit, from sources such as hypertext links between pages, implicit relations between pages, and the HTML tags describing structure. We present some preliminary results of an analysis of a Web sample that extracts several levels of structure (a hierarchical tree structure, a graph-like structure).
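To make the idea of explicit page structure concrete, the sketch below pulls two of the signals named in the abstract out of a single HTML page: the heading tags (a rough hierarchical outline) and the hypertext links (edges of a graph between pages). It is only an illustration under stated assumptions, not the authors' extraction method: the names `StructureExtractor` and `extract_structure`, the sample page, and the `example.org` URL are all hypothetical, and it relies solely on Python's standard `html.parser` and `urllib.parse` modules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StructureExtractor(HTMLParser):
    """Hypothetical helper: collects a heading outline and outgoing links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outline = []    # (level, heading text) pairs, in document order
        self.links = []      # absolute URLs of outgoing hypertext links
        self._level = None   # level of the heading currently open, if any
        self._text = []      # text fragments of the heading currently open

    def handle_starttag(self, tag, attrs):
        # h1..h6 tags describe the hierarchical structure of the page.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []
        # <a href=...> tags give the graph-like structure between pages.
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


def extract_structure(html, base_url):
    """Return (outline, links) for one page; base_url resolves relative links."""
    parser = StructureExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.outline, parser.links


# Made-up sample page, used only to exercise the sketch.
page = """
<html><body>
<h1>Lab</h1>
<h2>Projects</h2>
<a href="projects/web.html">Web</a>
<h2>People</h2>
<a href="/people/">Staff</a>
</body></html>
"""
outline, links = extract_structure(page, "http://example.org/lab/index.html")
```

Here `outline` holds the heading levels in document order, from which a tree can be built, and `links` holds absolute target URLs that could serve as edges in a site graph. Implicit relations between pages, which the abstract also mentions, are not covered by this sketch.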

Keywords

Web Information Retrieval; Web Pages Analysis; Structure Extraction; Statistics

Domains

Information Retrieval [cs.IR]

Main file

gery01a_WebDyn.pdf (53.65 KB)

Origin: Files produced by the author(s)

Contributor: Marie-Christine Fauvet

https://inria.hal.science/hal-00953947

Submitted on: Friday, February 28, 2014, 16:07:35

Last modified on: Thursday, April 4, 2024, 21:04:45

Long-term archiving on: Wednesday, May 28, 2014, 17:20:20

Dates and versions

hal-00953947, version 1 (28-02-2014)

Identifiers

HAL Id: hal-00953947, version 1

Cite

Mathias Géry, Jean-Pierre Chevallet. Toward a Structured Information Retrieval System on the Web: Automatic Structure Extraction of Web Pages. International Workshop on Web Dynamics, 2001, International Workshop on Web Dynamics. ⟨hal-00953947⟩


Collections

UGA IMAG CNRS LIG LIG_TDCGE_MRIM LIG_SIDCH

