Towards automatic XML structure building for Web documents

Agnès Guerraz 1
1 WAM - Web, adaptation and multimedia
Inria Grenoble - Rhône-Alpes
Abstract: Web documents available through the Internet are frequently supplied simply as poorly written HTML or as plain text. Indeed, almost all of these Web documents are understandable only by humans and remain unusable by software. The power of Semantic Web tools and XML technologies can only be deployed on documents that have a minimum of formalism in their structure. This paper addresses the structuring process for Web documents that lack a real structure expressed through markup languages such as XML or through the definition of grammars for validating them. It deals with building structure in documents when the existing structure is insufficient or nonexistent. This subject is closely related to the problems of automatic creation of XML schemas or templates. Concretely, this work lies within the scope of XML documents and their problems, which stem from the fact that building and setting up their structure is time-consuming for the user. Using data mining techniques, structural information is captured, clarifying and returning the names and characteristics of structure elements, in particular their relationships, their constraints and their logical organization. This paper proposes a process that automatically computes structure elements (1) by applying data mining methods to documents, (2) by building structure components automatically, and (3) by automatically proposing XML transformations on the final structured document. Initially, this work will use the full range of schemas, from XML schemas to templates.
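
To make the idea of building structure for an unstructured document concrete, here is a minimal Python sketch. It is not the method of the report, which relies on data mining techniques not detailed in this abstract. The function name `build_structure`, the element names (`document`, `section`, `title`, `para`) and the title heuristic are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_structure(text: str) -> ET.Element:
    """Wrap plain text in a guessed <document>/<section>/<title>/<para> tree.

    Hypothetical illustration only: the element names and the heuristic
    below are assumptions, not taken from the report.
    """
    root = ET.Element("document")
    section = None
    # Blocks are runs of text separated by blank lines.
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    for block in blocks:
        # Assumed heuristic: a short single-line block that does not end
        # with a period is treated as a section title.
        is_title = "\n" not in block and len(block) < 60 and not block.endswith(".")
        if is_title or section is None:
            section = ET.SubElement(root, "section")
        if is_title:
            ET.SubElement(section, "title").text = block
        else:
            # Collapse internal line breaks so each paragraph is one line.
            ET.SubElement(section, "para").text = " ".join(block.split())
    return root


sample = (
    "Introduction\n\n"
    "Web documents are often plain text.\n\n"
    "Method\n\n"
    "Structure is inferred\nfrom layout cues."
)
tree = build_structure(sample)
print(ET.tostring(tree, encoding="unicode"))
```

A fixed rule like this only stands in for the step where structure elements are computed; in the process described above, that information would be captured from the documents themselves rather than hard-coded.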
https://hal.inria.fr/inria-00133649
Contributor: Rapport de Recherche Inria
Submitted on: Monday, 25 June 2007 - 14:39:42
Last modified on: Saturday, 17 September 2016 - 01:35:18
Document(s) archived on: Friday, 25 November 2016 - 15:14:14

File

RR-6147.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00133649, version 4

Citation

Agnès Guerraz. Towards automatic XML structure building for Web documents. [Research Report] RR-6147, INRIA. 2007, pp.8. <inria-00133649v4>
