
Towards automatic XML structure building for Web documents

Agnès Guerraz 1
1 WAM - Web, adaptation and multimedia
Inria Grenoble - Rhône-Alpes
Abstract : Web documents available through the Internet are frequently supplied simply as poorly written HTML or as plain text. Indeed, almost all of these Web documents are understandable only by humans and remain unexploitable by software. The power of Semantic Web tools and XML technologies can only be deployed on documents having a minimum of formalism in their structure. This paper addresses the structuring process for Web documents that lack a real structure expressed through markup languages such as XML or through grammars defined to validate them. It deals with building structure in documents when the existing structure is insufficient or nonexistent. This subject is closely related to the problems of automatic creation of XML schemas or templates. This work lies concretely within the scope of XML documents and their problems, related to the fact that building and setting up their structure is time-consuming for the user. Based on data mining techniques, structural information is captured, clarifying and returning the names and characteristics of structure elements, in particular their relationships, their constraints and their logical organization. This paper proposes a process that makes it possible to compute structure elements automatically (1) by applying data mining methods to documents, (2) by building structure components automatically, and (3) by automatically proposing XML transformations on the final structured document. Initially, this work will use the full range of schemas, from XML schemas to templates.
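As a rough illustration of the three steps named in the abstract, the sketch below turns plain text into a small XML tree and then counts the parent/child patterns that appear in it. It is not the method of the report: the title heuristic in `classify_line`, the element names `document`, `section`, `title` and `para`, and the helper `summarize_children` are all assumptions made for this example, and the pattern count is only a crude stand-in for real data mining.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def classify_line(line: str) -> str:
    """Hypothetical heuristic: a short line with no final punctuation is a title."""
    text = line.strip()
    if len(text.split()) <= 6 and not text.endswith((".", "!", "?", ":")):
        return "title"
    return "para"


def build_structure(plain_text: str) -> ET.Element:
    """Build an XML tree from plain text; element names are illustrative only."""
    root = ET.Element("document")
    section = None
    for line in plain_text.splitlines():
        if not line.strip():
            continue  # blank lines carry no content
        kind = classify_line(line)
        # A title opens a new section; so does the very first non-blank line.
        if kind == "title" or section is None:
            section = ET.SubElement(root, "section")
        ET.SubElement(section, kind).text = line.strip()
    return root


def summarize_children(root: ET.Element) -> Counter:
    """Count (parent tag, child tag sequence) patterns found in the tree."""
    patterns = Counter()
    for parent in root.iter():
        children = tuple(child.tag for child in parent)
        if children:
            patterns[(parent.tag, children)] += 1
    return patterns


sample = (
    "Introduction\n"
    "Web documents are often plain text.\n"
    "They lack structure.\n"
    "\n"
    "Method\n"
    "We mine structure from the text."
)
tree = build_structure(sample)
print(ET.tostring(tree, encoding="unicode"))
print(summarize_children(tree))
```

The counted patterns are the kind of information from which a schema or template could later be proposed; a real system would need far more robust evidence than line length and punctuation.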

Cited literature : 16 references

Contributor : Rapport de Recherche Inria
Submitted on : Monday, June 25, 2007 - 2:39:42 PM
Last modification on : Wednesday, April 6, 2022 - 3:48:37 PM
Long-term archiving on : Friday, November 25, 2016 - 3:14:14 PM




  • HAL Id : inria-00133649, version 4



Agnès Guerraz. Towards automatic XML structure building for Web documents. [Research Report] RR-6147, INRIA. 2007, pp.8. ⟨inria-00133649v4⟩


