Towards automatic XML structure building for Web documents

Agnès Guerraz 1
1 WAM - Web, adaptation and multimedia
Inria Grenoble - Rhône-Alpes
Abstract : Web documents available through the Internet are frequently supplied simply as poorly written HTML or as plain text. Indeed, almost all of these Web documents are understandable only by humans and remain unexploitable by software and computers. The power of Semantic Web tools and XML technologies can only be deployed on documents having a minimum of formalism in their structure. This paper addresses the structuring process for Web documents that have no real structure expressed through markup languages such as XML or through the definition of grammars for validating them. It deals with building structure in documents when the existing structure is insufficient or nonexistent. This subject is closely related to the problems of automatic creation of XML schemas or templates. This work lies concretely within the scope of XML documents and their problems, in particular the fact that building and setting up their structure is time-consuming for the user. Based on data mining techniques, structural information is captured, clarifying and returning the names and characteristics of structure elements, in particular their relationships, their constraints and their logical organization. This paper proposes a process that makes it possible to compute structure elements automatically (1) by applying data mining methods to documents, (2) by building structure components automatically, and (3) by automatically proposing XML transformations on the final structured document. Initially, this work will use the full range of schemas, from XML schemas to templates.
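
The three-step process outlined in the abstract can be illustrated with a minimal sketch. This is not the report's method: the heuristic in `classify_block`, the element names (`document`, `section`, `title`, `para`) and the function names are all hypothetical, chosen only to show how structure might be derived from plain text, assembled into XML, and summarized as a rough content model from which a schema or template could later be written.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def classify_block(block: str) -> str:
    """Assumed heuristic: one short line without final punctuation is a title."""
    lines = block.splitlines()
    is_short_line = len(lines) == 1 and len(block) <= 60
    if is_short_line and not block.rstrip().endswith((".", ":", "?", "!")):
        return "title"
    return "para"


def build_structure(text: str) -> ET.Element:
    """Turn blank-line-separated plain text into a small XML tree."""
    root = ET.Element("document")
    current = root  # paragraphs before any title attach to the root
    for block in (b.strip() for b in text.split("\n\n")):
        if not block:
            continue
        if classify_block(block) == "title":
            # Each title opens a new section directly under the root.
            current = ET.SubElement(root, "section")
            ET.SubElement(current, "title").text = block
        else:
            # Collapse internal line breaks so the paragraph is one string.
            ET.SubElement(current, "para").text = " ".join(block.split())
    return root


def infer_content_model(root: ET.Element) -> dict:
    """List, for each element name, the child element names observed."""
    model = defaultdict(set)
    for parent in root.iter():
        for child in parent:
            model[parent.tag].add(child.tag)
    return {tag: sorted(children) for tag, children in model.items()}


if __name__ == "__main__":
    sample = (
        "Introduction\n\n"
        "Web documents are often plain text.\nThey lack structure.\n\n"
        "Method\n\n"
        "We mine the text for structure."
    )
    tree = build_structure(sample)
    print(ET.tostring(tree, encoding="unicode"))
    print(infer_content_model(tree))
```

The observed content model is deliberately crude (a set of child names per parent); a real system would also need ordering, cardinality and constraints, which the abstract names as part of the captured structure information.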

https://hal.inria.fr/inria-00133649
Contributor : Rapport de Recherche Inria
Submitted on : Monday, June 25, 2007 - 2:39:42 PM
Last modification on : Wednesday, April 11, 2018 - 1:55:16 AM
Long-term archiving on : Friday, November 25, 2016 - 3:14:14 PM

File

RR-6147.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00133649, version 4

Citation

Agnès Guerraz. Towards automatic XML structure building for Web documents. [Research Report] RR-6147, INRIA. 2007, pp.8. ⟨inria-00133649v4⟩
