
Classification de documents XML à partir d'une représentation linéaire des arbres de ces documents [Classification of XML documents based on a linear representation of their trees]

Anne-Marie Vercoustre 1 Mounir Fegas 1 Yves Lechevallier 1 Thierry Despeyroux 1
1 AxIS - Usage-centered design, analysis and improvement of information systems
CRISAM - Inria Sophia Antipolis - Méditerranée , Inria Paris-Rocquencourt
Abstract : In this work, we propose a new document representation for clustering collections of semi-structured documents. Our approach represents XML documents by their sub-paths, defined according to some criteria (length, starting at the root, ending at a leaf) and built from the structure alone or from both the structure and the content. By treating these sub-paths as words, we can apply standard methods for vocabulary reduction and simple clustering methods such as K-means, which scale well. We use an implementation of the clustering algorithm known as dynamic clouds, which can work with distinct groups of independent variables. This is necessary in our model because embedded sub-paths are not independent. To validate and evaluate our method, we use two collections, the INEX corpus and the INRIA activity reports, together with a set of metrics well known in Information Retrieval.
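The idea of treating sub-paths as words can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the structure only, it selects sub-paths by length alone, and the function names (`root_to_leaf_paths`, `sub_paths`, `bag_of_sub_paths`) and the `min_len`/`max_len` parameters are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def root_to_leaf_paths(root):
    """Return every root-to-leaf tag path of an XML tree as a list of tags."""
    paths = []

    def walk(node, prefix):
        current = prefix + [node.tag]
        children = list(node)
        if not children:
            # A node without child elements is a leaf: record the full path.
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(root, [])
    return paths


def sub_paths(path, min_len=1, max_len=3):
    """Yield contiguous sub-paths of `path` whose length is in [min_len, max_len]."""
    for length in range(min_len, max_len + 1):
        for start in range(len(path) - length + 1):
            yield "/".join(path[start:start + length])


def bag_of_sub_paths(xml_text, min_len=1, max_len=3):
    """Count sub-paths over all root-to-leaf paths (a 'bag of words' of paths).

    Assumption: a sub-path shared by several root-to-leaf paths is counted
    once per path that contains it. Other counting schemes are possible.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for path in root_to_leaf_paths(root):
        counts.update(sub_paths(path, min_len, max_len))
    return counts


doc = "<article><title/><body><sec><p/><p/></sec></body></article>"
bag = bag_of_sub_paths(doc, min_len=2, max_len=2)
for sub_path, count in sorted(bag.items()):
    print(sub_path, count)
```

Each document then becomes a vector of sub-path counts, which is the kind of input that vocabulary reduction and a K-means style algorithm can consume. Because a sub-path of length 3 contains sub-paths of length 2, the resulting variables are not independent, which is the issue the abstract raises about embedded sub-paths.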
Document type : Conference papers

Cited literature : 19 references

https://hal.inria.fr/inria-00000840
Contributor : Anne-Marie Vercoustre
Submitted on : Wednesday, November 23, 2005 - 3:58:38 PM
Last modification on : Thursday, March 5, 2020 - 4:52:54 PM
Long-term archiving on : Friday, April 2, 2010 - 10:44:11 PM

Identifiers

  • HAL Id : inria-00000840, version 1

Citation

Anne-Marie Vercoustre, Mounir Fegas, Yves Lechevallier, Thierry Despeyroux. Classification de documents XML à partir d'une représentation linéaire des arbres de ces documents. Actes des 6ème journées Extraction et Gestion des Connaissances (EGC 2006), Revue des Nouvelles Technologies de l'Information (RNTI-E-3), Jan 2006, Paris, France. ⟨inria-00000840⟩

Metrics

Record views : 336
File downloads : 888