Conference papers

Classification de documents XML à partir d'une représentation linéaire des arbres de ces documents

Anne-Marie Vercoustre 1 Mounir Fegas 1 Yves Lechevallier 1 Thierry Despeyroux 1
1 AxIS - Usage-centered design, analysis and improvement of information systems
CRISAM - Inria Sophia Antipolis - Méditerranée , Inria Paris-Rocquencourt
Abstract : In this work, we propose a new document representation for clustering collections of semi-structured documents. Our approach represents each XML document by its sub-paths, selected according to criteria such as length, starting at the root, or ending at a leaf, and built from the structure alone or from both the structure and the content. By treating these sub-paths as words, we can apply standard vocabulary-reduction methods and simple clustering methods, such as K-means, that scale well. In practice we use an implementation of the clustering algorithm known as dynamic clouds, which can work with distinct groups of independent variables. This is necessary in our model because embedded sub-paths are not independent. To validate and evaluate the method, we use two collections, the INEX corpus and the INRIA activity reports, together with a set of metrics that are well known in Information Retrieval.
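The sub-path representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bag_of_sub_paths`, the parameters `min_len` and `max_len`, and the sample document are all assumptions made for illustration. The sketch uses structure only (tag names), applies only the length criterion, and counts each sub-path once per occurrence in the tree.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def bag_of_sub_paths(xml_text, min_len=1, max_len=3):
    """Count tag sub-paths of an XML document (hypothetical helper).

    A sub-path is a downward chain of element tags, joined with "/",
    whose length lies between min_len and max_len. Each occurrence is
    counted once, at the element where the sub-path ends.
    """
    root = ET.fromstring(xml_text)
    bag = Counter()

    def visit(elem, ancestors):
        path = ancestors + (elem.tag,)
        # Emit every sub-path that ends at this element.
        for k in range(min_len, min(max_len, len(path)) + 1):
            bag["/".join(path[-k:])] += 1
        for child in elem:
            visit(child, path)

    visit(root, ())
    return bag


# Assumed sample document, for illustration only.
doc = "<article><sec><title>A</title><p>x</p></sec><sec><p>y</p></sec></article>"
bag = bag_of_sub_paths(doc)
print(bag["sec/p"])  # 2
```

Each sub-path then plays the role of a word, so the resulting counts could be turned into document vectors and passed to a clustering method such as K-means. Note that nested sub-paths such as `sec/p` and `article/sec/p` are correlated, which is the dependence issue the abstract mentions.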
Document type :
Conference papers

Cited literature [19 references]

https://hal.inria.fr/inria-00000840
Contributor : Anne-Marie Vercoustre
Submitted on : Wednesday, November 23, 2005 - 3:58:38 PM
Last modification on : Wednesday, April 6, 2022 - 3:48:34 PM
Long-term archiving on : Friday, April 2, 2010 - 10:44:11 PM

Identifiers

  • HAL Id : inria-00000840, version 1

Citation

Anne-Marie Vercoustre, Mounir Fegas, Yves Lechevallier, Thierry Despeyroux. Classification de documents XML à partir d'une représentation linéaire des arbres de ces documents. Actes des 6ème journées Extraction et Gestion des Connaissances (EGC 2006), Revue des Nouvelles Technologies de l'Information (RNTI-E-3), Jan 2006, Paris, France. ⟨inria-00000840⟩

Metrics

  • Record views : 139
  • File downloads : 546