A Flexible Structured-based Representation for XML Document Mining

Anne-Marie Vercoustre 1 Mounir Fegas 1 Saba Gul 1 Yves Lechevallier 1
1 AxIS - Usage-centered design, analysis and improvement of information systems
CRISAM - Inria Sophia Antipolis - Méditerranée, Inria Paris-Rocquencourt
Abstract: This paper reports on the INRIA group's approach to XML mining while participating in the INEX XML Mining track 2005. We use a flexible representation of XML documents that can take into account either the structure only or both the structure and the content. Our approach represents each XML document by a set of its sub-paths, selected according to some criteria (length, starting at the root, ending at a leaf). By treating those sub-paths as words, we can use standard methods for vocabulary reduction and simple clustering methods such as K-means that scale well. In practice, we use an implementation of the clustering algorithm known as "dynamic clouds", which can work with distinct groups of independent variables placed in separate variables. This is useful in our model because embedded sub-paths are not independent: we split potentially dependent paths into separate variables, so that each variable contains only independent paths. Experiments with the INEX collections show good results for the structure-only collections, but our approach did not scale well to large structure-and-content collections.
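The sub-path representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sub_path_counts`, `to_vectors`), the `/` separator, and the parameter names (`min_len`, `max_len`, `from_root`, `to_leaf`) are assumptions chosen for illustration, and the sketch covers only the structure-only case. Vocabulary reduction and the "dynamic clouds" clustering step are not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy document, used only for illustration.
DOC = "<article><title/><sec><p/><p/></sec></article>"


def sub_path_counts(xml_text, min_len=1, max_len=3, from_root=False, to_leaf=False):
    """Count tag sub-paths of an XML document, treating each sub-path as a 'word'.

    A sub-path is a contiguous run of tags along a root-to-node path.
    from_root keeps only sub-paths that start at the document root;
    to_leaf keeps only sub-paths that end at a leaf element.
    """
    counts = Counter()

    def visit(elem, ancestors):
        chain = ancestors + (elem.tag,)      # tags from the root down to elem
        is_leaf = len(elem) == 0             # no child elements
        if is_leaf or not to_leaf:
            # Every sub-path ending at elem is a suffix of the chain.
            for length in range(min_len, min(max_len, len(chain)) + 1):
                if from_root and length != len(chain):
                    continue                 # this suffix does not reach the root
                counts["/".join(chain[-length:])] += 1
        for child in elem:
            visit(child, chain)

    visit(ET.fromstring(xml_text), ())
    return counts


def to_vectors(count_list):
    """Turn per-document sub-path counts into vectors over a shared vocabulary."""
    vocab = sorted(set().union(*count_list))
    return vocab, [[counts[word] for word in vocab] for counts in count_list]


if __name__ == "__main__":
    loose = sub_path_counts(DOC, min_len=1, max_len=2)
    strict = sub_path_counts(DOC, from_root=True, to_leaf=True)
    vocab, vectors = to_vectors([loose, strict])
    for word, a, b in zip(vocab, *vectors):
        print(f"{word:15s} {a} {b}")
```

The resulting count vectors could then be passed to any standard clustering routine such as K-means. Grouping sub-paths by length into separate blocks of variables would be one way to approximate the split between dependent paths that the abstract mentions, but that grouping is an assumption here, not something the abstract specifies.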
Document type:
Conference paper
Norbert Fuhr, Mounia Lalmas, Saadia Malik, Gabriella Kazai. The Fourth International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2005), Nov 2005, Schloss Dagstuhl, Germany, Springer, Volume 3977 / 2006/3-540-34962-6 (3-540-34962-6), pp. 443 - 457, 2005, Lecture Notes in Computer Science. 〈10.1007/11766278_34〉

https://hal.inria.fr/inria-00000839
Contributor: Anne-Marie Vercoustre
Submitted on: Wednesday, 5 July 2006 - 12:08:44
Last modified on: Thursday, 11 January 2018 - 16:25:46
Document(s) archived on: Monday, 20 September 2010 - 16:18:58

Citation

Anne-Marie Vercoustre, Mounir Fegas, Saba Gul, Yves Lechevallier. A Flexible Structured-based Representation for XML Document Mining. Norbert Fuhr, Mounia Lalmas, Saadia Malik, Gabriella Kazai. The Fourth International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2005), Nov 2005, Schloss Dagstuhl, Germany, Springer, Volume 3977 / 2006/3-540-34962-6 (3-540-34962-6), pp. 443 - 457, 2005, Lecture Notes in Computer Science. 〈10.1007/11766278_34〉. 〈inria-00000839v2〉
