Graphes des liens et anti-liens statistiquement valides entre les mots d'un corpus textuel (Graphs of statistically valid links and anti-links between the words of a text corpus)

Alain Lelu 1, 2 Martine Cadot 3
2 KIWI - Knowledge Information and Web Intelligence
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
3 ABC - Machine Learning and Computational Biology
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: Neighborhood is a central concept in data mining, and many definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects vs. attributes binary table in order to establish which inter-attribute relations are fortuitous and which are meaningful, without any hypothesis on the underlying statistical distributions, but taking into account the empirical distributions. The result is a robust and statistically validated graph. An encouraging earlier small-scale test led us to scale up the different phases of the process, making it possible to test it on one of the publicly available Reuters test corpora. We then characterized the resulting word graph with a series of well-known indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative "counter-relations" between words, i.e. words which "steer clear" of one another. We characterize the counter-relations graph in the same way.
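The abstract does not spell out the TourneBool procedure, so the following is only a minimal sketch of the general idea, assuming a generic margin-preserving swap randomization of the binary table. The function names (`swap_randomize`, `classify_pair`), the number of swaps, the number of samples and the threshold `alpha` are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def swap_randomize(matrix, n_swaps, rng):
    """Return a shuffled copy of a 0/1 matrix with unchanged row and column sums."""
    m = matrix.copy()
    n_rows, n_cols = m.shape
    for _ in range(n_swaps):
        r1, r2 = rng.integers(0, n_rows, size=2)
        c1, c2 = rng.integers(0, n_cols, size=2)
        # A swap is only allowed on a "checkerboard" 2x2 submatrix,
        # i.e. [[1, 0], [0, 1]] or [[0, 1], [1, 0]].
        if (m[r1, c1] == m[r2, c2] and m[r1, c2] == m[r2, c1]
                and m[r1, c1] != m[r1, c2]):
            m[r1, c1], m[r1, c2] = m[r1, c2], m[r1, c1]
            m[r2, c1], m[r2, c2] = m[r2, c2], m[r2, c1]
    return m

def classify_pair(matrix, j, k, n_samples=200, n_swaps=None, alpha=0.05, seed=0):
    """Label the relation between attributes j and k (illustrative thresholds)."""
    rng = np.random.default_rng(seed)
    if n_swaps is None:
        # Assumption: ten swap attempts per 1 in the table between samples.
        n_swaps = 10 * int(matrix.sum())
    observed = int(matrix[:, j] @ matrix[:, k])  # observed co-occurrence count
    simulated = np.empty(n_samples, dtype=int)
    current = matrix
    for s in range(n_samples):
        current = swap_randomize(current, n_swaps, rng)
        simulated[s] = current[:, j] @ current[:, k]
    # Empirical one-sided p-values, with the usual +1 correction.
    p_high = (np.sum(simulated >= observed) + 1) / (n_samples + 1)
    p_low = (np.sum(simulated <= observed) + 1) / (n_samples + 1)
    if p_high <= alpha / 2:
        return "link"
    if p_low <= alpha / 2:
        return "anti-link"
    return "not significant"
```

In such a sketch, keeping every pair labeled "link" as an edge would give a word graph, and keeping every pair labeled "anti-link" would give a counter-relations graph; a real run over a whole vocabulary would also need to account for the large number of pairs tested.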
Document type:
Conference paper
Jean-Gabriel Ganascia. Extraction et gestion de connaissance 2009 (EGC'09), Jan 2009, Strasbourg, France. Cépaduès éditions, E-15, pp.367-378, 2009, Revue des Nouvelles Technologies de l'Information

https://hal.inria.fr/inria-00342751
Contributor: Alain Lelu
Submitted on: Friday, 28 November 2008 - 13:48:30
Last modified on: Tuesday, 24 April 2018 - 13:36:12

Identifiers

  • HAL Id : inria-00342751, version 1

Citation

Alain Lelu, Martine Cadot. Graphes des liens et anti-liens statistiquement valides entre les mots d'un corpus textuel. Jean-Gabriel Ganascia. Extraction et gestion de connaissance 2009 (EGC'09), Jan 2009, Strasbourg, France. Cépaduès éditions, E-15, pp.367-378, 2009, Revue des Nouvelles Technologies de l'Information. 〈inria-00342751〉
