Conference papers

Graphes des liens et anti-liens statistiquement valides entre les mots d'un corpus textuel [Graphs of statistically valid links and anti-links between the words of a text corpus]

Alain Lelu 1, 2 Martine Cadot 3
2 KIWI - Knowledge Information and Web Intelligence
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
3 ABC - Machine Learning and Computational Biology
LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : Neighborhood is a central concept in data mining, and many definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects vs. attributes binary table in order to establish which inter-attribute relations are fortuitous and which are meaningful, without any hypothesis on the underlying statistical distributions, while taking the empirical distributions into account. The result is a robust and statistically validated graph. An encouraging earlier small-scale test led us to scale up the different phases of the process, making it possible to test it on one of the public-access Reuters test corpora. We then characterized the resulting word graph with a series of well-known indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative "counter-relations" between words, i.e. words which "steer clear" of one another. We characterize the counter-relations graph in the same way.
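The idea of a randomization test on a binary objects vs. attributes table can be illustrated with a small sketch. The abstract does not spell out how TourneBool randomizes the table, so the code below assumes margin-preserving "checkerboard" swaps, a common choice for this kind of test; the function names (`swap_randomize`, `cooccurrence`, `link_test`) and the parameters `n_sim`, `n_swaps` and `alpha` are hypothetical and are not taken from the paper.

```python
import random

def swap_randomize(table, n_swaps, rng):
    """Return a shuffled copy of a 0/1 table (assumed randomization scheme).

    Each accepted swap flips a 2x2 'checkerboard' sub-table, which leaves
    every row sum and every column sum unchanged.
    """
    t = [row[:] for row in table]
    n_rows, n_cols = len(t), len(t[0])
    for _ in range(n_swaps):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # Flip only patterns [[1, 0], [0, 1]] or [[0, 1], [1, 0]].
        if (t[r1][c1] == t[r2][c2] and t[r1][c2] == t[r2][c1]
                and t[r1][c1] != t[r1][c2]):
            t[r1][c1], t[r1][c2] = t[r1][c2], t[r1][c1]
            t[r2][c1], t[r2][c2] = t[r2][c2], t[r2][c1]
    return t

def cooccurrence(table, a, b):
    """Number of objects (rows) in which attributes a and b are both present."""
    return sum(row[a] * row[b] for row in table)

def link_test(table, a, b, n_sim=200, n_swaps=None, alpha=0.05, seed=0):
    """Classify the pair (a, b) as 'link', 'anti-link' or 'none'.

    The observed co-occurrence count is compared with the counts obtained
    on n_sim randomized tables sharing the same margins.
    """
    rng = random.Random(seed)
    if n_swaps is None:
        # Hypothetical default: a few swap attempts per 1 in the table.
        n_swaps = 5 * sum(map(sum, table))
    observed = cooccurrence(table, a, b)
    sims = [cooccurrence(swap_randomize(table, n_swaps, rng), a, b)
            for _ in range(n_sim)]
    p_high = sum(s >= observed for s in sims) / n_sim  # unusually frequent?
    p_low = sum(s <= observed for s in sims) / n_sim   # unusually rare?
    if p_high <= alpha:
        return "link"
    if p_low <= alpha:
        return "anti-link"
    return "none"
```

Under these assumptions, running `link_test` over every pair of attributes (words) and keeping the pairs labeled "link" would give the edges of a word graph, while the pairs labeled "anti-link" would give the edges of a counter-relations graph.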

https://hal.inria.fr/inria-00342751
Contributor : Alain Lelu
Submitted on : Friday, November 28, 2008 - 1:48:30 PM
Last modification on : Thursday, January 20, 2022 - 3:42:18 AM

Identifiers

  • HAL Id : inria-00342751, version 1

Citation

Alain Lelu, Martine Cadot. Graphes des liens et anti-liens statistiquement valides entre les mots d'un corpus textuel. Extraction et gestion de connaissance 2009 (EGC'09), Pierre Gançarski, Jan 2009, Strasbourg, France. pp.367-378. ⟨inria-00342751⟩
