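Un modèle segmental probabiliste combinant cohésion lexicale et rupture lexicale pour la segmentation thématique [A probabilistic segmental model combining lexical cohesion and lexical disruption for topic segmentation]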

Anca-Roxana Simon 1, Guillaume Gravier 1, *, Pascale Sébillot 1
* Corresponding author
1 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : Identifying the topical structure of text-like data is a challenging task. Most existing techniques rely either on maximizing a measure of lexical cohesion or on detecting lexical disruptions. This paper proposes a novel method that combines the two criteria to obtain the best trade-off between cohesion and disruption. A new statistical model is defined, based on the work of Isahara and Utiyama (2001), which retains the latter's domain independence and limited reliance on prior knowledge. Evaluations are performed both on written texts and on automatic transcripts of TV shows; the transcripts do not follow the norms of written text, which makes the task harder. Experimental results demonstrate the relevance of combining lexical cohesion and disruption.
Document type :
Conference papers

https://hal.inria.fr/hal-00844112
Contributor : Patrick Gros
Submitted on : Friday, July 12, 2013 - 6:16:53 PM
Last modification on : Friday, November 16, 2018 - 1:21:56 AM

Identifiers

  • HAL Id : hal-00844112, version 1

Citation

Anca-Roxana Simon, Guillaume Gravier, Pascale Sébillot. Un modèle segmental probabiliste combinant cohésion lexicale et rupture lexicale pour la segmentation thématique. TALN - Conférence sur le traitement automatique des langues naturelles, ATALA, Jun 2013, Les Sables d'Olonne, France. ⟨hal-00844112⟩
