Enhancing lexical cohesion measure with confidence measures, semantic relations and language model interpolation for multimedia spoken content topic segmentation

Camille Guinaudeau (1), Guillaume Gravier (1), Pascale Sébillot (1)
(1) TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract: Transcript-based topic segmentation of TV programs faces several difficulties: transcription errors, potentially short segments, and too few word repetitions to enforce lexical cohesion, i.e., the lexical relations that give a text its unity. To overcome these problems, we extend a probabilistic measure of lexical cohesion based on generalized probabilities with a unigram language model. On the one hand, confidence measures and semantic relations are considered as additional sources of information. On the other hand, language model interpolation techniques are investigated for better language model estimation. Experimental topic segmentation results are presented on two corpora with distinct characteristics, composed respectively of broadcast news and reports on current affairs. Significant improvements are obtained on both corpora, demonstrating the effectiveness of the extended lexical cohesion measure for spoken TV content as well as its genericity across different programs.
Document type: Journal article

Cited literature: 32 references

https://hal.archives-ouvertes.fr/hal-00645705
Contributor: Guillaume Gravier
Submitted on: Wednesday, November 30, 2011 - 1:58:27 PM
Last modification on: Friday, November 16, 2018 - 1:29:18 AM
Long-term archiving on: Thursday, March 1, 2012 - 2:20:36 AM

File

guinaudeau.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00645705, version 1

Citation

Camille Guinaudeau, Guillaume Gravier, Pascale Sébillot. Enhancing lexical cohesion measure with confidence measures, semantic relations and language model interpolation for multimedia spoken content topic segmentation. Computer Speech and Language, Elsevier, 2012, 26 (2), pp.90-104. ⟨hal-00645705⟩
