Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora

Nelly Barbot 1 Olivier Boëffard 1 Arnaud Delhay 1
1 CORDIAL - Human-machine spoken dialogue
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, INRIA Rennes, ENSSAT - École Nationale Supérieure des Sciences Appliquées et de Technologie
Abstract: Set covering algorithms are efficient tools for optimally reducing a linguistic corpus. The optimality of such a process is directly related to the descriptive features of the sentences of a reference corpus. This article experimentally examines the behaviour of three algorithms: a greedy approach and a Lagrangian relaxation based one, giving importance to rare events, and a third one that considers the Kullback-Leibler divergence between a reference distribution and the ongoing distribution of events. The analysis of the content of the reduced corpora shows that the first two approaches remain the most effective at compressing a corpus while guaranteeing a minimal content. The variant that minimises the Kullback-Leibler divergence guarantees, as expected, a distribution of events close to the reference distribution; however, the price for this solution is a much larger corpus. In the proposed experiments, we have also evaluated a mixed approach that adds a random complement to the smallest coverings.
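As a rough illustration of the greedy strategy mentioned in the abstract, the sketch below repeatedly picks the sentence that covers the largest number of still-uncovered units. It is not the authors' implementation: the function name `greedy_cover`, the toy corpus, and the unit labels are all hypothetical, each unit only needs to be covered once, and no weighting of rare events is applied.

```python
def greedy_cover(sentences):
    """Return sentence ids that together cover every unit at least once.

    `sentences` maps a sentence id to the set of units it contains
    (hypothetical representation, chosen for this sketch only).
    """
    # Every unit seen in the reference corpus must end up covered.
    uncovered = set().union(*sentences.values())
    selected = []
    while uncovered:
        # Pick the sentence bringing the most uncovered units;
        # ties go to the first sentence in dictionary order.
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        selected.append(best)
        uncovered -= sentences[best]
    return selected


# Toy corpus: four sentences described by the units they contain.
corpus = {
    "s1": {"a", "b"},
    "s2": {"b", "c", "d"},
    "s3": {"d"},
    "s4": {"a"},
}
reduced = greedy_cover(corpus)
print(reduced)
```

On this toy corpus two of the four sentences are enough to cover all units. A greedy choice of this kind gives no guarantee of finding the smallest possible covering, which is one reason alternative strategies are worth comparing.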
Document type:
Conference paper
International Conference on Language Resources and Evaluation (LREC'12), May 2012, Istanbul, Turkey. 2012
https://hal.inria.fr/hal-00784377
Contributor: Expression Irisa
Submitted on: Monday, 4 February 2013 - 12:10:52
Last modified on: Friday, 16 November 2018 - 01:30:34

Identifiers

  • HAL Id : hal-00784377, version 1

Citation

Nelly Barbot, Olivier Boëffard, Arnaud Delhay. Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora. International Conference on Language Resources and Evaluation (LREC'12), May 2012, Istanbul, Turkey. 2012. 〈hal-00784377〉
