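Regroupement des occurrences des mots hors-vocabulaire répétés en vue de leur modélisation pour la transcription d'émissions radio (Clustering the occurrences of repeated out-of-vocabulary words with a view to modeling them for the transcription of radio broadcasts)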

Frederik Stouten (1), Irina Illina (1), Dominique Fohr (1)
(1) PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: This paper describes a novel technique for clustering out-of-vocabulary (OOV) word tokens in a large-vocabulary continuous speech recognition (LVCSR) system used to transcribe broadcast news speech. The system is composed of two blocks: (1) an OOV word detector and (2) a clustering module that operates on the detected OOV word segments. This combination detects repeated OOV words more reliably than the OOV detector alone. The paper focuses on the second block, the clustering algorithm, which is based on entropy estimation. The proposed algorithm performs better than a classical incremental clustering algorithm based on a distance threshold.
Keywords: LVCSR, OOV, clustering
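
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of an incremental clustering scheme driven by a distance threshold. It is not the authors' implementation, and it does not show the entropy-based method. Several choices are assumptions made only for illustration: each detected OOV token is represented as a phoneme sequence, the distance is the Levenshtein edit distance, and each cluster is represented by its first member. The names `edit_distance` and `incremental_clustering` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two symbol sequences (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def incremental_clustering(tokens: List[Sequence[str]],
                           threshold: int) -> List[List[Sequence[str]]]:
    """Assign each token to the nearest existing cluster if it is within
    `threshold` of that cluster's first member; otherwise open a new cluster."""
    clusters: List[List[Sequence[str]]] = []
    for tok in tokens:
        best_idx, best_dist = None, None
        for idx, cluster in enumerate(clusters):
            d = edit_distance(tok, cluster[0])  # compare to the cluster leader
            if best_dist is None or d < best_dist:
                best_idx, best_dist = idx, d
        if best_dist is not None and best_dist <= threshold:
            clusters[best_idx].append(tok)
        else:
            clusters.append([tok])
    return clusters


if __name__ == "__main__":
    # Hypothetical phoneme-like sequences for three noisy detections of one
    # word and one detection of a different word.
    detected = [
        ("s", "a", "r", "k", "o", "z", "i"),
        ("s", "a", "r", "k", "o", "z", "y"),
        ("o", "b", "a", "m", "a"),
        ("s", "a", "k", "o", "z", "i"),
    ]
    for cluster in incremental_clustering(detected, threshold=2):
        print(cluster)
```

In a scheme of this kind the result depends on the order in which tokens arrive and on the chosen threshold, which is one reason a threshold-free criterion can be attractive.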
Document type: Conference paper

https://hal.inria.fr/inria-00544140
Contributor: Dominique Fohr
Submitted on: Tuesday, December 7, 2010 - 1:24:49 PM
Last modification on: Thursday, January 11, 2018 - 6:19:56 AM

Identifiers

  • HAL Id: inria-00544140, version 1

Citation

Frederik Stouten, Irina Illina, Dominique Fohr. Regroupement des occurrences des mots hors-vocabulaire répétés en vue de leur modélisation pour la transcription d'émissions radio. 28ème Journées d'étude sur la parole - JEP'10, Université de Mons, May 2010, Mons, Belgique. ⟨inria-00544140⟩
