Textual Data Selection for Language Modelling in the Scope of Automatic Speech Recognition

Freha Mezzoudj; David Langlois; Denis Jouvet; Abdelkader Benyettou

Conference paper. Year: 2015

Freha Mezzoudj

Role: Author

Laboratoire Signal Image Parole

David Langlois

Role: Corresponding author
PersonId : 298
IdHAL : david-langlois
IdRef : 070239509

Statistical Machine Translation and Speech Modelization and Text

Denis Jouvet

Role: Author
PersonId : 15904
IdHAL : denis-jouvet
IdRef : 029418666

Speech Modeling for Facilitating Oral-Based Communication

Abdelkader Benyettou

Role: Author

Laboratoire Signal Image Parole

Abstract

The language model is an important module in many applications that produce natural language text, in particular speech recognition. Training language models requires large amounts of textual data that match the target domain. The selection of target-domain (or in-domain) data has been investigated in the past. For example, [1] proposed a criterion based on the difference of cross-entropy between models representing in-domain and non-domain-specific data. However, evaluations were conducted using only two sources of data: one corresponding to the in-domain data, and another to generic data from which sentences are selected. In the scope of broadcast news and TV show transcription systems, language models are built by interpolating several language models estimated from various data sources. This paper investigates the data selection process in this context of building interpolated language models for speech transcription. Results show that, in the selection process, the choice of the language models representing in-domain and non-domain-specific data is critical. Moreover, it is better to apply data selection only to some of the data sources. This way, the selection process leads to an improvement of 8.3 in perplexity and 0.2% in word error rate on the French broadcast transcription task.
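
As an illustration of the cross-entropy difference criterion mentioned in the abstract, here is a minimal Python sketch. It is not the authors' implementation: it assumes add-one-smoothed unigram models in place of the n-gram language models a real system would use, and the names `train_unigram`, `cross_entropy`, `select` and the `threshold` parameter are hypothetical. A candidate sentence is kept when its per-word cross-entropy under the in-domain model is lower than under the generic model by more than the chosen margin.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (word counts, total token count) for a list of sentences."""
    counts = Counter(word for s in sentences for word in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-word cross-entropy in bits under an add-one-smoothed unigram model.

    Assumption: unigrams stand in for the n-gram models used in practice.
    """
    counts, total = model
    words = sentence.split()
    log_prob = sum(
        math.log2((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return -log_prob / len(words)


def select(candidates, in_domain, generic, threshold=0.0):
    """Keep candidates scoring below `threshold`, best (lowest) score first.

    Score = cross-entropy under the in-domain model
            minus cross-entropy under the generic model.
    """
    vocab = {w for s in in_domain + generic + candidates for w in s.split()}
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(generic)
    scored = [
        (
            cross_entropy(s, in_model, len(vocab))
            - cross_entropy(s, gen_model, len(vocab)),
            s,
        )
        for s in candidates
        if s.split()  # skip empty sentences
    ]
    return [s for score, s in sorted(scored) if score < threshold]
```

In this sketch, lowering `threshold` keeps fewer, more in-domain-like sentences. How the threshold is tuned, and which models represent the in-domain and generic data, are exactly the choices the abstract reports as critical.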

Domains

Signal and image processing [eess.SP]

Main file

ICNLSP15-V8-final-aug2015.pdf (183.89 KB)

Origin: Files produced by the author(s)

Contributor: Denis Jouvet

https://inria.hal.science/hal-01184192

Submitted on: Thursday, 13 August 2015, 11:17:06

Last modified on: Monday, 11 September 2023, 17:41:19

Long-term archiving on: Saturday, 14 November 2015, 10:15:07

Dates and versions

hal-01184192, version 1 (13-08-2015)

Identifiers

HAL Id: hal-01184192, version 1

Cite

Freha Mezzoudj, David Langlois, Denis Jouvet, Abdelkader Benyettou. Textual Data Selection for Language Modelling in the Scope of Automatic Speech Recognition. International Conference on Natural Language and Speech Processing, Oct 2015, Alger, Algeria. ⟨hal-01184192⟩

Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD
