A Corpus Balancing Method for Language Model Construction

Abstract: The language model is an important component of any speech recognition system. In this paper, we present a lexical enrichment methodology for corpora, aimed at the construction of statistical language models. The methodology involves, on the one hand, identifying the set of poorly represented words in a given training corpus and, on the other hand, enriching that corpus by repeatedly including selected text fragments that contain these words. The first part of the paper describes the formal details of the methodology; the second part presents experiments and results that validate the method.
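The two steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give the formal criteria, so the frequency threshold `min_count`, the fragment pool `fragments`, the round limit `max_rounds`, and the function names `rare_words` and `enrich` are all assumptions made for illustration.

```python
from collections import Counter


def rare_words(corpus, min_count):
    """Return the words that occur fewer than min_count times in corpus.

    Assumption: "poorly represented" is modeled as a raw frequency threshold.
    """
    counts = Counter(word for sentence in corpus for word in sentence.split())
    return {word for word, n in counts.items() if n < min_count}


def enrich(corpus, fragments, min_count, max_rounds=10):
    """Repeatedly append fragments that contain still-rare target words.

    Assumption: the targets are the rare words of the original corpus only,
    so words introduced by the fragments do not become new targets.
    """
    enriched = list(corpus)
    targets = rare_words(corpus, min_count)
    for _ in range(max_rounds):
        still_rare = targets & rare_words(enriched, min_count)
        selected = [f for f in fragments if still_rare & set(f.split())]
        if not selected:
            break  # every target is covered, or no fragment can help
        enriched.extend(selected)
    return enriched


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran", "the dog sat"]
    fragments = ["a dog barked", "the cat slept"]
    print(enrich(corpus, fragments, min_count=2))
```

In this sketch a target word with no matching fragment simply stays rare, and the loop stops once no fragment contributes; a real system would need a principled source of fragments and a stopping criterion taken from the paper itself.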
Document type :
Conference papers

https://hal.inria.fr/inria-00326515
Contributor : Dominique Vaufreydaz
Submitted on : Friday, October 3, 2008 - 11:59:58 AM
Last modification on : Thursday, February 7, 2019 - 4:03:59 PM
Long-term archiving on : Friday, June 4, 2010 - 12:10:06 PM

File

Villasenor03a.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00326515, version 1

Collections

LIG | UGA

Citation

Luis Villaseñor-Pineda, Manuel Montes-Y-Gómez, Manuel Pérez-Coutiño, Dominique Vaufreydaz. A Corpus Balancing Method for Language Model Construction. Fourth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2003), Feb 2003, Mexico City, Mexico. 9 p. ⟨inria-00326515⟩
