Journal article, Computer Speech and Language, 2003

Statistical Language Modeling Based on Variable-Length Sequences

Imed Zitouni, Kamel Smaïli, Jean-Paul Haton


In natural language, and especially in spontaneous speech, people often group words into phrases that become common expressions. They do so for phonological reasons (to make pronunciation easier) or for semantic reasons (to remember a phrase more easily by assigning a meaning to a block of words). Classical language models do not adequately account for such phrases. A better approach is to model some word sequences as if they were individual dictionary entries: sequences are added to the vocabulary as additional entries, and language models are computed over this extended vocabulary. In this paper, we present a method for automatically retrieving the most relevant phrases from a corpus of written sentences. The originality of our approach is that the phrases are extracted from a linguistically tagged corpus, so the resulting phrases are linguistically viable. To measure the contribution of classes to phrase retrieval, we implemented the same algorithm without classes; the class-based method outperformed it by 11%. Our approach uses information-theoretic criteria that ensure high statistical consistency and make the decision to select a candidate sequence optimal with respect to language perplexity. We propose several variants of language models, with and without word sequences. Among them, we present a model in which the trigger pairs are linguistically more significant. We show that using sequences decreases the word error rate and improves the normalized perplexity. For instance, the best sequence model improves perplexity by 16% and the accuracy of our dictation system (MAUD) by approximately 14%. Experiments, in terms of perplexity and recognition rate, were carried out with a vocabulary of 20,000 words extracted from a corpus of 43 million words comprising two years of the French newspaper Le Monde. The acoustic model (HMM) was trained on the Bref80 corpus.
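The idea of treating a word sequence as a single vocabulary entry can be illustrated with a toy sketch. The Python code below is not the authors' algorithm, which works on a linguistically tagged corpus and is not detailed in this abstract. It assumes a small tokenized corpus, scores each adjacent word pair by its contribution to the average mutual information, merges the best pair into one token, and compares a unigram perplexity normalized by the original word count. All function names and the `_` joiner are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def mi_contributions(sentences):
    """Score each adjacent pair (a, b) by p(a,b) * log2(p(a,b) / (p(a) p(b))).

    This is one term of the average mutual information; it is an assumed
    selection criterion for this sketch, not the paper's exact criterion.
    """
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(p for s in sentences for p in zip(s, s[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = p_ab * math.log2(p_ab / (p_a * p_b))
    return scores


def merge_pair(sentences, pair, joiner="_"):
    """Replace every occurrence of `pair` with a single joined token."""
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and (s[i], s[i + 1]) == pair:
                out.append(s[i] + joiner + s[i + 1])
                i += 2
            else:
                out.append(s[i])
                i += 1
        merged.append(out)
    return merged


def normalized_perplexity(sentences, n_words):
    """Unigram perplexity under the corpus's own maximum-likelihood model.

    The log-probability is divided by `n_words`, the number of words in the
    ORIGINAL corpus, so models with and without sequences stay comparable.
    """
    counts = Counter(t for s in sentences for t in s)
    total = sum(counts.values())
    log_prob = sum(c * math.log2(c / total) for c in counts.values())
    return 2 ** (-log_prob / n_words)
```

In a realistic setting the perplexity would be measured on held-out text with a smoothed n-gram model, and a candidate sequence would be kept only if it lowers that perplexity; the toy version above measures it on the training corpus purely to keep the example short.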

Dates and versions

inria-00099785 , version 1 (26-09-2006)




Imed Zitouni, Kamel Smaïli, Jean-Paul Haton. Statistical Language Modeling Based on Variable-Length Sequences. Computer Speech and Language, 2003, 17 (1), pp.27-41. ⟨inria-00099785⟩

