Statistical Language Modeling Based on Variable-Length Sequences

Imed Zitouni 1, Kamel Smaïli 1, Jean-Paul Haton 1
1 PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: In natural language, and especially in spontaneous speech, people often group words into phrases that become common expressions. This happens for phonological reasons (to make pronunciation easier) or for semantic reasons (a phrase is easier to remember when a meaning is assigned to a block of words). Classical language models do not adequately take such phrases into account. A better approach is to model some word sequences as if they were individual dictionary elements. Sequences are treated as additional vocabulary entries, on which language models are computed. In this paper, we present a method for automatically retrieving the most relevant phrases from a corpus of written sentences. The originality of our approach lies in the fact that the phrases are extracted from a linguistically tagged corpus. The resulting phrases are therefore linguistically viable. To measure the contribution of classes in retrieving phrases, we implemented the same algorithm without using classes. The class-based method outperformed the other method by 11%. Our approach uses information-theoretic criteria, which ensure high statistical consistency and make the decision to select a candidate sequence optimal with respect to the language perplexity. We propose several variants of language models, with and without word sequences. Among them, we present a model in which the trigger pairs are linguistically more significant. We show that the use of sequences decreases the word error rate and improves the normalized perplexity. For instance, the best sequence model improves the perplexity by 16% and the accuracy of our dictation system (MAUD) by approximately 14%. Experiments, in terms of perplexity and recognition rate, were carried out on a vocabulary of 20,000 words extracted from a corpus of 43 million words made up of two years of the French newspaper Le Monde. The acoustic model (HMM) is trained on the Bref80 corpus.
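The idea of treating a word sequence as a single vocabulary entry can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract describes a class-based selection driven by perplexity on a tagged corpus, whereas this sketch assumes plain word pairs and substitutes pointwise mutual information (PMI) as the information-theoretic score. The function names `best_pair_by_pmi` and `merge_pair`, the `min_count` threshold, the `_` joiner and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def best_pair_by_pmi(sentences, min_count=2):
    """Return the adjacent word pair with the highest PMI, and its score.

    Assumption: PMI stands in for the paper's selection criterion,
    which is not specified in the abstract.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    best, best_score = None, float("-inf")
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimate is unreliable
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi > best_score:
            best, best_score = (a, b), pmi
    return best, best_score


def merge_pair(sentences, pair, joiner="_"):
    """Rewrite the corpus so that each occurrence of `pair` becomes one token."""
    a, b = pair
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and sent[i] == a and sent[i + 1] == b:
                out.append(a + joiner + b)  # new vocabulary entry
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged


# Toy corpus (hypothetical); each sentence is a list of word tokens.
corpus = [
    "new york is big".split(),
    "new york is far".split(),
    "the city is big".split(),
]
pair, score = best_pair_by_pmi(corpus)
merged_corpus = merge_pair(corpus, pair)
```

Repeating the select-and-merge step on the rewritten corpus would yield sequences longer than two words; a language model is then estimated over the enlarged vocabulary.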
Document type:
Journal article
Computer Speech and Language, Elsevier, 2003, 17 (1), pp.27-41
Contributor: Publications Loria
Submitted on: Tuesday, 26 September 2006 - 09:41:12
Last modified on: Thursday, 11 January 2018 - 06:19:57


  • HAL Id : inria-00099785, version 1



Imed Zitouni, Kamel Smaïli, Jean-Paul Haton. Statistical Language Modeling Based on Variable-Length Sequences. Computer Speech and Language, Elsevier, 2003, 17 (1), pp.27-41. 〈inria-00099785〉


