
Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach

Imed Zitouni (1), Kamel Smaïli (1), Jean-Paul Haton (1)
(1) PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: In natural language, several sequences of words are very frequent. A classical language model, such as an n-gram model, does not adequately take such sequences into account, because it underestimates their probabilities. A better approach consists in modelling word sequences as if they were individual dictionary elements. Sequences are treated as additional entries of the word lexicon, on which language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. This method is based on information-theoretic criteria, which ensure a high statistical consistency, and on French grammatical classes, which capture additional types of linguistic dependencies. In addition, perplexity is used to make the decision to select a candidate sequence more accurate. We also propose several variants of language models, with and without word sequences. Among them, we present a model in which the trigger pairs are more linguistically significant. The originality of this model, compared with commonly used trigger approaches, is that it uses word sequences to estimate the trigger pairs rather than limiting itself to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words. The word sequences proposed by our algorithm reduce perplexity by more than 16% compared with models limited to single words. Introducing these word sequences into our dictation machine improves accuracy by approximately 15%.
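As a rough illustration of treating frequent word sequences as single lexicon entries, the sketch below selects adjacent word pairs by pointwise mutual information and rewrites the corpus with the merged units. It is not the authors' algorithm: the abstract does not specify which information-theoretic criterion is used, and the grammatical-class constraints and the perplexity check are left out. The function names (`find_sequences`, `merge_sequences`), the thresholds, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def find_sequences(sentences, min_count=2, min_pmi=1.0):
    """Return adjacent word pairs with high pointwise mutual information.

    PMI is an assumed stand-in for the paper's selection criterion.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    selected = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs too rare to estimate reliably
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= min_pmi:
            selected[(a, b)] = pmi
    return selected


def merge_sequences(words, pairs):
    """Rewrite a sentence so each selected pair becomes one lexicon entry.

    Merging is greedy and left to right (an assumption of this sketch).
    """
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


# Toy corpus, invented for illustration only.
corpus = [
    "he lives in new york".split(),
    "new york is a big city".split(),
    "she moved to new york in may".split(),
    "a big dog lives in the city".split(),
]
pairs = find_sequences(corpus)
merged_corpus = [merge_sequences(s, pairs) for s in corpus]
```

A language model would then be estimated on `merged_corpus`, where units such as `new_york` count as single vocabulary items. Applying the same procedure again to the merged corpus would yield longer sequences, which is one plausible way to obtain variable-length units.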
Document type :
Conference papers

Contributor : Publications Loria
Submitted on : Tuesday, November 21, 2017 - 10:42:09 AM
Last modification on : Friday, February 26, 2021 - 3:28:06 PM




  • HAL Id : inria-00099107, version 1



Imed Zitouni, Kamel Smaïli, Jean-Paul Haton. Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach. International Conference on Speech Language Processing, 2000, Beijing, China. pp.4. ⟨inria-00099107⟩


