Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach

Imed Zitouni (1), Kamel Smaïli (1), Jean-Paul Haton (1)
(1) PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: In natural language, some word sequences occur very frequently. A classical language model, such as an n-gram model, does not adequately account for such sequences because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements: sequences are treated as additional entries in the word lexicon, over which the language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. The method is based on information-theoretic criteria, which ensure high statistical consistency, and on French grammatical classes, which capture additional types of linguistic dependencies. In addition, perplexity is used to make the decision to select a candidate sequence more accurate. We also propose several variants of language models, with and without word sequences. Among them, we present a model in which the trigger pairs are more linguistically significant. The originality of this model, compared with commonly used trigger approaches, is that it uses word sequences, rather than only single words, to estimate the trigger pairs. Experimental tests, in terms of perplexity and recognition rate, are carried out with a vocabulary of 20,000 words and a corpus of 43 million words. The word sequences proposed by our algorithm reduce perplexity by more than 16% compared to models limited to single words. Introducing these word sequences into our dictation machine improves accuracy by approximately 15%.
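The idea of treating frequent word sequences as lexicon entries can be illustrated with a minimal sketch. The abstract does not specify which information-theoretic criterion is used, so this example assumes pointwise mutual information (PMI) over adjacent word pairs as a stand-in. It omits the grammatical-class and perplexity checks that the abstract mentions. All function names and parameters here are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (assumed criterion)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_pair = count / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return scores


def select_sequences(tokens, top_k=1, min_count=2):
    """Return the top_k highest-scoring pairs as candidate lexicon entries."""
    scores = pmi_scores(tokens, min_count)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return set(best)


def merge_pairs(tokens, pairs, sep="_"):
    """Rewrite the corpus so that each selected pair becomes a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    corpus = "new york is big and new york is far and paris is big".split()
    selected = select_sequences(corpus, top_k=1)
    print(selected)
    print(merge_pairs(corpus, selected))
```

Repeating the selection and merge steps on the rewritten corpus would yield sequences longer than two words. A language model trained on the merged token stream then treats each sequence as a single vocabulary unit.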
Document type:
Conference paper
International Conference on Speech Language Processing, 2000, Beijing, China. pp.4, 2000

https://hal.inria.fr/inria-00099107
Contributor: Publications Loria
Submitted on: Tuesday, November 21, 2017 - 10:42:09
Last modified on: Thursday, January 11, 2018 - 06:19:57

File

ICSLP00.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: inria-00099107, version 1

Citation

Imed Zitouni, Kamel Smaïli, Jean-Paul Haton. Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach. International Conference on Speech Language Processing, 2000, Beijing, China. pp.4, 2000. 〈inria-00099107〉

Metrics

Record views

178

File downloads

14