Variable-Length Sequence Language Model for Large Vocabulary Continuous Dictation Machine

Imed Zitouni 1, Jean-François Mari 2, Kamel Smaïli 1, Jean-Paul Haton 1
1 PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
2 ORPAILLEUR - Knowledge representation, reasoning
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: In natural language, some word sequences are very frequent. A classical language model such as an n-gram model does not adequately account for these sequences because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements: sequences are treated as additional entries in the word lexicon, on which the language models are computed. In this paper, we present two methods for automatically determining frequent phrases in unlabeled corpora of written sentences. These methods are based on information-theoretic criteria that ensure high statistical consistency. Our models reach a local optimum because they minimize perplexity. The first procedure uses only the n-gram language model to extract word sequences. The second is based on a class n-gram model trained on 233 classes extracted from the eight grammatical classes of French. Experimental tests, in terms of perplexity and recognition rate, are carried out with a vocabulary of 20,000 words and a corpus of 43 million words extracted from the "Le Monde" newspaper. Our models reduce perplexity by more than 20% compared with n-gram (n ≤ 3) and multigram models. In terms of recognition rate, our models also outperform n-gram and multigram models.
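The idea of treating a frequent word sequence as a single lexicon entry, and judging it by perplexity, can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not specify their information-theoretic criteria, so the function names, the underscore-joined unit, the add-one smoothing, and the toy corpus below are all assumptions chosen for illustration. Perplexity is normalised by the original word count so that the merged and unmerged corpora can be compared.

```python
import math
from collections import Counter


def merge_sequence(tokens, pair):
    """Rewrite each occurrence of the word pair as one lexicon unit (hypothetical 'a_b' form)."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bigram_perplexity(train, test, n_words):
    """Add-one smoothed bigram perplexity, normalised by n_words (the original word count).

    Add-one smoothing is an assumption made to keep the sketch short.
    """
    vocab = set(train) | set(test)
    history_counts = Counter(train[:-1])
    bigram_counts = Counter(zip(train, train[1:]))
    log_prob = 0.0
    for prev, cur in zip(test, test[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / n_words)


# Toy corpus (illustrative only, not data from the paper).
corpus = "i live in new york and she lives in new york".split()
merged = merge_sequence(corpus, ("new", "york"))

pp_words = bigram_perplexity(corpus, corpus, len(corpus))
pp_units = bigram_perplexity(merged, merged, len(corpus))
print(merged)
print(pp_words, pp_units)
```

In a selection loop of this kind, a candidate sequence would be kept as a lexicon entry only if the merged corpus yields lower perplexity than the unmerged one; how candidates are proposed and scored in the paper itself is not stated in the abstract.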
Document type: Conference paper
6th European Conference on Speech Communication and Technology - EUROSPEECH'99, 1999, Budapest, Hungary

Cited literature: 10 references

https://hal.inria.fr/inria-00107585
Contributor: Publications Loria
Submitted on: Thursday, 19 October 2006 - 09:02:11
Last modified on: Thursday, 11 January 2018 - 06:19:55
Document(s) archived on: Wednesday, 29 March 2017 - 13:18:00

Identifiers

  • HAL Id : inria-00107585, version 1

Citation

Imed Zitouni, Jean-François Mari, Kamel Smaïli, Jean-Paul Haton. Variable-Length Sequence Language Model for Large Vocabulary Continuous Dictation Machine. 6th European Conference on Speech Communication and Technology - EUROSPEECH'99, 1999, Budapest, Hungary, 1999. 〈inria-00107585〉
