Dealing with distant relationships in natural language modelling for automatic speech recognition

David Langlois 1 Kamel Smaïli 1 Jean-Paul Haton 1
1 PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: Classical statistical language models, known as n-gram models, describe natural language through the probabilistic relationship between a word to be predicted and the n-1 contiguous words preceding it. The linguistic relationships present in a sentence are clearly more complex; in particular, there exist distant relationships. We present recent work on an alternative to n-gram models that splits the history and interpolates between distant bigram models. More precisely, our model is a cheaper alternative to high-order n-grams. In conventional n-grams, when n is greater than 3, events become infrequent and the statistics unreliable. To deal with this problem and to estimate parameters accurately, we combine a smoothed bigram with a distant 3-bigram, a distant 4-bigram and a cache of 100 words. We report new progress obtained by using a simulated annealing algorithm to compute the best parameters of this linear combination. With a 20K-word vocabulary and 40 million words for training, our algorithm improved perplexity by 5.4% compared with the Baum-Welch algorithm. Moreover, this new model outperforms a smoothed bigram by 6.1% in terms of perplexity.
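The linear combination described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a distance-d bigram conditions a word on the single word d positions before it, uses add-alpha smoothing in place of whatever smoothing the paper applies, and uses distances 1, 2 and 3 as placeholders, since the abstract does not define "distant 3-bigram" precisely. All function names, the annealing schedule and the toy corpus are hypothetical.

```python
import math
import random
from collections import Counter, deque


def train_distant_bigram(tokens, distance, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(w_i | w_{i-distance}) (assumed definition)."""
    pair_counts = Counter(zip(tokens, tokens[distance:]))
    context_counts = Counter(tokens[:-distance])

    def prob(word, context):
        return (pair_counts[(context, word)] + alpha) / (
            context_counts[context] + alpha * vocab_size)

    return prob


def component_probs(tokens, models, vocab_size, cache_size=100):
    """Per position, the probability each component assigns to the word."""
    rows = []
    cache = deque(maxlen=cache_size)  # the most recent cache_size words
    for i, word in enumerate(tokens):
        row = []
        for distance, prob in models:
            if i >= distance:
                row.append(prob(word, tokens[i - distance]))
            else:
                row.append(1.0 / vocab_size)  # no context yet: uniform
        # Cache component: relative frequency of the word in the cache.
        row.append(cache.count(word) / len(cache) if cache else 1.0 / vocab_size)
        cache.append(word)
        rows.append(row)
    return rows


def perplexity(rows, weights):
    """Perplexity of the linearly interpolated model on the given rows."""
    log_sum = sum(
        math.log(sum(w * p for w, p in zip(weights, row))) for row in rows)
    return math.exp(-log_sum / len(rows))


def anneal_weights(rows, n_components, steps=2000, temp=1.0,
                   cooling=0.995, seed=0):
    """Simulated annealing over interpolation weights (hypothetical schedule)."""
    rng = random.Random(seed)
    current = [1.0 / n_components] * n_components
    current_pp = perplexity(rows, current)
    best, best_pp = current, current_pp
    for _ in range(steps):
        # Perturb the weights, keep them positive, renormalise to sum to 1.
        proposal = [max(1e-6, w + rng.gauss(0.0, 0.05)) for w in current]
        total = sum(proposal)
        proposal = [w / total for w in proposal]
        pp = perplexity(rows, proposal)
        # Accept improvements always, worse moves with Boltzmann probability.
        if pp < current_pp or rng.random() < math.exp((current_pp - pp) / temp):
            current, current_pp = proposal, pp
            if pp < best_pp:
                best, best_pp = proposal, pp
        temp *= cooling
    return best, best_pp


# Toy data, invented for illustration only.
train = "the cat sat on the mat and the dog sat on the rug".split()
heldout = "the dog sat on the mat".split()
vocab = sorted(set(train) | set(heldout))

models = [(d, train_distant_bigram(train, d, len(vocab))) for d in (1, 2, 3)]
rows = component_probs(heldout, models, len(vocab))
uniform_pp = perplexity(rows, [0.25] * 4)
best_weights, best_pp = anneal_weights(rows, 4)
print(f"uniform weights: {uniform_pp:.3f}, annealed weights: {best_pp:.3f}")
```

The weights are tuned on held-out text rather than on the training text, so that the combination is not biased toward the component that best memorises the training data. The search starts from uniform weights and keeps the best point visited, so the returned perplexity is never worse than the uniform baseline on the tuning text.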
Document type:
Conference paper
4th World Multiconference on Systemics, Cybernetics & Informatics - SCI'2000, 2000, Orlando, USA, 6, pp.400-405, 2000

https://hal.inria.fr/inria-00099031
Contributor: Publications Loria
Submitted on: Tuesday, 26 September 2006 - 08:46:29
Last modified on: Thursday, 11 January 2018 - 06:19:57
Document(s) archived on: Wednesday, 29 March 2017 - 12:42:16


Identifiers

  • HAL Id: inria-00099031, version 1


Citation

David Langlois, Kamel Smaïli, Jean-Paul Haton. Dealing with distant relationships in natural language modelling for automatic speech recognition. 4th World Multiconference on Systemics, Cybernetics & Informatics - SCI'2000, 2000, Orlando, USA, 6, pp.400-405, 2000. 〈inria-00099031〉
