Dealing with distant relationships in natural language modelling for automatic speech recognition

David Langlois 1 Kamel Smaïli 1 Jean-Paul Haton 1
1 PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : Classical statistical language models, known as n-gram models, describe natural language through the probabilistic relationship between a word to be predicted and the n-1 contiguous words preceding it. The linguistic relationships within a sentence are clearly more complex; in particular, some of them hold between distant words. We present recent work on an alternative to n-gram models that splits the history and interpolates between distant bigram models. More precisely, our model is a cheaper alternative to high-order n-grams. In conventional n-grams, when n is greater than 3, events become less frequent and the statistics are unreliable. To deal with this problem and to estimate parameters accurately, we combine a smoothed bigram with a distant 3-bigram, a distant 4-bigram and a cache of 100 words. We report further progress obtained by using a simulated annealing algorithm to compute the best parameters of this linear combination. With a 20K-word vocabulary and 40 million words of training data, our algorithm improved perplexity by 5.4% compared with the Baum-Welch algorithm. Moreover, this new model outperforms a smoothed bigram by 6.1% in terms of perplexity.
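The linear combination described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several points are assumptions made only for illustration: a "distant d-bigram" is read here as a bigram whose context is the word d positions back, add-alpha smoothing stands in for whatever smoothing the paper uses, the cache is a plain unigram over the last 100 words, and the annealing schedule (Gaussian proposals, geometric cooling) is a generic one. All function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def train_distant_bigram(tokens, distance, vocab_size, alpha=1.0):
    """Estimate P(w | word `distance` positions back) with add-alpha smoothing (assumed)."""
    pair_counts = defaultdict(Counter)
    for i in range(distance, len(tokens)):
        pair_counts[tokens[i - distance]][tokens[i]] += 1

    def prob(word, context):
        counts = pair_counts.get(context, Counter())
        return (counts[word] + alpha) / (sum(counts.values()) + alpha * vocab_size)

    return prob


def component_probs(tokens, i, models, vocab_size, cache_size=100):
    """Probability each component assigns to tokens[i]; the cache comes last."""
    word = tokens[i]
    probs = []
    for distance, model in models:
        if i >= distance:
            probs.append(model(word, tokens[i - distance]))
        else:
            probs.append(1.0 / vocab_size)  # no context yet: fall back to uniform
    recent = tokens[max(0, i - cache_size):i]
    probs.append(recent.count(word) / len(recent) if recent else 1.0 / vocab_size)
    return probs


def perplexity(features, weights):
    """Perplexity of the linear interpolation sum_k weights[k] * P_k(w)."""
    log_sum = sum(
        math.log(sum(w * p for w, p in zip(weights, probs))) for probs in features
    )
    return math.exp(-log_sum / len(features))


def anneal_weights(features, n_components, steps=2000, temp=1.0, cooling=0.995, seed=0):
    """Search for interpolation weights with a generic simulated annealing loop."""
    rng = random.Random(seed)
    current = [1.0 / n_components] * n_components
    current_ppl = perplexity(features, current)
    best, best_ppl = current, current_ppl
    for _ in range(steps):
        # Perturb the weights, keep them strictly positive, renormalise to sum to 1.
        proposal = [max(1e-6, w + rng.gauss(0.0, 0.05)) for w in current]
        total = sum(proposal)
        proposal = [w / total for w in proposal]
        ppl = perplexity(features, proposal)
        # Always accept improvements; accept worse moves with a temperature-dependent chance.
        if ppl < current_ppl or rng.random() < math.exp((current_ppl - ppl) / temp):
            current, current_ppl = proposal, ppl
            if ppl < best_ppl:
                best, best_ppl = proposal, ppl
        temp *= cooling
    return best, best_ppl


# Toy usage: distances 1, 3 and 4 plus the cache give four components.
tokens = "the cat sat on the mat and the cat ate the fish on the mat".split()
vocab_size = len(set(tokens))
models = [(d, train_distant_bigram(tokens, d, vocab_size)) for d in (1, 3, 4)]
features = [component_probs(tokens, i, models, vocab_size) for i in range(len(tokens))]
best_weights, best_ppl = anneal_weights(features, n_components=4)
print(best_weights, best_ppl)
```

In this sketch the weights are tuned on the same toy text used for counting, purely to keep the example short; in practice the weights would be optimised on held-out data. Precomputing the per-position component probabilities once keeps each annealing step cheap, since only the weighted sum changes between proposals.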
Document type :
Conference papers

https://hal.inria.fr/inria-00099031
Contributor : Publications Loria
Submitted on : Tuesday, September 26, 2006 - 8:46:29 AM
Last modification on : Thursday, January 11, 2018 - 6:19:57 AM
Long-term archiving on: Wednesday, March 29, 2017 - 12:42:16 PM

Identifiers

  • HAL Id : inria-00099031, version 1

Citation

David Langlois, Kamel Smaïli, Jean-Paul Haton. Dealing with distant relationships in natural language modelling for automatic speech recognition. 4th World Multiconference on Systemics, Cybernetics & Informatics - SCI'2000, International Institute of Informatics & Systemics, 2000, Orlando, USA, pp.400-405. ⟨inria-00099031⟩
