
Cleaning statistical language models

Abstract : In this paper, we describe how to decide whether an n-gram is actually impossible in a language. We apply decision rules to a corpus tagged with parts of speech (POS). These rules are based on statistical and phonological criteria. In terms of statistical language modeling, deciding that an n-gram is impossible amounts to assigning it a probability of zero. We redistribute the released probability mass over the possible n-grams. To do this, we define a new formulation of P(w|h). We first apply the principle of impossible events to bigrams, then use the list of impossible bigrams to build a list of impossible trigrams. The new trigram model improves on the baseline model by 5.53% in terms of perplexity.
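
The redistribution step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formulation of P(w|h), so everything below is an assumption made for illustration: the function name `cleaned_bigram_model`, the add-one smoothed baseline, and plain renormalization over the remaining bigrams. The list of impossible bigrams is taken as given; the POS-based and phonological decision rules are not modeled.

```python
from collections import Counter, defaultdict


def cleaned_bigram_model(tokens, impossible):
    """Return P(w|h) as nested dicts, with impossible bigrams set to zero.

    Assumption: the baseline is an add-one smoothed bigram model, and the
    mass released by the impossible bigrams is spread over the remaining
    ones by simple renormalization. The paper may use a different scheme.
    """
    # Count how often each word w follows each history h.
    counts = defaultdict(Counter)
    for h, w in zip(tokens, tokens[1:]):
        counts[h][w] += 1

    vocab = sorted(set(tokens))
    size = len(vocab)
    model = {}
    for h in vocab:
        total = sum(counts[h].values())
        # Baseline: add-one smoothing gives every bigram a non-zero probability.
        base = {w: (counts[h][w] + 1) / (total + size) for w in vocab}
        # Mass that stays with the bigrams still considered possible.
        allowed = sum(p for w, p in base.items() if (h, w) not in impossible)
        if allowed == 0:
            raise ValueError(f"every successor of {h!r} is marked impossible")
        # Impossible bigrams get zero; the others are rescaled to sum to one.
        model[h] = {
            w: 0.0 if (h, w) in impossible else p / allowed
            for w, p in base.items()
        }
    return model


tokens = "the cat sat on the mat".split()
model = cleaned_bigram_model(tokens, impossible={("the", "the")})
print(model["the"])
```

Each row of the resulting model still sums to one, so the model remains a proper conditional distribution and perplexity can be computed as usual.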
Document type :
Conference papers

https://hal.inria.fr/inria-00579376
Contributor : David Langlois
Submitted on : Wednesday, March 23, 2011 - 4:06:30 PM
Last modification on : Monday, September 24, 2018 - 9:04:03 AM

Identifiers

  • HAL Id : inria-00579376, version 1

Citation

Reda Jourani, David Langlois, Kamel Smaïli, Khalid Daoudi, Driss Aboutajdine. Cleaning statistical language models. 3d. International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia. ⟨inria-00579376⟩
