
Cleaning statistical language models

Abstract : In this paper, we describe how to decide that an n-gram is actually impossible in a language. We apply decision rules to a corpus tagged with parts of speech (POS). These rules are based on statistical and phonological criteria. In terms of statistical language modeling, deciding that an n-gram is impossible amounts to assigning it a zero probability. We redistribute the released probability mass over the possible n-grams. To do this, we define a new formulation of P(w|h). We apply the principle of impossible events to bigrams, then use the list of impossible bigrams to build a list of impossible trigrams. The new trigram model improves on the baseline model by 5.53% in terms of perplexity.
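The abstract does not give the new formulation of P(w|h), so the following is only a minimal sketch of the general idea, under stated assumptions: impossible continuations get zero probability, and the released mass is spread over the remaining words in proportion to their original probabilities. The function names, the proportional renormalization, the rule that a trigram is impossible when it contains an impossible bigram, and the toy data are all hypothetical illustrations, not the authors' method.

```python
from typing import Dict, Set, Tuple

Bigram = Tuple[str, str]
Trigram = Tuple[str, str, str]


def clean_distribution(
    probs: Dict[str, float],
    previous_word: str,
    impossible: Set[Bigram],
) -> Dict[str, float]:
    """Zero out impossible continuations of `previous_word` and renormalize.

    Assumption: the released mass is shared proportionally among the
    possible words. The paper's actual formulation may differ.
    """
    kept = {
        w: p for w, p in probs.items() if (previous_word, w) not in impossible
    }
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("every continuation was marked impossible")
    # Impossible words get exactly 0.0; the others are rescaled to sum to 1.
    return {w: (kept[w] / total if w in kept else 0.0) for w in probs}


def is_impossible_trigram(trigram: Trigram, impossible: Set[Bigram]) -> bool:
    """Assumed rule: a trigram is impossible if either bigram inside it is."""
    w1, w2, w3 = trigram
    return (w1, w2) in impossible or (w2, w3) in impossible


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    impossible_bigrams = {("a", "apple")}
    p_after_a = {"apple": 0.2, "pear": 0.5, "plum": 0.3}
    print(clean_distribution(p_after_a, "a", impossible_bigrams))
    print(is_impossible_trigram(("eat", "a", "apple"), impossible_bigrams))
```

In this toy case the 0.2 of mass taken from the impossible bigram is shared between the two remaining words, so the cleaned distribution still sums to one.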
Document type : Conference papers
Contributor : David Langlois
Submitted on : Wednesday, March 23, 2011 - 4:06:30 PM
Last modification on : Friday, February 26, 2021 - 3:28:06 PM


  • HAL Id : inria-00579376, version 1



Reda Jourani, David Langlois, Kamel Smaïli, Khalid Daoudi, Driss Aboutajdine. Cleaning statistical language models. 3rd International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia. ⟨inria-00579376⟩


