Conference paper, Year: 2010

Cleaning statistical language models

Abstract

In this paper, we describe how to decide that an n-gram is actually impossible in a language. We apply decision rules to a POS-tagged corpus. These rules are based on statistical and phonological criteria. In terms of statistical language modeling, deciding that an n-gram is impossible amounts to assigning it a null probability. We redistribute the released probability mass over the possible n-grams. To do this, we define a new formulation of P(w|h). We first apply the principle of impossible events to bigrams, then use the list of impossible bigrams to build a list of impossible trigrams. The new trigram model outperforms the baseline model by 5.53% in terms of perplexity.
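The abstract does not give the new formulation of P(w|h), so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the simplest possible redistribution: impossible bigrams receive probability zero and the remaining probabilities for each history are renormalized proportionally. It also assumes that a trigram counts as impossible when either of its two component bigrams is impossible, which is one plausible reading of the abstract. The function names, the toy distribution, and the example bigram list are all hypothetical.

```python
def clean_bigram_model(probs, impossible):
    """Zero out impossible bigrams and renormalize the rest.

    probs: dict mapping a history word h to a dict {w: P(w | h)}.
    impossible: set of (h, w) pairs judged impossible.
    Assumption: the released mass is spread proportionally over the
    possible words; the paper's actual formulation may differ.
    """
    cleaned = {}
    for history, dist in probs.items():
        # Mass that stays on the possible continuations of this history.
        kept_mass = sum(
            p for w, p in dist.items() if (history, w) not in impossible
        )
        if kept_mass == 0.0:
            raise ValueError(f"no possible continuation left for {history!r}")
        cleaned[history] = {
            w: 0.0 if (history, w) in impossible else p / kept_mass
            for w, p in dist.items()
        }
    return cleaned


def is_impossible_trigram(w1, w2, w3, impossible_bigrams):
    """Assumed rule: a trigram is impossible if it contains an impossible bigram."""
    return (w1, w2) in impossible_bigrams or (w2, w3) in impossible_bigrams


# Toy data, invented purely for illustration.
bigram_probs = {"the": {"cat": 0.5, "of": 0.25, "dog": 0.25}}
impossible_bigrams = {("the", "of")}

cleaned = clean_bigram_model(bigram_probs, impossible_bigrams)
print(cleaned["the"])
print(is_impossible_trigram("see", "the", "of", impossible_bigrams))
```

In this toy case the 0.25 released by the bigram ("the", "of") is shared between "cat" and "dog" in proportion to their original probabilities, so the cleaned distribution for the history "the" still sums to one.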
File not deposited

Dates and versions

inria-00579376 , version 1 (23-03-2011)

Identifiers

  • HAL Id: inria-00579376, version 1

Cite

Reda Jourani, David Langlois, Kamel Smaïli, Khalid Daoudi, Driss Aboutajdine. Cleaning statistical language models. 3d. International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia. ⟨inria-00579376⟩