Cleaning statistical language models

Abstract: In this paper, we describe how to decide that an n-gram is actually impossible in a language. We apply decision rules to a corpus tagged with parts of speech (POS). These rules are based on statistical and phonological criteria. In terms of statistical language modeling, deciding that an n-gram is impossible amounts to assigning it zero probability. We redistribute the freed probability mass over the possible n-grams. To do this, we define a new formulation of P(w|h). We apply the principle of impossible events to bigrams, then use the list of impossible bigrams to build a list of impossible trigrams. The new trigram model improves on the baseline model by 5.53% in terms of perplexity.
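
The paper's own formulation of P(w|h) is not reproduced on this page. The following is only a minimal sketch of the general idea, under two assumptions that may differ from the authors' method: the freed mass is redistributed by plain renormalization over the remaining words, and a trigram is treated as impossible when either of its two bigrams is impossible. The function names and the toy data are hypothetical.

```python
from itertools import product


def clean_distribution(probs, impossible):
    """Zero out impossible words for one history h and renormalize the rest.

    probs: dict mapping word w -> baseline P(w|h), assumed to sum to 1.
    impossible: set of words w such that the n-gram (h, w) is judged impossible.
    Assumption: the freed mass is spread proportionally (plain renormalization).
    """
    kept_mass = sum(p for w, p in probs.items() if w not in impossible)
    if kept_mass == 0:
        raise ValueError("every continuation was marked impossible")
    return {
        w: 0.0 if w in impossible else p / kept_mass
        for w, p in probs.items()
    }


def impossible_trigrams(vocab, impossible_bigrams):
    """Assumption: a trigram is impossible if it contains an impossible bigram."""
    return {
        (a, b, c)
        for a, b, c in product(vocab, repeat=3)
        if (a, b) in impossible_bigrams or (b, c) in impossible_bigrams
    }


# Toy example: after the history "a", the word "apple" is ruled out
# by a phonological rule, so its probability mass goes to the other words.
baseline = {"cat": 0.5, "apple": 0.25, "dog": 0.25}
cleaned = clean_distribution(baseline, impossible={"apple"})

trigrams = impossible_trigrams(["a", "an", "apple"], {("a", "apple")})
```

In this toy case the cleaned distribution still sums to 1, with the ruled-out word at exactly zero, which is the property the abstract relies on when it compares perplexities.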
Document type:
Conference paper
3d. International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia. 2010

https://hal.inria.fr/inria-00579376
Contributor: David Langlois
Submitted on: Wednesday, March 23, 2011 - 16:06:30
Last modified on: Saturday, April 28, 2018 - 01:11:24

Identifiers

  • HAL Id : inria-00579376, version 1

Citation

Reda Jourani, David Langlois, Kamel Smaïli, Khalid Daoudi, Driss Aboutajdine. Cleaning statistical language models. 3d. International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia. 2010. 〈inria-00579376〉
