Cleaning statistical language models

Reda Jourani; David Langlois; Kamel Smaïli; Khalid Daoudi; Driss Aboutajdine

Conference paper, Year: 2010



Reda Jourani

Role: Author
PersonId : 881619
IdRef : 165708018

Geometry and Statistics in acquisition data

David Langlois

Role: Author
PersonId : 298
IdHAL : david-langlois
IdRef : 070239509

Analysis, perception and recognition of speech

Kamel Smaïli

Role: Author
PersonId : 2521
IdHAL : kamel-smaili
IdRef : 034429700

Analysis, perception and recognition of speech

Khalid Daoudi

Role: Author
PersonId : 1329075
ORCID : 0000-0003-3536-1060
IdRef : 115483500

Geometry and Statistics in acquisition data

Driss Aboutajdine

Role: Author

Laboratoire de Recherche en Informatique et Télécommunications [Rabat]

Abstract

In this paper, we describe how to decide whether an n-gram is actually impossible in a language. We use decision rules on a POS-tagged corpus. These rules are based on statistical and phonological criteria. In terms of statistical language modeling, deciding that an n-gram is impossible amounts to assigning it zero probability. We redistribute the released probability mass over the possible n-grams. To do this, we define a new formulation of P(w|h). We apply the principle of impossible events to bigrams, then use the list of impossible bigrams to build a list of impossible trigrams. The new trigram model outperforms the baseline model by 5.53% in terms of perplexity.
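
The abstract does not give the new formulation of P(w|h), so the following is only a minimal sketch of the general idea, under two assumptions that are not taken from the paper: the released mass is redistributed by simple renormalization over the remaining continuations, and a trigram is treated as impossible when either of its two bigrams is impossible. The function names, the toy vocabulary and the probabilities are all hypothetical.

```python
from itertools import product


def impossible_trigrams(vocab, impossible_bigrams):
    """Assumed rule (not from the paper): a trigram is impossible
    if its first or its second bigram is in the impossible list."""
    return {
        (a, b, c)
        for a, b, c in product(vocab, repeat=3)
        if (a, b) in impossible_bigrams or (b, c) in impossible_bigrams
    }


def clean_distribution(probs, history, impossible):
    """Give zero probability to impossible continuations of `history`
    and renormalize the rest so the distribution still sums to 1.
    Renormalization is an assumption; the paper's formula may differ."""
    kept = {w: p for w, p in probs.items() if history + (w,) not in impossible}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("every continuation is marked impossible")
    return {w: (kept[w] / total if w in kept else 0.0) for w in probs}


# Hypothetical toy data, for illustration only.
vocab = ["a", "an", "apple", "dog"]
impossible_bigrams = {("a", "apple"), ("an", "dog")}

# Toy baseline distribution P(w | "a").
baseline = {"apple": 0.4, "dog": 0.4, "a": 0.1, "an": 0.1}
cleaned = clean_distribution(baseline, ("a",), impossible_bigrams)
trigrams = impossible_trigrams(vocab, impossible_bigrams)
```

Here the 0.4 of probability mass that the baseline gave to the impossible bigram is shared among the three remaining words in proportion to their baseline probabilities.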

Domains

Artificial intelligence [cs.AI]


https://inria.hal.science/inria-00579376

Submitted on: Wednesday, 23 March 2011, 16:06:30

Last modified on: Wednesday, 31 January 2024, 15:22:16

Dates and versions

inria-00579376 , version 1 (23-03-2011)

Identifiers

HAL Id : inria-00579376 , version 1

Cite

Reda Jourani, David Langlois, Kamel Smaïli, Khalid Daoudi, Driss Aboutajdine. Cleaning statistical language models. 3rd International Conference on Information Systems and Economic Intelligence (SIIE'2010), Feb 2010, Sousse, Tunisia. ⟨inria-00579376⟩


Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA

