inria-00582493, version 1
Modeling Arabic Language using statistical methods
Arabian Journal for Science and Engineering 35, 2C (2010) 69-82
Abstract: In this paper we investigate statistical language models for Arabic. First, several experiments using different smoothing techniques are carried out on a small corpus extracted from a daily newspaper. The sparseness of the data leads us to investigate other solutions that do not require increasing the size of the corpus. A word segmentation technique is employed to increase the statistical viability of the corpus. An n-morpheme model is developed, which leads to better performance in terms of normalized perplexity. The second experiment studies the behaviour of statistical models based on different kinds of corpora. The introduction of distant n-grams improves the baseline model. Finally, we propose a comparative study of statistical language models for Arabic and several foreign languages. The objective of this study is to understand how best to model each of these languages. For the foreign languages, trigram models are the most appropriate whatever the smoothing technique used. For Arabic, higher-order n-gram models smoothed with the Witten-Bell method are more efficient.
- 1: Université Badji Mokhtar
- 2: INRIA – CNRS : UMR7503 – Université Henri Poincaré - Nancy I – Université Nancy II – Institut National Polytechnique de Lorraine (INPL)
- Domain: Computer Science/Computation and Language
- Keywords: language model, morphemes, perplexity, smoothing, distant model
- http://hal.inria.fr/inria-00582493
- oai:hal.inria.fr:inria-00582493
- Contributor:
- Submitted on: Friday, 1 April 2011, 16:00:07
- Last modified on: Friday, 1 April 2011, 16:00:07