Evaluation of Topic Identification Methods on Arabic Corpora

Mourad Abbas; Kamel Smaili; Daoud Berkani

Journal article. Journal of Digital Information Management. Year: 2011



Mourad Abbas

Role: Author
PersonId : 950687

Centre de Recherche Scientifique et Technique pour le Développement de la Langue Arabe

Kamel Smaili

Role: Author
PersonId : 2521
IdHAL : kamel-smaili
IdRef : 034429700

Analysis, perception and recognition of speech

Statistical Machine Translation and Speech Modelization and Text

Daoud Berkani

Role: Author

Department of Electronics [El Harrach]

Abstract

Topic identification is key to the success of many applications. However, few works address it for the Arabic language, owing to the lack of standard corpora. In this study, we provide directly comparable results for six text categorization methods on a new Arabic corpus, Alwatan-2004. The methods evaluated are the Topic Unigram Language Model (TULM), Term Frequency/Inverse Document Frequency (TFIDF), Neural Network, SVM, M-SVM and TR. The experiments show that the TR-Classifier is the most efficient of these classifiers, with the exception of binary SVM, which alone outperforms it thanks to its characteristics. Moreover, the Alwatan-2004 corpus used in our experiments is considered the largest of any Arabic corpus used for topic identification experiments to date. In addition, we use small vocabularies in order to reduce computation time. This matters for adaptive language modeling, particularly topic adaptation, which is required in real-time applications such as speech recognition and machine translation systems. Our experiments indicate that the results are better than those of other works dealing with Arabic text categorization.
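As a rough illustration of the first method named in the abstract, the sketch below shows how a topic unigram language model could assign a topic to a document: one unigram distribution is estimated per topic over a fixed vocabulary, and the document receives the topic whose model gives it the highest log-likelihood. This is not the authors' implementation. The function names, the add-one smoothing, and the toy English tokens are all assumptions made for illustration, and nothing here is drawn from Alwatan-2004.

```python
import math
from collections import Counter


def train_unigram_models(docs_by_topic, vocabulary):
    """Estimate one unigram model per topic over a fixed vocabulary.

    Add-one (Laplace) smoothing is an assumption of this sketch; the
    abstract does not say which smoothing, if any, the authors used.
    """
    models = {}
    for topic, docs in docs_by_topic.items():
        # Count only in-vocabulary words across all documents of the topic.
        counts = Counter(w for doc in docs for w in doc if w in vocabulary)
        denom = sum(counts.values()) + len(vocabulary)
        models[topic] = {w: (counts[w] + 1) / denom for w in vocabulary}
    return models


def identify_topic(doc, models):
    """Return the topic whose unigram model maximizes the log-likelihood."""
    def log_likelihood(model):
        # Out-of-vocabulary words are skipped rather than penalized.
        return sum(math.log(model[w]) for w in doc if w in model)

    return max(models, key=lambda topic: log_likelihood(models[topic]))


# Hypothetical toy data, tokenized by hand.
train = {
    "sport": [["match", "team", "goal"], ["team", "player", "goal"]],
    "economy": [["market", "price", "bank"], ["bank", "price", "trade"]],
}
vocabulary = {w for docs in train.values() for doc in docs for w in doc}
models = train_unigram_models(train, vocabulary)
predicted = identify_topic(["team", "goal", "price"], models)
print(predicted)
```

The `vocabulary` argument is where the vocabulary size mentioned in the abstract would come into play: each model stores one probability per vocabulary word, so a smaller vocabulary means smaller models.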

Keywords

Topic Identification; Arabic Language; SVM

Domains

Computation and Language [cs.CL]


https://inria.hal.science/hal-01586544

Submitted on: Wednesday, September 13, 2017, 01:30:23

Last modified on: Thursday, March 7, 2024, 10:34:03

Dates and versions

hal-01586544, version 1 (13-09-2017)

Identifiers

HAL Id: hal-01586544, version 1

Cite

Mourad Abbas, Kamel Smaili, Daoud Berkani. Evaluation of Topic Identification Methods on Arabic Corpora. Journal of Digital Information Management, 2011, 9 (5), pp.8 double column. ⟨hal-01586544⟩


Collections

CNRS INRIA UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD

