TR-Classifier and kNN Evaluation for Topic Identification tasks

Abstract: This paper studies topic identification for the Arabic language using two methods. The first is the well-known kNN (k Nearest Neighbors) classifier, used as a baseline. The second is the TR-Classifier, which is mainly based on computing triggers. The experiments show that the TR-Classifier outperforms kNN while using much smaller topic vocabularies. TR-Classifier performance improves when the number of triggers and the size of the topic vocabularies are increased jointly. The TR-Classifier uses one vocabulary per topic, whereas kNN needs a general vocabulary, which is obtained by concatenating the topic vocabularies used by the TR-Classifier. In addition to the standard Recall and Precision measures used for evaluation, we have drawn ROC curves for some topics to illustrate more clearly the difference in performance between the two classifiers. The corpus used in our experiments was downloaded from an online Arabic newspaper. It contains about 10 million words, distributed over six selected topics: culture, religion, economy, local news, international news and sports.
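The kNN baseline and the per-topic Recall and Precision measures mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cosine similarity, the raw term-count vectors, the value of k, the toy English vocabularies and all function names are assumptions made for illustration, since the abstract does not specify them. The TR-Classifier is not sketched because the abstract does not define how triggers are computed.

```python
import math
from collections import Counter


def general_vocabulary(topic_vocabs):
    """Concatenate the topic vocabularies into one general vocabulary.

    Duplicate words are dropped (an assumption) and first-seen order is kept.
    """
    seen, vocab = set(), []
    for words in topic_vocabs.values():
        for word in words:
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def vectorize(tokens, vocab):
    """Represent a document as raw term counts over the general vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(doc_vec, train, k=3):
    """Majority vote among the k training documents most similar to doc_vec.

    `train` is a list of (vector, topic) pairs.
    """
    neighbours = sorted(train, key=lambda item: cosine(doc_vec, item[0]),
                        reverse=True)[:k]
    votes = Counter(topic for _, topic in neighbours)
    return votes.most_common(1)[0][0]


def recall_precision(gold, predicted, topic):
    """Recall and Precision for one topic from gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == topic and p == topic)
    fn = sum(1 for g, p in pairs if g == topic and p != topic)
    fp = sum(1 for g, p in pairs if g != topic and p == topic)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Toy data (hypothetical; the paper uses an Arabic newspaper corpus).
topic_vocabs = {"sports": ["match", "team"], "economy": ["market", "bank"]}
vocab = general_vocabulary(topic_vocabs)
train = [
    (vectorize(["match", "team", "team"], vocab), "sports"),
    (vectorize(["team", "match"], vocab), "sports"),
    (vectorize(["bank", "market"], vocab), "economy"),
    (vectorize(["market", "market", "bank"], vocab), "economy"),
]
prediction = knn_predict(vectorize(["team", "match", "bank"], vocab), train, k=3)
scores = recall_precision(
    ["sports", "economy", "sports"],   # gold topics
    ["sports", "sports", "economy"],   # predicted topics
    "sports",
)
```

In this toy run the test document shares two of its three words with the sports documents, so two of its three nearest neighbours vote for sports. Any other similarity measure or term weighting could be substituted in `cosine` and `vectorize` without changing the rest of the sketch.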
Document type:
Journal article
International Journal on Information and Communication Technologies, Serials Publications, 2010, 3 (3), pp.10

Cited literature [44 references]

https://hal.inria.fr/hal-01586549
Contributor: Kamel Smaïli
Submitted on: Monday, November 20, 2017 - 15:29:32
Last modified on: Thursday, January 11, 2018 - 06:27:18
Document(s) archived on: Wednesday, February 21, 2018 - 15:59:55

File

TR-Classifier_and_kNN_Evaluati...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01586549, version 1

Citation

Mourad Abbas, Kamel Smaïli, Daoud Berkani. TR-Classifier and kNN Evaluation for Topic Identification tasks. International Journal on Information and Communication Technologies, Serials Publications, 2010, 3 (3), pp.10. 〈hal-01586549〉
