Supervised feature extraction for text categorization

Abstract: This paper concerns finding the 'optimal' number of (non-overlapping) word groups for text classification. We present a method that uses a set of pre-classified texts to select which words to cluster into word groups and how many such groups to use. The method performs a 'greedy' search through the space of possible word groups. Words are grouped according to the Jensen-Shannon divergence between their corresponding distributions over the classes. The criterion for choosing the number of word groups is based on Rissanen's MDL principle. We present empirical results indicating that the proposed method performs well. Furthermore, the proposed method outperforms cross-validation in the sense that it selects far fewer word groups while prediction accuracy is only slightly worse. The experiments used a subset of the '20 Newsgroup' data set.
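
The grouping step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the equal weighting in the divergence, the rule of merging the closest pair by summing class counts, and the fixed target number of groups (standing in for the MDL-based criterion, whose details the abstract does not give) are all assumptions, and the names `js_divergence` and `greedy_group` are hypothetical.

```python
import math
from itertools import combinations


def normalize(counts):
    """Turn per-class counts into a probability distribution (assumes a positive total)."""
    total = sum(counts)
    return [c / total for c in counts]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence with equal weights (an assumption of this sketch)."""
    m = [0.5 * pi + 0.5 * qi for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def greedy_group(word_class_counts, n_groups):
    """Repeatedly merge the two groups whose class distributions are closest.

    word_class_counts maps each word to its list of per-class occurrence counts.
    Stopping at a fixed n_groups is a placeholder for a model selection criterion.
    """
    groups = {(word,): list(counts) for word, counts in word_class_counts.items()}
    while len(groups) > n_groups:
        best = None
        for a, b in combinations(groups, 2):
            d = js_divergence(normalize(groups[a]), normalize(groups[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merged group: union of the words, sum of the class counts.
        merged = [x + y for x, y in zip(groups.pop(a), groups.pop(b))]
        groups[a + b] = merged
    return groups
```

For example, calling `greedy_group` on a toy table with two classes and the words "hockey", "puck", "orbit" and "nasa" (made-up counts) with `n_groups=2` puts words with similar class distributions into the same group.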
Document type:
Conference paper
A.J. Feelders. Tenth Belgian-Dutch Conference on Machine Learning (Benelearn '00), Dec 2000, Tilburg, Netherlands. 2000


https://hal.inria.fr/inria-00321520
Contributor: Jakob Verbeek
Submitted on: Wednesday, 16 February 2011 - 16:59:44
Last modified on: Monday, 25 September 2017 - 10:08:04
Document(s) archived on: Tuesday, 17 May 2011 - 02:39:31

Files

verbeek00bnl.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00321520, version 1

Citation

Jakob Verbeek. Supervised feature extraction for text categorization. A.J. Feelders. Tenth Belgian-Dutch Conference on Machine Learning (Benelearn '00), Dec 2000, Tilburg, Netherlands. 2000. 〈inria-00321520〉
