Conference paper. Year: 2000

Supervised feature extraction for text categorization

Jakob Verbeek

Abstract

This paper concerns finding the 'optimal' number of (non-overlapping) word groups for text classification. We present a method that uses a set of pre-classified texts to select which words to cluster into word groups and how many such groups to use. The method involves a 'greedy' search through the space of possible word groups. Words are grouped according to the Jensen-Shannon divergence between their corresponding distributions over the classes. The criterion for deciding how many word groups to use is based on Rissanen's MDL Principle. We present empirical results indicating that the proposed method performs well. Furthermore, the proposed method outperforms cross-validation in the sense that far fewer word groups are selected while prediction accuracy is only slightly worse. For the experiments we used a subset of the '20 Newsgroups' data set.
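
The grouping step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the count-based weighting of the Jensen-Shannon divergence, the function names (`js_divergence`, `closest_pair`, `merge`) and the example words and numbers are assumptions made for illustration, and the MDL-based choice of the number of groups is not reproduced here.

```python
import math
from itertools import combinations

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q, wp=0.5, wq=0.5):
    """Weighted Jensen-Shannon divergence between two class distributions.

    Assumes wp > 0, wq > 0 and wp + wq == 1.
    """
    mixture = [wp * pi + wq * qi for pi, qi in zip(p, q)]
    return wp * kl(p, mixture) + wq * kl(q, mixture)

def closest_pair(groups):
    """Return the two group names whose class distributions are closest.

    groups maps a name to (occurrence count, distribution over classes).
    Weighting by occurrence count is an assumption of this sketch.
    """
    def cost(a, b):
        na, pa = groups[a]
        nb, pb = groups[b]
        total = na + nb
        return js_divergence(pa, pb, na / total, nb / total)
    return min(combinations(sorted(groups), 2), key=lambda pair: cost(*pair))

def merge(groups, a, b):
    """One greedy step: replace groups a and b by their count-weighted union."""
    na, pa = groups.pop(a)
    nb, pb = groups.pop(b)
    total = na + nb
    merged = [(na * x + nb * y) / total for x, y in zip(pa, pb)]
    groups[a + "+" + b] = (total, merged)

# Hypothetical words with made-up counts and two-class distributions.
groups = {
    "goalie": (10, [0.9, 0.1]),
    "puck": (30, [0.8, 0.2]),
    "orbit": (20, [0.1, 0.9]),
}
a, b = closest_pair(groups)
merge(groups, a, b)
```

Repeating `closest_pair` and `merge` until a stopping rule fires gives a greedy search over groupings; in the paper that stopping rule is based on the MDL Principle, which this sketch leaves out.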
Main file
verbeek00bnl.pdf (206.76 KB)
Ver00a.png (63.98 KB)
Origin: Files produced by the author(s)
Format: Figure, Image

Dates and versions

inria-00321520, version 1 (16-02-2011)

Identifiers

  • HAL Id: inria-00321520, version 1

Cite

Jakob Verbeek. Supervised feature extraction for text categorization. Tenth Belgian-Dutch Conference on Machine Learning (Benelearn '00), Dec 2000, Tilburg, Netherlands. ⟨inria-00321520⟩