inria-00321519, version 1

An information theoretic approach to finding word groups for text classification

Jakob Verbeek

(2000). Master thesis 2, University of Amsterdam

Abstract: This thesis concerns finding the 'optimal' number of (non-overlapping) word groups for text classification. We present a method to select which words to cluster in word groups and how many such word groups to use on the basis of a set of pre-classified texts. The method involves a greedy search through the space of possible word groups. The criterion that guides the search through this space is based on 'mutual information' and is known as the 'Jensen-Shannon divergence'. The criterion for deciding how many word groups to use is based on Rissanen's MDL principle. We present empirical results that indicate that the proposed method performs well at its task. The prediction model is based on the Naive Bayes model, and the experiments use a subset of the 20 Newsgroups data set.
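
As a rough illustration of the kind of greedy search the abstract describes, the sketch below repeatedly merges the two word groups whose class distributions are closest under a weighted Jensen-Shannon divergence. It is a minimal sketch under stated assumptions, not the procedure from the thesis: the function names (`js_divergence`, `merge_cost`, `greedy_merge`), the representation of a group as a `(prior, class_distribution)` pair, and the toy words are all hypothetical, and the MDL-based choice of the number of groups is not shown (the target number `k` is simply passed in).

```python
import math
from itertools import combinations


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, wp=0.5, wq=0.5):
    """Weighted Jensen-Shannon divergence between distributions p and q (weights sum to 1)."""
    m = [wp * pi + wq * qi for pi, qi in zip(p, q)]
    return wp * kl(p, m) + wq * kl(q, m)


def merge_cost(a, b):
    """Cost of merging two groups, each given as (prior, class_distribution).

    Assumption: the cost is the combined prior times the weighted JS divergence
    of the two class distributions, with positive priors.
    """
    (pa, da), (pb, db) = a, b
    total = pa + pb
    return total * js_divergence(da, db, pa / total, pb / total)


def greedy_merge(groups, k):
    """Greedily merge the cheapest pair of groups until only k groups remain."""
    groups = dict(groups)
    while len(groups) > k:
        x, y = min(
            combinations(groups, 2),
            key=lambda pair: merge_cost(groups[pair[0]], groups[pair[1]]),
        )
        (px, dx), (py, dy) = groups.pop(x), groups.pop(y)
        total = px + py
        # The merged group's class distribution is the prior-weighted mixture.
        merged = [(px * a + py * b) / total for a, b in zip(dx, dy)]
        groups[x + "+" + y] = (total, merged)
    return groups


# Hypothetical toy data: word -> (prior p(w), class distribution p(c | w)) for two classes.
toy_words = {
    "hockey": (0.25, [0.9, 0.1]),
    "puck": (0.25, [0.8, 0.2]),
    "orbit": (0.25, [0.1, 0.9]),
    "nasa": (0.25, [0.2, 0.8]),
}
word_groups = greedy_merge(toy_words, k=2)
```

In this toy setting the words with similar class distributions end up in the same group. Each step of the sketch compares all pairs of groups, so it is only practical for small vocabularies.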

  • Domain: Computer Science/Learning
  • Keywords: text classification – MDL principle – feature extraction
 
  • inria-00321519, version 1
  • oai:hal.inria.fr:inria-00321519
  • Submitted on: Wednesday, 16 February 2011 17:00:56
  • Updated on: Thursday, 9 June 2011 11:03:36