An information theoretic approach to finding word groups for text classification

Jakob Verbeek

Master's thesis, 2000

ORCID : 0000-0003-1419-1816
IdRef : 180998463

Instituut voor Informatica

Abstract

This thesis concerns finding the 'optimal' number of (non-overlapping) word groups for text classification. We present a method that uses a set of pre-classified texts to select which words to cluster into word groups and how many such groups to use. The method performs a greedy search through the space of possible word groups. The criterion that guides this search is based on 'mutual information' and is known as the 'Jensen-Shannon divergence'. The criterion for deciding how many word groups to use is based on Rissanen's Minimum Description Length (MDL) principle. We present empirical results indicating that the proposed method performs well at its task. The prediction model is based on the Naive Bayes model, and the experiments use a subset of the 20 Newsgroups data set.
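As an illustration only, the following minimal sketch shows one greedy merge step driven by the Jensen-Shannon divergence. It is not taken from the thesis: the function names, the data layout (each word group stored with an occurrence count and an estimate of P(class | group)), and the choice to scale the divergence by the combined count of the two groups are all assumptions made for this example. The MDL-based choice of the number of groups is not shown.

```python
import math
from itertools import combinations


def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q, wp=0.5, wq=0.5):
    """Weighted Jensen-Shannon divergence between distributions p and q."""
    mixture = [wp * a + wq * b for a, b in zip(p, q)]
    return entropy(mixture) - wp * entropy(p) - wq * entropy(q)


def merge_closest(groups):
    """Merge the pair of word groups that is cheapest to merge.

    groups maps a tuple of words to (count, P(class | group)).
    The cost of a pair is its count-weighted Jensen-Shannon divergence
    scaled by the combined count (an assumption of this sketch).
    """
    def cost(pair):
        (na, pa), (nb, pb) = groups[pair[0]], groups[pair[1]]
        total = na + nb
        return total * js_divergence(pa, pb, na / total, nb / total)

    a, b = min(combinations(groups, 2), key=cost)
    (na, pa), (nb, pb) = groups.pop(a), groups.pop(b)
    total = na + nb
    # The merged group's class distribution is the count-weighted average.
    merged = [(na * x + nb * y) / total for x, y in zip(pa, pb)]
    groups[a + b] = (total, merged)
    return groups


# Hypothetical toy data: two classes, three single-word groups.
toy_groups = {
    ("hockey",): (40, [0.9, 0.1]),
    ("puck",): (10, [0.8, 0.2]),
    ("orbit",): (30, [0.1, 0.9]),
}
toy_groups = merge_closest(toy_groups)
```

On this toy input, "hockey" and "puck" have similar class distributions and are merged into one group, while "orbit" stays separate. Repeating the step reduces the number of groups by one each time, so a separate criterion is needed to decide when to stop.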

Keywords

text classification, MDL principle, feature extraction

Domains

Machine Learning [cs.LG]

Main file

verbeek00msc.pdf (572.75 KB)

Ver00.png (28.77 KB)

Origin : Files produced by the author(s)

Format : Figure, Image

https://inria.hal.science/inria-00321519

Submitted on: Wednesday, February 16, 2011, 5:00:56 PM

Last modification on: Monday, September 25, 2017, 10:08:04 AM

Long-term archiving on: Tuesday, May 17, 2011, 2:38:35 AM

Dates and versions

inria-00321519, version 1 (16-02-2011)

Identifiers

HAL Id: inria-00321519, version 1

Cite

Jakob Verbeek. An information theoretic approach to finding word groups for text classification. Machine Learning [cs.LG]. 2000. ⟨inria-00321519⟩
