Clustering and Model Selection via Penalized Likelihood for Different-sized Categorical Data Vectors

Esther Derman; Erwan Le Pennec

Pré-Publication, Document De Travail Année : 2017

Clustering and Model Selection via Penalized Likelihood for Different-sized Categorical Data Vectors

(1) , (1, 2)

1
2

Esther Derman

Fonction : Auteur
PersonId : 1015947

Centre de Mathématiques Appliquées - Ecole Polytechnique

Erwan Le Pennec

Fonction : Auteur
PersonId : 1259955
IdHAL : erwan-le-pennec
ORCID : 0000-0002-7988-7999

Centre de Mathématiques Appliquées - Ecole Polytechnique

Modélisation en pharmacologie de population

Résumé

In this study, we consider unsupervised clustering of categorical vectors that can be of different size using mixture. We use likelihood maximization to estimate the parameters of the underlying mixture model and a penalization technique to select the number of mixture components. Regardless of the true distribution that generated the data, we show that an explicit penalty, known up to a multiplicative constant, leads to a non-asymptotic oracle inequality with the Kullback-Leibler divergence on the two sides of the inequality. This theoretical result is illustrated by a document clustering application. To this aim a novel robust expectation-maximization algorithm is proposed to estimate the mixture parameters that best represent the different topics. Slope heuristics are used to calibrate the penalty and to select a number of clusters.

Mots clés

Document clustering Expectation-maximization algorithm Multinomial mix- ture Model selection Penalized likelihood Slope heuristics

Domaines

Statistiques [math.ST]

Fichier principal

CategoricalVectorClustering.pdf (547.83 Ko)

Origine : Fichiers produits par l'(les) auteur(s)

Erwan Le Pennec : Connectez-vous pour contacter le contributeur

https://inria.hal.science/hal-01583692

Soumis le : jeudi 7 septembre 2017-16:51:16

Dernière modification le : lundi 22 avril 2024-10:19:34

Dates et versions

hal-01583692 , version 1 (07-09-2017)

Identifiants

HAL Id : hal-01583692 , version 1
ARXIV : 1709.02294

Citer

Esther Derman, Erwan Le Pennec. Clustering and Model Selection via Penalized Likelihood for Different-sized Categorical Data Vectors. 2017. ⟨hal-01583692⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

X CNRS INRIA X-CMAP X-DEP-MATHA CMAP INRIA2 UNIV-PARIS-SACLAY GS-COMPUTER-SCIENCE

261 Consultations

272 Téléchargements

Clustering and Model Selection via Penalized Likelihood for Different-sized Categorical Data Vectors

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Altmetric

Partager