Combining Mixture Components for Clustering

Jean-Patrick Baudry; Adrian E. Raftery; Gilles Celeux; Kenneth Lo; Raphael Gottardo

Journal article, Journal of Computational and Graphical Statistics, Year: 2010


Jean-Patrick Baudry

Role: Corresponding author
PersonId: 853690

Laboratoire de Mathématiques d'Orsay

Adrian E. Raftery

Role: Author

Department of Statistics

Gilles Celeux

Role: Author
PersonId: 833415
ORCID: 0000-0002-7221-6594
IdRef: 02951598X

Model selection in statistical learning

Kenneth Lo

Role: Author

Department of Statistics [Vancouver]

Raphael Gottardo

Role: Author

Department of Statistics [Vancouver]

Abstract

Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K; these clusterings can be compared on substantive grounds. We illustrate the method with simulated data and a flow cytometry dataset.
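The combining step described in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian mixture with K components has already been fitted and selected by BIC, and that its posterior membership probabilities are available as a matrix `tau` (one row per observation, one column per component). The greedy rule used here, merging at each step the pair of components that yields the lowest entropy of the resulting soft clustering, is one plausible reading of "an entropy criterion"; the function names and the toy values are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def entropy(tau):
    """Entropy of a soft clustering: -sum_i sum_k t_ik * log(t_ik)."""
    return -sum(t * math.log(t) for row in tau for t in row if t > 0.0)


def merge_columns(tau, a, b):
    """Combine components a and b by adding their membership probabilities.

    The merged component is appended as the last column.
    """
    merged = []
    for row in tau:
        new_row = [t for k, t in enumerate(row) if k not in (a, b)]
        new_row.append(row[a] + row[b])
        merged.append(new_row)
    return merged


def combine_hierarchically(tau):
    """Greedy merging: at each step keep the pairwise merge with the lowest entropy.

    Returns a dict mapping each number of clusters (K down to 1)
    to the corresponding soft assignment matrix.
    """
    solutions = {len(tau[0]): tau}
    current = tau
    while len(current[0]) > 1:
        pairs = combinations(range(len(current[0])), 2)
        current = min((merge_columns(current, a, b) for a, b in pairs), key=entropy)
        solutions[len(current[0])] = current
    return solutions


# Toy posterior matrix (hypothetical values): components 0 and 1 overlap
# heavily, while component 2 is well separated from both.
tau = [
    [0.55, 0.45, 0.00],
    [0.50, 0.50, 0.00],
    [0.40, 0.60, 0.00],
    [0.01, 0.01, 0.98],
    [0.00, 0.02, 0.98],
]
solutions = combine_hierarchically(tau)
```

In this toy case the first merge combines the two overlapping components, so `solutions[2]` keeps the well-separated component as its own cluster. Each entry of `solutions` is a soft clustering for one number of clusters, which is what allows the candidate clusterings to be compared on substantive grounds.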

Keywords

BIC; entropy; flow cytometry; mixture model; model-based clustering; multivariate normal distribution

Domains

Statistics [math.ST]; Theory [stat.TH]

Main file

RR-6644.pdf (1.38 MB)

Origin: Files produced by the author(s)

Contributor: Gilles Celeux

https://inria.hal.science/inria-00321090

Submitted on: Friday, September 12, 2008, 12:26:21

Last modified on: Thursday, November 23, 2023, 10:49:15

Long-term archiving on: Friday, June 4, 2010, 11:17:10

Dates and versions

inria-00321090, version 1 (12-09-2008)

Identifiers

HAL Id: inria-00321090, version 1

Cite

Jean-Patrick Baudry, Adrian E. Raftery, Gilles Celeux, Kenneth Lo, Raphael Gottardo. Combining Mixture Components for Clustering. Journal of Computational and Graphical Statistics, 2010, 19, pp.332-353. ⟨inria-00321090⟩

Collections

CNRS INRIA INSMI LM-ORSAY INRIA2 UNIV-PARIS-SACLAY GS-MATHEMATIQUES
