Combining Mixture Components for Clustering

Jean-Patrick Baudry 1,*, Adrian E. Raftery 2, Gilles Celeux 3, Kenneth Lo 4, Raphael Gottardo 4
* Corresponding author
3 SELECT - Model selection in statistical learning
Inria Saclay - Ile de France, LMO - Laboratoire de Mathématiques d'Orsay, CNRS - Centre National de la Recherche Scientifique : UMR
Abstract: Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K; these clusterings can be compared on substantive grounds. We illustrate the method with simulated data and a flow cytometry dataset.
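The combining step described in the abstract can be illustrated with a minimal sketch. It assumes that the K-component Gaussian mixture has already been fitted and K chosen by BIC, so that each observation comes with posterior membership probabilities over the K components. The abstract does not spell out the entropy criterion; the sketch takes one natural reading, namely that at each step the two components whose combination gives the lowest entropy of the resulting soft clustering are merged, with the membership probability of the combined cluster being the sum of the two original ones. The function names and the toy membership matrix are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def entropy(tau):
    """Entropy of a soft clustering: -sum_i sum_k tau[i][k] * log(tau[i][k])."""
    return -sum(p * math.log(p) for row in tau for p in row if p > 0.0)


def merge_columns(tau, a, b):
    """Return a new membership matrix with components a and b combined."""
    merged = []
    for row in tau:
        new_row = [p for k, p in enumerate(row) if k not in (a, b)]
        new_row.append(row[a] + row[b])  # membership in the combined cluster
        merged.append(new_row)
    return merged


def combine_hierarchically(tau):
    """Greedily merge components until a single cluster remains.

    tau: list of rows; each row holds one observation's posterior
    probabilities over the K fitted components (assumed given).
    Returns a list of (number_of_clusters, membership_matrix, entropy),
    one entry for each number of clusters from K down to 1.
    """
    path = [(len(tau[0]), tau, entropy(tau))]
    while len(tau[0]) > 1:
        pairs = combinations(range(len(tau[0])), 2)
        candidates = (merge_columns(tau, a, b) for a, b in pairs)
        # Assumed criterion: keep the merge with the lowest resulting entropy,
        # i.e. the largest entropy decrease.
        tau = min(candidates, key=entropy)
        path.append((len(tau[0]), tau, entropy(tau)))
    return path


# Toy example (made-up numbers): components 0 and 1 overlap heavily, as when
# one non-Gaussian cluster is represented by two Gaussian components;
# component 2 is well separated.
tau = [
    [0.5, 0.5, 0.0],
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
]
path = combine_hierarchically(tau)
for n_clusters, _, ent in path:
    print(n_clusters, round(ent, 3))
```

In this toy case the first merge combines the two overlapping components, which removes most of the entropy. The returned path holds one soft clustering for each number of clusters from K down to 1, and these are the candidates that would then be compared on substantive grounds.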
Document type: Journal article
Journal of Computational and Graphical Statistics, Taylor & Francis, 2010, 19, pp.332-353

https://hal.inria.fr/inria-00321090
Contributor: Gilles Celeux
Submitted on: Friday, September 12, 2008 - 12:26:21
Last modified on: Thursday, January 11, 2018 - 06:22:14
Document(s) archived on: Friday, June 4, 2010 - 11:17:10

File

RR-6644.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00321090, version 1


Citation

Jean-Patrick Baudry, Adrian E. Raftery, Gilles Celeux, Kenneth Lo, Raphael Gottardo. Combining Mixture Components for Clustering. Journal of Computational and Graphical Statistics, Taylor & Francis, 2010, 19, pp.332-353. 〈inria-00321090〉


Metrics

Record views: 285
File downloads: 238