Journal articles

Combining Mixture Components for Clustering

Abstract : Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K; these clusterings can be compared on substantive grounds. We illustrate the method with simulated data and a flow cytometry dataset.
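The hierarchical combining step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian mixture with K components has already been fitted (with K chosen by BIC) and that the n-by-K matrix of posterior membership probabilities, here called `tau`, is available. The function names `xlogx`, `entropy`, and `merge_components` are hypothetical. The abstract does not spell out the entropy criterion, so the greedy rule used here (at each step, merge the pair of components whose combination most reduces the entropy of the soft assignment) is an assumption about what it means.

```python
import numpy as np


def xlogx(p):
    """Elementwise p * log(p), using the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0 where p is zero
    return np.where(p > 0, p * np.log(safe), 0.0)


def entropy(tau):
    """Entropy of a soft clustering given an (n, k) posterior matrix."""
    return -xlogx(tau).sum()


def merge_components(tau):
    """Greedily combine components, one pair per step.

    Returns a list of posterior matrices with K, K-1, ..., 1 columns.
    Assumed criterion: merge the pair giving the largest entropy decrease.
    """
    tau = np.asarray(tau, dtype=float)
    path = [tau]
    while tau.shape[1] > 1:
        k = tau.shape[1]
        best = None
        for a in range(k):
            for b in range(a + 1, k):
                merged = tau[:, a] + tau[:, b]
                # Entropy before the merge minus entropy after it.
                gain = (xlogx(merged) - xlogx(tau[:, a]) - xlogx(tau[:, b])).sum()
                if best is None or gain > best[0]:
                    best = (gain, a, b)
        _, a, b = best
        keep = [j for j in range(k) if j not in (a, b)]
        # Untouched components first, the combined cluster last.
        tau = np.column_stack([tau[:, keep], tau[:, a] + tau[:, b]])
        path.append(tau)
    return path


# Toy posteriors: components 0 and 1 overlap, component 2 is well separated.
tau = np.array([
    [0.5, 0.5, 0.0],
    [0.6, 0.4, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])
path = merge_components(tau)
```

Each entry of `path` is one soft clustering, so the list contains exactly one solution for every number of clusters from K down to 1. These solutions can then be compared on substantive grounds, as the abstract suggests.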

https://hal.inria.fr/inria-00321090
Contributor : Gilles Celeux
Submitted on : Friday, September 12, 2008 - 12:26:21 PM
Last modification on : Tuesday, July 6, 2021 - 3:39:47 AM
Long-term archiving on : Friday, June 4, 2010 - 11:17:10 AM

File

RR-6644.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00321090, version 1

Citation

Jean-Patrick Baudry, Adrian E. Raftery, Gilles Celeux, Kenneth Lo, Raphael Gottardo. Combining Mixture Components for Clustering. Journal of Computational and Graphical Statistics, Taylor & Francis, 2010, 19, pp.332-353. ⟨inria-00321090⟩
