Exact and Monte Carlo Calculations of Integrated Likelihoods for the Latent Class Model

Christophe Biernacki 1, Gilles Celeux 2, Gérard Govaert 3
2 SELECT - Model selection in statistical learning
Inria Saclay - Ile de France, LMO - Laboratoire de Mathématiques d'Orsay, CNRS - Centre National de la Recherche Scientifique : UMR
Abstract: The latent class model, or multivariate multinomial mixture, is a powerful approach for clustering categorical data. To represent heterogeneous populations, it assumes that the variables are conditionally independent given the latent class to which an object belongs. In this paper, we exploit the fact that a fully Bayesian analysis with Jeffreys noninformative prior distributions involves no technical difficulty to propose an exact expression of the integrated complete-data likelihood, which is known to be a meaningful model selection criterion from a clustering perspective. Similarly, a Monte Carlo approximation of the integrated observed-data likelihood can be obtained in two steps: an exact integration over the parameters is followed by an approximation of the sum over all possible partitions through either a frequentist or a Bayesian importance sampling strategy. The exact and approximate criteria are then compared experimentally with their respective standard asymptotic BIC approximations for choosing the number of mixture components. Numerical experiments on simulated data and a biological example show that the asymptotic criteria are usually dramatically more conservative than the non-asymptotic criteria presented here, not only for moderate sample sizes, as expected, but also for quite large sample sizes. It appears that standard asymptotic criteria could often fail to select some interesting structures present in the data. This is also an opportunity to highlight the deep difference in purpose between the integrated complete-data and observed-data likelihoods: the integrated complete-data likelihood takes a cluster analysis view and favors well-separated clusters, implying some robustness against model misspecification, while the integrated observed-data likelihood takes a density estimation view and is expected to provide a consistent estimate of the distribution of the data.
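To illustrate why the integrated complete-data likelihood has a closed form, here is a minimal Python sketch. It is not the authors' code. It assumes symmetric Dirichlet priors with every hyperparameter equal to 1/2 (a usual form of the Jeffreys prior for multinomial parameters) on both the mixing proportions and the per-class, per-variable category probabilities. The function names, the argument layout and the integer encoding of categories are all hypothetical choices made for this example.

```python
from math import lgamma


def log_dirichlet_marginal(counts, a=0.5):
    """Log of the integral of prod_h theta_h ** counts[h] over theta,
    under a symmetric Dirichlet(a, ..., a) prior (conjugate closed form)."""
    m = len(counts)
    n = sum(counts)
    return (lgamma(m * a) - m * lgamma(a)
            + sum(lgamma(c + a) for c in counts)
            - lgamma(n + m * a))


def log_integrated_complete_likelihood(x, z, n_classes, n_levels, a=0.5):
    """Log integrated complete-data likelihood of a latent class model.

    x         : list of observations, each a list of category indices
                (x[i][j] in range(n_levels[j])); assumed encoding.
    z         : list of class labels, z[i] in range(n_classes).
    n_levels  : number of categories of each variable.
    a         : common Dirichlet hyperparameter (1/2 assumed for Jeffreys).
    """
    # Term for the mixing proportions: depends only on the class sizes.
    class_counts = [0] * n_classes
    for label in z:
        class_counts[label] += 1
    total = log_dirichlet_marginal(class_counts, a)

    # Conditional independence: one independent term per (class, variable).
    for k in range(n_classes):
        for j, m_j in enumerate(n_levels):
            counts = [0] * m_j
            for row, label in zip(x, z):
                if label == k:
                    counts[row[j]] += 1
            total += log_dirichlet_marginal(counts, a)
    return total
```

Under these assumptions the value is exact for a given partition `z`. The integrated observed-data likelihood would additionally require summing over all partitions, which is the part the abstract says is approximated by importance sampling.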
Document type:
Report
[Research Report] RR-6609, INRIA. 2008, pp.25

Cited literature [14 references]

https://hal.inria.fr/inria-00310137
Contributor: Gilles Celeux
Submitted on: Thursday, August 7, 2008 - 18:45:12
Last modified on: Thursday, January 11, 2018 - 06:26:36
Document(s) archived on: Thursday, June 3, 2010 - 18:05:38

Files

RR6609.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00310137, version 1


Citation

Christophe Biernacki, Gilles Celeux, Gérard Govaert. Exact and Monte Carlo Calculations of Integrated Likelihoods for the Latent Class Model. [Research Report] RR-6609, INRIA. 2008, pp.25. 〈inria-00310137〉


Metrics

Record views

565

File downloads

192