Accelerated EM-based clustering of large data sets

Jakob Verbeek; Jan Nunnink; Nikos Vlassis

doi:10.1007/s10618-005-0033-3

Journal article, Data Mining and Knowledge Discovery, Year: 2006

Jakob Verbeek

Role: Author
PersonId: 10676
IdHAL: verbeek
ORCID: 0000-0003-1419-1816
IdRef: 180998463

Learning and recognition in vision

Jan Nunnink

Role: Author

Instituut voor Informatica

Nikos Vlassis

Role: Author
PersonId: 853678

Instituut voor Informatica

Abstract

Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
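The alternating scheme described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from this page: the data are one-dimensional, the grouping is a fixed partition of the sorted points into consecutive chunks (the keywords suggest the paper itself uses kd-trees), and the names `group_stats`, `avg_log_gauss`, and `grouped_em_step` are hypothetical. Each group shares one responsibility vector, and the parameters are re-estimated from per-group averages only.

```python
import math

def group_stats(xs, group_size):
    """Partition sorted 1-D data into consecutive groups and return, per
    group, (count, mean of x, mean of x^2): the averaged sufficient statistics."""
    xs = sorted(xs)
    stats = []
    for i in range(0, len(xs), group_size):
        cell = xs[i:i + group_size]
        n = len(cell)
        stats.append((n, sum(cell) / n, sum(x * x for x in cell) / n))
    return stats

def avg_log_gauss(m1, m2, mu, var):
    """Average of log N(x | mu, var) over a group with <x> = m1, <x^2> = m2."""
    mean_sq_dev = m2 - 2.0 * mu * m1 + mu * mu  # equals <(x - mu)^2>
    return -0.5 * (math.log(2.0 * math.pi * var) + mean_sq_dev / var)

def grouped_em_step(stats, weights, mus, variances, var_floor=1e-6):
    """One step on the grouped lower bound. Returns the updated parameters
    and the value of the bound at the incoming parameters."""
    k = len(weights)
    n_total = sum(n for n, _, _ in stats)
    resp, bound = [], 0.0
    # (i) One shared responsibility vector per group:
    #     q_A(s) proportional to p(s) * exp(<log p(x|s)>_A).
    for n, m1, m2 in stats:
        logs = [math.log(weights[s]) + avg_log_gauss(m1, m2, mus[s], variances[s])
                for s in range(k)]
        top = max(logs)  # log-sum-exp for numerical stability
        log_norm = top + math.log(sum(math.exp(v - top) for v in logs))
        resp.append([math.exp(v - log_norm) for v in logs])
        bound += n * log_norm  # with optimal q_A the bound is sum_A |A| log Z_A
    # (ii) Re-estimate parameters from group averages weighted by |A| * q_A(s).
    new_w, new_mu, new_var = [], [], []
    for s in range(k):
        mass = sum(n * q[s] for (n, _, _), q in zip(stats, resp))
        mass = max(mass, 1e-12)  # numerical guard (assumption of this sketch)
        mu = sum(n * q[s] * m1 for (n, m1, _), q in zip(stats, resp)) / mass
        ex2 = sum(n * q[s] * m2 for (n, _, m2), q in zip(stats, resp)) / mass
        new_w.append(mass / n_total)
        new_mu.append(mu)
        new_var.append(max(ex2 - mu * mu, var_floor))
    return new_w, new_mu, new_var, bound

# Toy usage: two well-separated clusters, 40 points summarized by 10 groups.
data = [i * 0.1 - 1.0 for i in range(20)] + [i * 0.1 + 9.0 for i in range(20)]
stats = group_stats(data, group_size=4)
weights, mus, variances = [0.5, 0.5], [-1.0, 11.0], [1.0, 1.0]
bounds = []
for _ in range(10):
    weights, mus, variances, bound = grouped_em_step(stats, weights, mus, variances)
    bounds.append(bound)
```

Each step iterates over groups rather than individual points, so its cost in this sketch scales with the number of groups. As long as the variance floor and the mass guard are not triggered, the values recorded in `bounds` should not decrease from one step to the next, which mirrors the convergence property claimed in the abstract.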

Keywords

Gaussian mixtures; EM algorithm; Free energy; kd-trees; Large data sets

Domains

Machine Learning [cs.LG]

Main file

Verbeek04dmkd_rev.pdf (243.23 KB)

VNV06.png (26.83 KB)

Origin: Files produced by the author(s)

Format: Figure, Image

Contributor: Jakob Verbeek

https://inria.hal.science/inria-00321022

Submitted on: Monday, April 11, 2011, 11:19:30

Last modified on: Thursday, April 4, 2024, 21:20:04

Long-term archiving on: Tuesday, July 12, 2011, 02:40:45

Dates and versions

inria-00321022, version 1 (25-01-2011)

inria-00321022, version 2 (11-04-2011)

Identifiers

HAL Id: inria-00321022, version 2
DOI: 10.1007/s10618-005-0033-3

Cite

Jakob Verbeek, Jan Nunnink, Nikos Vlassis. Accelerated EM-based clustering of large data sets. Data Mining and Knowledge Discovery, 2006, 13 (3), pp.291-307. ⟨10.1007/s10618-005-0033-3⟩. ⟨inria-00321022v2⟩

Collections

UNIV-RENNES1 UGA IMAG CNRS INRIA IRISA INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UR1-MATH-NUM

365 views

607 downloads
