Accelerated EM-based clustering of large data sets

Jakob Verbeek 1 Jan Nunnink 2 Nikos Vlassis 2
1 LEAR - Learning and recognition in vision
GRAVIR - IMAG - Graphisme, Vision et Robotique, Inria Grenoble - Rhône-Alpes, CNRS - Centre National de la Recherche Scientifique : FR71
Abstract : Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
Type de document :
Article dans une revue
Data Mining and Knowledge Discovery, Springer Verlag, 2006, Data Mining and Knowledge Discovery, 13 (3), pp.291-307. 〈http://www.springerlink.com/content/a166k37132336532/〉. 〈10.1007/s10618-005-0033-3〉
Liste complète des métadonnées

Littérature citée [26 références]  Voir  Masquer  Télécharger


https://hal.inria.fr/inria-00321022
Contributeur : Jakob Verbeek <>
Soumis le : lundi 11 avril 2011 - 11:19:30
Dernière modification le : mercredi 11 avril 2018 - 01:59:48
Document(s) archivé(s) le : mardi 12 juillet 2011 - 02:40:45

Fichiers

Verbeek04dmkd_rev.pdf
Fichiers produits par l'(les) auteur(s)

Identifiants

Collections

Citation

Jakob Verbeek, Jan Nunnink, Nikos Vlassis. Accelerated EM-based clustering of large data sets. Data Mining and Knowledge Discovery, Springer Verlag, 2006, Data Mining and Knowledge Discovery, 13 (3), pp.291-307. 〈http://www.springerlink.com/content/a166k37132336532/〉. 〈10.1007/s10618-005-0033-3〉. 〈inria-00321022v2〉

Partager

Métriques

Consultations de la notice

416

Téléchargements de fichiers

398