Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions

Sébastien Bubeck 1 Ulrike Von Luxburg 2
1 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract : Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate ``small'' function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce ``nearest neighbor clustering''. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions.
Type de document :
Pré-publication, Document de travail
2011
Liste complète des métadonnées

Littérature citée [29 références]  Voir  Masquer  Télécharger

https://hal.inria.fr/inria-00185780
Contributeur : Sébastien Bubeck <>
Soumis le : lundi 7 mars 2011 - 11:34:50
Dernière modification le : jeudi 11 janvier 2018 - 06:22:13
Document(s) archivé(s) le : samedi 3 décembre 2016 - 20:46:22

Fichier

bubeck09a.pdf
Fichiers éditeurs autorisés sur une archive ouverte

Identifiants

  • HAL Id : inria-00185780, version 3

Collections

Citation

Sébastien Bubeck, Ulrike Von Luxburg. Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions. 2011. 〈inria-00185780v3〉

Partager

Métriques

Consultations de la notice

219

Téléchargements de fichiers

194