Overfitting of Clustering and how to avoid it

Sébastien Bubeck; Ulrike von Luxburg

Preprint, Working Paper. Year: 2007


Sébastien Bubeck

Role: Author
PersonId : 844095

Sequential Learning

Ulrike von Luxburg

Role: Author

Max Planck Institute for Biological Cybernetics

Abstract

Clustering is often formulated as a discrete optimization problem: the objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting, where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal and can instead lead to overfitting. We construct examples that provably exhibit this behavior. As in supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriately "small" function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space, we introduce "nearest neighbor clustering". Like the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm for minimizing arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions.
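
The abstract does not spell out how nearest neighbor clustering works, so the following Python sketch is only one plausible reading of "clustering with a restricted function space". It assumes that a small set of seed points is picked from the sample, that every data point inherits the cluster label of its nearest seed, and that the objective is minimized by brute force over all labelings of the seeds. The function names `nearest_neighbor_clustering` and `within_cluster_ss`, the choice of the within-cluster sum of squares as objective, and the use of Euclidean distance are assumptions made for illustration, not details taken from this record.

```python
import itertools

import numpy as np


def within_cluster_ss(X, labels, k):
    """Within-cluster sum of squared distances to the cluster mean (example objective)."""
    total = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total


def nearest_neighbor_clustering(X, k, seed_idx, objective):
    """Minimize `objective` over partitions that are constant on nearest-seed cells.

    X         : (n, d) array of data points
    k         : number of clusters
    seed_idx  : indices of the seed points inside X (hypothetical parameter)
    objective : callable (X, labels, k) -> float, smaller is better
    """
    seeds = X[np.asarray(seed_idx)]
    m = len(seeds)
    # Index of the nearest seed for every data point (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
    cell = np.argmin(dists, axis=1)

    best_labels, best_value = None, np.inf
    # Restricted function class: only k**m candidate partitions are examined,
    # instead of all partitions of the n sample points.
    for seed_labels in itertools.product(range(k), repeat=m):
        if len(set(seed_labels)) < k:
            continue  # skip labelings that leave a cluster unused
        labels = np.asarray(seed_labels)[cell]
        value = objective(X, labels, k)
        if value < best_value:
            best_labels, best_value = labels, value
    return best_labels, best_value
```

In use, the seeds could be drawn at random, for example `seed_idx = np.random.default_rng(0).choice(len(X), size=4, replace=False)`. The enumeration costs `k**m` objective evaluations for `m` seeds, so the number of seeds has to stay small; that same restriction is what limits the size of the candidate set in this sketch.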

Subject areas

Statistics [math.ST]; Theory [stat.TH]; Machine Learning [cs.LG]

Main file

BubeckLuxburg.pdf (378.42 KB)

Origin: Files produced by the author(s)


https://inria.hal.science/inria-00185780

Submitted on: Wednesday, November 7, 2007, 09:27:35

Last modified on: Friday, March 24, 2023, 14:52:49

Long-term archiving on: Monday, September 24, 2012, 14:56:10

Dates and versions

inria-00185780 , version 1 (07-11-2007)

inria-00185780 , version 2 (19-11-2007)

inria-00185780 , version 3 (07-03-2011)

Identifiers

HAL Id : inria-00185780 , version 1

Cite

Sébastien Bubeck, Ulrike von Luxburg. Overfitting of Clustering and how to avoid it. 2007. ⟨inria-00185780v1⟩


212 views

3271 downloads
