Skip to Main content Skip to Navigation
Reports

Clustering stability revealed by matchings between clusters of clusters

Frédéric Cazals 1 Dorian Mazauric 1 Romain Tetley 1
1 ABS - Algorithms, Biology, Structure
CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Clustering is a fundamental problem in data science, yet, the variety of clustering methods and their sensitivity to parameters make clustering hard. To analyze the stability of a given clustering algorithm while varying its parameters, and to compare clusters yielded by different algorithms, several comparison schemes based on matchings, information theory and various indices (Rand, Jaccard) have been developed. We go beyond these by providing a novel class of methods computing meta-clusters within each clustering-- a meta-cluster is a group of clusters, together with a matching between these. Altogether, these pieces of information help assessing the coherence between two clusterings. More specifically, let the intersection graph of two clusterings be the edge-weighted bipartite graph in which the nodes represent the clusters, the edges represent the non empty intersection between two clusters, and the weight of an edge is the number of common items. We introduce the so-called (k,D) and D-family-matching problems on intersection graphs, with k the number of meta-clusters and D the upper-bound on the diameter of the graph induced by the clusters of any meta-cluster. First we prove hardness and inapproximability results. Second, we design exact polynomial time dynamic programming algorithms for some classes of graphs (in particular trees). Then, we prove efficient (exact, approximation, and heuristic) algorithms, based on spanning trees, for general graphs. Practically, we present extensive experiments in two directions. First, we illustrate the ability of our algorithms to identify relevant meta-clusters between a given clustering and an edited version of it. Second, we show how our methods can be used to identify notorious instabilities of the k-means algorithm.
Complete list of metadata

Cited literature [30 references]  Display  Hide  Download

https://hal.inria.fr/hal-01410396
Contributor : Dorian Mazauric <>
Submitted on : Tuesday, December 6, 2016 - 4:28:42 PM
Last modification on : Thursday, March 5, 2020 - 4:55:16 PM
Long-term archiving on: : Tuesday, March 21, 2017 - 9:54:11 AM

File

RR-8992-partition-matching.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01410396, version 1

Collections

Citation

Frédéric Cazals, Dorian Mazauric, Romain Tetley. Clustering stability revealed by matchings between clusters of clusters. [Research Report] RR-8992, Inria Sophia Antipolis; Université Côte d'Azur. 2016, pp.51. ⟨hal-01410396⟩

Share

Metrics

Record views

339

Files downloads

298