Comparing Two Clusterings Using Matchings between Clusters of Clusters

Frédéric Cazals; Dorian Mazauric; Romain Tetley; Rémi Watrigant

doi:10.1145/3345951

Journal article, ACM Journal of Experimental Algorithmics, Year: 2019


Frédéric Cazals

Role: Author
PersonId : 1189617
ORCID : 0000-0003-2735-6755
IdRef : 094973881

Algorithms, Biology, Structure

Dorian Mazauric

Role: Author

Algorithms, Biology, Structure

Romain Tetley

Role: Author

Inria Sophia Antipolis - Méditerranée

Rémi Watrigant

Role: Author

Inria Sophia Antipolis - Méditerranée

Abstract

Clustering is a fundamental problem in data science, yet the variety of clustering methods and their sensitivity to parameters make clustering hard. To analyze the stability of a given clustering algorithm while varying its parameters, and to compare clusters yielded by different algorithms, several comparison schemes based on matchings, information theory and various indices (Rand, Jaccard) have been developed. We go beyond these by providing a novel class of methods computing meta-clusters within each clustering (a meta-cluster is a group of clusters), together with a matching between these.

Let the intersection graph of two clusterings be the edge-weighted bipartite graph in which the nodes represent the clusters, the edges represent the non-empty intersections between two clusters, and the weight of an edge is the number of common items. We introduce the so-called D-family-matching problem on intersection graphs, with D the upper bound on the diameter of the graph induced by the clusters of any meta-cluster. First, we prove NP-completeness results and unbounded approximation ratios of simple strategies. Second, we design exact polynomial-time dynamic programming algorithms for some classes of graphs (in particular trees). Then, we design efficient spanning-tree-based algorithms for general graphs.

Our experiments illustrate the role of D as a scale parameter providing information on the relationship between clusters within a clustering and in-between two clusterings. They also show the advantages of our built-in mapping over classical cluster comparison measures such as the variation of information (VI).
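The intersection graph defined in the abstract can be illustrated with a minimal Python sketch. It assumes each clustering is given as a list of cluster labels, one per item, in the same item order; the function name `intersection_graph` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def intersection_graph(labels_a, labels_b):
    """Return the weighted edges of the intersection graph of two clusterings.

    labels_a[i] and labels_b[i] are the cluster labels of item i in the
    first and second clustering (an assumed input format).
    The result maps (cluster_in_a, cluster_in_b) to the number of common items;
    pairs of clusters with an empty intersection have no edge.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    # Each item adds 1 to the edge joining its cluster in A to its cluster in B.
    return dict(Counter(zip(labels_a, labels_b)))


# Toy example: six items, three clusters in A, two clusters in B.
labels_a = [0, 0, 0, 1, 1, 2]
labels_b = [0, 0, 1, 1, 1, 1]
edges = intersection_graph(labels_a, labels_b)
for (cluster_a, cluster_b), weight in sorted(edges.items()):
    print(f"A{cluster_a} -- B{cluster_b}: weight {weight}")
```

The edge weights sum to the number of items, since every item lies in exactly one cluster of each clustering. The sketch only builds the graph; it does not attempt the D-family-matching problem itself.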

Clustering is an essential task in data analysis, but the variety of available methods makes it difficult. Various strategies have been proposed to analyze the stability of a clustering as a function of the parameters of the algorithm that generated it, or to compare clusterings produced by different algorithms. We go beyond these by proposing a new class of methods that form groups of clusters (meta-clusters) within each clustering and establish a correspondence between them.

More specifically, define the intersection graph of two clusterings as the bipartite graph whose vertices are the clusters, each edge being weighted by the number of points common to two clusters. We define the D-family-matching problem on the intersection graph, D being an upper bound on the diameter of the graph induced by the clusters of the meta-clusters. First, we establish hardness and inapproximability results. Second, we develop dynamic programming algorithms for certain classes of graphs (trees in particular). Finally, we design efficient algorithms, based on spanning trees, for general graphs.

Our experimental results illustrate the role of D as a scale parameter providing information on the relationships between clusters within or across clusterings. They also show the advantages of our matching over classical clustering comparison tools such as the variation of information (VI).
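For context, the variation of information (VI) named in the abstract as a classical comparison measure is commonly defined as VI(A, B) = H(A) + H(B) - 2 I(A, B), where H is the entropy of a clustering and I the mutual information between two clusterings. The sketch below is a minimal illustration of that standard definition using natural logarithms; the function name and label-list input format are assumptions, and this is not the authors' code.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Compute VI(A, B) = H(A) + H(B) - 2 I(A, B) with natural logarithms."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("clusterings must label the same non-empty item set")
    count_a = Counter(labels_a)                 # cluster sizes in A
    count_b = Counter(labels_b)                 # cluster sizes in B
    count_ab = Counter(zip(labels_a, labels_b))  # intersection sizes

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information summed over non-empty intersections only.
    mutual = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    return entropy(count_a) + entropy(count_b) - 2.0 * mutual


vi_same = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])
vi_independent = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
print(vi_same, vi_independent)
```

VI returns a single number and is zero when the two clusterings coincide up to relabeling. Unlike the matching between meta-clusters described in the abstract, it does not say which clusters correspond to which.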

Keywords

Dynamic programming algorithms; NP-completeness; Graph decomposition; Clustering stability; Comparison of clusterings

Domains

Computational Geometry [cs.CG]


https://inria.hal.science/hal-02425599

Submitted on: Monday, December 30, 2019, 19:23:04

Last modified on: Thursday, February 15, 2024, 15:28:00

Dates and versions

hal-02425599 , version 1 (30-12-2019)

Identifiers

HAL Id : hal-02425599 , version 1
DOI : 10.1145/3345951

Cite

Frédéric Cazals, Dorian Mazauric, Romain Tetley, Rémi Watrigant. Comparing Two Clusterings Using Matchings between Clusters of Clusters. ACM Journal of Experimental Algorithmics, 2019, 24 (1), pp.1-41. ⟨10.1145/3345951⟩. ⟨hal-02425599⟩


Collections

INRIA INRIA2 UNIV-COTEDAZUR 3IA-COTEDAZUR ANR
