A Random Matrix Analysis of Data Stream Clustering: Coping With Limited Memory Resources - Inria - Institut national de recherche en sciences et technologies du numérique Accéder directement au contenu
Article Dans Une Revue Proceedings of Machine Learning Research Année : 2022

A Random Matrix Analysis of Data Stream Clustering: Coping With Limited Memory Resources

Résumé

This article introduces a random matrix framework for the analysis of clustering on high-dimensional data streams, a particularly relevant setting for a more sober processing of large amounts of data with limited memory and energy resources. Assuming data x_1, x_2, ... arrives as a continuous flow and a small number L of them can be kept in the learning pipeline, one has only access to the diagonal elements of the Gram kernel matrix: [K_L]_{i, j} = 1 / p x_i^⊤ x_j 1_{|i−j| < L}. Under a large-dimensional data regime, we derive the limiting spectral distribution of the banded kernel matrix K_L and study its isolated eigenvalues and eigenvectors, which behave in an unfamiliar way. We detail how these results can be used to perform efficient online kernel spectral clustering and provide theoretical performance guarantees. Our findings are empirically confirmed on image clustering tasks. Leveraging on optimality results of spectral methods for clustering, this work offers insights on efficient online clustering techniques for high-dimensional data.
Fichier principal
Vignette du fichier
lebeau22a.pdf (2.34 Mo) Télécharger le fichier
Origine : Fichiers éditeurs autorisés sur une archive ouverte

Dates et versions

hal-03755939 , version 1 (22-08-2022)

Licence

Paternité

Identifiants

  • HAL Id : hal-03755939 , version 1

Citer

Hugo Lebeau, Romain Couillet, Florent Chatelain. A Random Matrix Analysis of Data Stream Clustering: Coping With Limited Memory Resources. Proceedings of Machine Learning Research, 2022, 162, pp.1-29. ⟨hal-03755939⟩
82 Consultations
42 Téléchargements

Partager

Gmail Facebook X LinkedIn More