Distributed adaptive sampling for kernel matrix approximation

Daniele Calandriello 1 Alessandro Lazaric 1 Michal Valko 1
1 SEQUEL - Sequential Learning
Inria Lille - Nord Europe, CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189
Abstract : Most kernel-based methods, such as kernel or Gaussian process regression, kernel PCA, ICA, or $k$-means clustering, do not scale to large datasets, because constructing and storing the kernel matrix $\mathbf{K}_n$ requires at least $\mathcal{O}(n^2)$ time and space for $n$ samples. Recent works show that sampling points with replacement according to their ridge leverage scores (RLS) generates small dictionaries of relevant points with strong spectral approximation guarantees for $\mathbf{K}_n$. The drawback of RLS-based methods is that computing exact RLS requires constructing and storing the whole kernel matrix. In this paper, we introduce SQUEAK, a new algorithm for kernel approximation based on RLS sampling that sequentially processes the dataset, storing a dictionary which creates accurate kernel matrix approximations with a number of points that only depends on the effective dimension $d_{eff}(\gamma)$ of the dataset. Moreover since all the RLS estimations are efficiently performed using only the small dictionary, SQUEAK is the first RLS sampling algorithm that never constructs the whole matrix $\mathbf{K}_n$, runs in linear time $\widetilde{\mathcal{O}}(nd_{eff}(\gamma)^3)$ w.r.t. $n$, and requires only a single pass over the dataset. We also propose a parallel and distributed version of SQUEAK that linearly scales across multiple machines, achieving similar accuracy in as little as $\widetilde{\mathcal{O}}(\log(n)d_{eff}(\gamma)^3)$ time.
Document type :
Conference papers
Complete list of metadatas

Cited literature [13 references]  Display  Hide  Download

https://hal.inria.fr/hal-01482760
Contributor : Michal Valko <>
Submitted on : Friday, March 3, 2017 - 6:35:50 PM
Last modification on : Saturday, September 7, 2019 - 1:19:20 AM
Long-term archiving on: Tuesday, June 6, 2017 - 1:15:28 PM

File

calandriello2017distributed.pd...
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01482760, version 1
  • ARXIV : 1803.10172

Citation

Daniele Calandriello, Alessandro Lazaric, Michal Valko. Distributed adaptive sampling for kernel matrix approximation. International Conference on Artificial Intelligence and Statistics, 2017, Fort Lauderdale, United States. ⟨hal-01482760⟩

Share

Metrics

Record views

458

Files downloads

207