Massively-Parallel Feature Selection for Big Data

Abstract: We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs. It then employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while PFBP dominates other competitive algorithms in its class.
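A central idea in the abstract is combining partition-local p-values of conditional independence tests with meta-analysis techniques, so that each worker only sees its own row partition and only p-values cross the network. One standard meta-analysis technique for this is Fisher's combined probability test; the sketch below (hypothetical helper `fisher_combine`, illustrative p-values, not the authors' actual implementation) shows the idea in plain Python, using the closed-form chi-square survival function for even degrees of freedom to avoid external dependencies.

```python
import math

def fisher_combine(p_values):
    """Combine k independent p-values with Fisher's method.

    Under the null hypothesis, T = -2 * sum(ln p_i) follows a
    chi-square distribution with 2k degrees of freedom. For even
    degrees of freedom the survival function has the closed form
    P(X > t) = exp(-t/2) * sum_{i=0}^{k-1} (t/2)^i / i!,
    so no statistics library is required.
    """
    k = len(p_values)
    t = -2.0 * sum(math.log(p) for p in p_values)
    half = t / 2.0
    term, total = 1.0, 1.0  # i = 0 term of the series
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Hypothetical scenario: each worker tests "feature X independent of
# the target given the selected set" on its own row partition and
# reports only a local p-value; the coordinator combines them
# without moving any raw data.
local_p = [0.04, 0.01, 0.20]  # illustrative per-partition p-values
combined = fisher_combine(local_p)
```

With a single p-value the combination is the identity (`fisher_combine([0.5])` returns 0.5), and several moderately small local p-values yield a smaller combined p-value, which is what makes partition-local evidence usable for global selection decisions.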
Cited literature: 78 references

https://hal.inria.fr/hal-01663813
Contributor: Vassilis Christophides
Submitted on: Thursday, January 18, 2018 - 8:24:38 AM
Last modification on: Friday, April 19, 2019 - 4:54:59 PM

Identifiers

  • HAL Id: hal-01663813, version 1
  • arXiv: 1708.07178

Citation

Ioannis Tsamardinos, Giorgos Borboudakis, Pavlos Katsogridakis, Polyvios Pratikakis, Vassilis Christophides. Massively-Parallel Feature Selection for Big Data. 2018. ⟨hal-01663813⟩
