Random Forests for Big Data - Inria - Institut national de recherche en sciences et technologies du numérique
Preprint, Working Paper. Year: 2015

Random Forests for Big Data

Abstract

Big Data is one of the major challenges of statistical science and has numerous consequences from both algorithmic and theoretical viewpoints. Big Data always involves massive data, but it often also includes data streams and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single versatile framework. Focusing on classification problems, this paper reviews available proposals for random forests in parallel environments as well as for online random forests. We then formulate various remarks about random forests in the Big Data context. Finally, we experiment with three variants, involving subsampling, Big Data-bootstrap and MapReduce respectively, on two massive datasets (15 and 120 million observations), one simulated and one from real-world data.
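
As an illustration of the ideas named in the abstract (bootstrap samples, aggregation by majority vote, and a MapReduce-style split of the data), here is a minimal Python sketch. It is not the authors' implementation: the function names (`fit_stump`, `fit_forest`, `fit_chunked_forest`, `predict`) are hypothetical, depth-one trees stand in for full decision trees, and the chunks are processed sequentially rather than on separate workers.

```python
import random
from collections import Counter


def fit_stump(rows, labels, rng):
    """Fit a depth-one tree on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not right:
            continue  # the largest value leaves nothing on the right side
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # The feature is constant in this sample: always predict the majority class.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, rows[0][feature], majority, majority)
    return (feature,) + best[1:]


def fit_forest(rows, labels, n_trees, rng):
    """Bagging: each tree is fitted on a bootstrap sample (drawn with replacement)."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_labels, rng))
    return forest


def fit_chunked_forest(rows, labels, n_chunks, trees_per_chunk, seed=0):
    """MapReduce-style variant: one small forest per data chunk, then merge."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)  # random assignment of observations to chunks
    forest = []
    for c in range(n_chunks):
        idx = order[c::n_chunks]  # "map": this chunk could go to its own worker
        chunk_rows = [rows[i] for i in idx]
        chunk_labels = [labels[i] for i in idx]
        # "reduce": the final forest is the concatenation of the per-chunk trees
        forest.extend(fit_forest(chunk_rows, chunk_labels, trees_per_chunk, rng))
    return forest


def predict(forest, row):
    """Aggregate the trees by majority vote."""
    votes = Counter(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Toy two-class data (made up for this sketch): class 1 when the value is >= 0.5.
rows = [(i / 100, i / 100) for i in range(100)]
labels = [int(i >= 50) for i in range(100)]
forest = fit_chunked_forest(rows, labels, n_chunks=4, trees_per_chunk=10)
```

A subsampling variant, in the same spirit, would call `fit_forest` once on a random subset of the rows instead of on every chunk.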
Main file
genuer_etal_p2015-submitted.pdf (249.35 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01233923, version 1 (25-11-2015)
hal-01233923, version 2 (22-03-2017)

Cite

Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot, Nathalie Vialaneix. Random Forests for Big Data. 2015. ⟨hal-01233923v1⟩