Random Forests for Big Data

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it often also includes data streams and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Random forests, introduced by Breiman in 2001, are based on decision trees combined with aggregation and bootstrap ideas. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single and versatile framework. Focusing on classification problems, this paper reviews available proposals for random forests in parallel environments as well as for online random forests. We then formulate various remarks on random forests in the Big Data context. Finally, we experiment with three variants, involving subsampling, Big Data-bootstrap and MapReduce respectively, on two massive datasets (15 and 120 million observations): a simulated one and a real-world one.
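As a rough illustration of the MapReduce-style variant mentioned in the abstract, the sketch below builds a small sub-forest on each data chunk (map) and concatenates the sub-forests into one forest that predicts by majority vote (reduce). It is a toy under stated assumptions, not the authors' implementation: one-split stumps stand in for full decision trees, the chunks are processed sequentially rather than on separate workers, and all names (`fit_stump`, `map_step`, `reduce_step`, `predict`, `simulate`) and the simulated data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, rng):
    """Fit a one-split tree on a bootstrap sample of rows [(features, label), ...]."""
    sample = [rng.choice(rows) for _ in rows]       # bootstrap: draw with replacement
    feature = rng.randrange(len(sample[0][0]))      # one randomly chosen candidate feature
    best = None
    for x, _ in sample:                             # try each observed value as a threshold
        t = x[feature]
        left = [y for f, y in sample if f[feature] <= t]
        right = [y for f, y in sample if f[feature] > t]
        if not right:                               # degenerate split, skip it
            continue
        left_label, right_label = majority(left), majority(right)
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    if best is None:                                # feature is constant in the sample
        label = majority([y for _, y in sample])
        return (feature, float("inf"), label, label)
    return (feature, best[1], best[2], best[3])


def map_step(chunk, n_trees, seed):
    """Map: build a sub-forest using only the observations of one chunk."""
    rng = random.Random(seed)
    return [fit_stump(chunk, rng) for _ in range(n_trees)]


def reduce_step(sub_forests):
    """Reduce: merge the sub-forests into a single list of trees."""
    return [tree for sub in sub_forests for tree in sub]


def predict(forest, x):
    """Classify x by majority vote over all trees of the merged forest."""
    votes = [left_label if x[feature] <= t else right_label
             for feature, t, left_label, right_label in forest]
    return majority(votes)


def simulate(n, rng):
    """Hypothetical two-class data: the label depends on x0, and x1 is a noisy copy of x0."""
    rows = []
    for _ in range(n):
        x0 = rng.random()
        x1 = x0 + rng.gauss(0.0, 0.05)
        rows.append(((x0, x1), int(x0 > 0.5)))
    return rows


rng = random.Random(0)
train, test = simulate(800, rng), simulate(500, rng)
chunks = [train[i::4] for i in range(4)]            # 4 chunks of 200 observations each
forest = reduce_step([map_step(c, 10, seed=k) for k, c in enumerate(chunks)])
accuracy = sum(predict(forest, x) == y for x, y in test) / len(test)
print(len(forest), accuracy)
```

Each tree in this scheme sees a bootstrap sample of a single chunk only, never the full dataset, which is the main design trade-off of a chunk-wise approach: training cost per worker stays bounded, but the chunks should be representative of the whole data for the merged forest to behave well.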

Keywords

Big Data, MapReduce, Parallel Computing, Random Forests

Domains

Statistics [math.ST], Machine Learning [stat.ML]

Main file

genuer_etal_p2015-submitted.pdf (249.35 KB)

Origin: Files produced by the author(s)

Contributor: Nathalie Vialaneix

https://hal.science/hal-01233923

Submitted on: Wednesday, November 25, 2015, 21:57:11

Last modified on: Monday, March 11, 2024, 15:14:03

Long-term archiving on: Friday, February 26, 2016, 17:40:28

Dates and versions

hal-01233923, version 1 (25-11-2015)

hal-01233923, version 2 (22-03-2017)

Identifiers

HAL Id: hal-01233923, version 1
ARXIV : 1511.08327

Cite

Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot, Nathalie Vialaneix. Random Forests for Big Data. 2015. ⟨hal-01233923v1⟩