VSURF: an R package for variable selection using random forests

R. Genuer 1, 2, *, J.-M. Poggi 3, 4, C. Tuleau-Malot 5
* Corresponding author
2 SISTM - Statistics In System biology and Translational Medicine
Epidémiologie et Biostatistique [Bordeaux], Inria Bordeaux - Sud-Ouest
3 SELECT - Model selection in statistical learning
Inria Saclay - Ile de France, LMO - Laboratoire de Mathématiques d'Orsay, CNRS - Centre National de la Recherche Scientifique : UMR
Abstract: Variable selection is a crucial issue in many applied classification and regression problems. For statistical analysis as well as for modeling or prediction purposes, it is of interest to remove irrelevant variables, to select all important ones, or to determine a sufficient subset for prediction. From a statistical learning perspective, these different objectives call for variable selection to simplify statistical problems, to aid diagnosis and interpretation, and to speed up data processing. The authors have proposed a variable selection method based on random forests, and the aim of this presentation is to describe the associated R package, VSURF (recently made available on CRAN), and to illustrate its use on real datasets. Introduced by Breiman, random forests (abbreviated RF in the sequel) is an attractive non-parametric statistical method for such problems, since it requires only mild conditions on the model assumed to have generated the observed data. Indeed, because it is based on decision trees and uses aggregation ideas, RF can handle different models and problems, namely regression, two-class classification, and multiclass classification, within an elegant and versatile framework. In Genuer et al. (2010) we distinguished two variable selection objectives: interpretation and prediction. The first is to find the important variables highly related to the response variable, selecting all of them even when they are highly redundant. The second is to find a small number of variables sufficient for a good parsimonious prediction of the response variable. We have proposed the following two-step procedure: the first step is the same for both objectives, while the second depends on the objective.
Document type:
Conference paper
3èmes Rencontres R, 2014, Montpellier, France

https://hal.inria.fr/hal-01096237
Contributor: Robin Genuer
Submitted on: Wednesday, 17 December 2014 - 09:45:46
Last modified on: Friday, 12 January 2018 - 01:56:22
Document(s) archived on: Monday, 23 March 2015 - 14:40:58

File

Genuer_VSURF_RR2014.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01096237, version 1

Citation

R Genuer, J.-M Poggi, C Tuleau-Malot. VSURF : un package R pour la sélection de variables à l'aide de forêts aléatoires. 3èmes Rencontres R, 2014, Montpellier, France. 〈hal-01096237〉
