
Analyse supervisée multibloc en grande dimension

Hadrien Lorenzo 1, 2
2 SISTM - Statistics In System biology and Translational Medicine
Inria Bordeaux - Sud-Ouest, BPH - Bordeaux population health
Abstract : The objective of statistical learning is to learn from observed data in order to predict the response for a new sample. In vaccination studies, the number of features is higher than the number of individuals. This is a degenerate case for statistical analysis and requires specific tools. Regularization algorithms can handle this setting. Different types of regularization can be used, depending on the structure of the data set and on the question being asked. In this work, the main objective was to exploit the available information through soft-thresholded estimates of the empirical covariance matrix, analysed by singular value decomposition (SVD). This solution is particularly efficient in terms of variable selection and computation time. Data sets of heterogeneous types (coming from different sources, also called multiblock data) are at the core of our methodology. Since some types of data are expensive to generate, it is common to acquire them on only a subsample of the population. This leads to multiblock missing-data patterns. The second objective of our methodology is to handle these missing values using the response values. Because the response values are not available in test data sets, we designed a methodology that handles missing values in both the training and the test data sets. Thanks to soft-thresholding, our methodology both regularizes and selects features. The estimator requires only two parameters to be fixed: the number of components and the maximum number of features to be selected. Both are tuned by cross-validation. In simulations, the proposed method performs very well compared with benchmark methods, especially in terms of prediction and computation time. The method has also been applied to several real data sets from vaccine, thromboembolic and food research.
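As an illustration of the idea named in the abstract, the following is a minimal sketch of soft-thresholding an empirical cross-covariance matrix and taking its SVD to obtain sparse feature weights. It is not the estimator developed in the thesis: the function names (`soft_threshold`, `sparse_first_component`), the threshold parameter `lam`, the single-block setting and the toy data are all assumptions made here for illustration only.

```python
import numpy as np


def soft_threshold(m, lam):
    """Shrink every entry of m towards zero by lam; entries below lam become 0."""
    return np.sign(m) * np.maximum(np.abs(m) - lam, 0.0)


def sparse_first_component(x, y, lam):
    """Illustrative only: first right singular vector of the soft-thresholded
    empirical covariance between y (n x q) and x (n x p).

    Returns the weight vector over the p features and the indices of the
    features that survive the threshold.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(x), -1)
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    cov = yc.T @ xc / (x.shape[0] - 1)          # q x p empirical covariance
    cov_st = soft_threshold(cov, lam)

    # A feature is kept if at least one thresholded covariance is non-zero.
    selected = np.flatnonzero(np.any(cov_st != 0.0, axis=0))
    w = np.zeros(x.shape[1])
    if selected.size == 0:
        return w, selected                      # everything was thresholded out

    # SVD restricted to the kept columns; dropped features keep weight 0.
    _, _, vt = np.linalg.svd(cov_st[:, selected], full_matrices=False)
    w[selected] = vt[0]
    return w, selected


# Toy data with more features (p = 200) than individuals (n = 30).
rng = np.random.default_rng(0)
x_demo = rng.standard_normal((30, 200))
y_demo = x_demo[:, :3].sum(axis=1) + 0.1 * rng.standard_normal(30)
w_demo, selected_demo = sparse_first_component(x_demo, y_demo, lam=0.8)
print(f"{selected_demo.size} of {x_demo.shape[1]} features kept")
```

In this sketch a larger `lam` keeps fewer features. The abstract instead fixes a maximum number of selected features and tunes it, together with the number of components, by cross-validation.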
Contributor : Abes Star
Submitted on : Friday, June 18, 2021 - 9:34:20 AM
Last modification on : Saturday, June 19, 2021 - 4:08:44 AM


Version validated by the jury (STAR)


  • HAL Id : tel-02433612, version 2



Hadrien Lorenzo. Analyse supervisée multibloc en grande dimension. Statistiques [math.ST]. Université de Bordeaux, 2019. French. ⟨NNT : 2019BORD0256⟩. ⟨tel-02433612v2⟩


