Analyse supervisée multibloc en grande dimension

Hadrien Lorenzo 1, 2
2 SISTM - Statistics In System biology and Translational Medicine
Inria Bordeaux - Sud-Ouest, Epidémiologie et Biostatistique [Bordeaux]
Abstract : The objective of statistical learning is to learn from observed data in order to predict the response for a new sample. In the context of vaccination, the number of features is higher than the number of individuals. This is a degenerate case of statistical analysis that requires specific tools. Regularization algorithms can address this difficulty. Different types of regularization methods can be used, depending on the structure of the data set and on the question asked. In this work, the main objective was to exploit the available information using soft-thresholded estimates of the empirical covariance matrix, computed through SVD decompositions. This solution is particularly efficient in terms of variable selection and computation time. Heterogeneous data sets (coming from different sources, also called multiblock data) were at the core of our methodology. Since generating some data sets is expensive, it is common to acquire certain types of data on only a subsample of the population. This leads to multiblock missing data patterns. The second objective of our methodology is to deal with those missing values using the response values. However, the response values are not available in the test data sets, so we designed a methodology that handles missing values in both the train and the test data sets. Thanks to soft-thresholding, our methodology can regularize and select features. The estimator requires only two parameters to be set: the number of components and the maximum number of features to be selected. Both are tuned by cross-validation. According to simulations, the proposed method shows very good results compared with benchmark methods, especially in terms of prediction and computation time. The method has also been applied to several real data sets from vaccine, thromboembolic, and food research.
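As a rough illustration of the soft-thresholding idea mentioned in the abstract, the following minimal sketch computes an empirical cross-covariance between a feature block and a response, shrinks its entries toward zero, and takes the leading left singular vector as a sparse weight vector. This is not the estimator developed in the thesis: the function names (`soft_threshold`, `sparse_first_component`), the threshold `lam`, and the simulated data are assumptions made for illustration only.

```python
import numpy as np


def soft_threshold(m, lam):
    """Shrink every entry of m toward zero by lam; entries with |value| <= lam become 0."""
    return np.sign(m) * np.maximum(np.abs(m) - lam, 0.0)


def sparse_first_component(x, y, lam):
    """Leading left singular vector of the soft-thresholded cross-covariance.

    x: (n, p) feature block, y: (n, q) response. Both are centered here.
    Hypothetical helper for illustration, not the thesis's estimator.
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    n = x.shape[0]
    cov = xc.T @ yc / (n - 1)            # empirical cross-covariance, shape (p, q)
    cov_thr = soft_threshold(cov, lam)   # small covariances are set exactly to 0
    if not np.any(cov_thr):
        # Threshold too large: nothing survives, so no feature is selected.
        return np.zeros(x.shape[1])
    u, _, _ = np.linalg.svd(cov_thr, full_matrices=False)
    return u[:, 0]                       # unit-norm weights, defined up to sign


# Simulated setting with more features than individuals (assumed data).
rng = np.random.default_rng(0)
n, p = 30, 200
x = rng.standard_normal((n, p))
y = (x[:, 0] - x[:, 1])[:, None] + 0.1 * rng.standard_normal((n, 1))

w = sparse_first_component(x, y, lam=0.6)
selected = np.flatnonzero(np.abs(w) > 1e-12)  # indices of features with nonzero weight
```

In this sketch a larger `lam` selects fewer features. The abstract instead describes tuning the maximum number of selected features, together with the number of components, by cross-validation.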

Cited literature [283 references]
Contributor : Marta Avalos
Submitted on : Tuesday, January 14, 2020 - 4:49:12 PM
Last modification on : Thursday, January 16, 2020 - 1:02:49 AM




  • HAL Id : tel-02433612, version 1



Hadrien Lorenzo. Analyse supervisée multibloc en grande dimension. Machine Learning [stat.ML]. Université de Bordeaux, 2019. French. ⟨tel-02433612⟩


