
Simultaneous semi-parametric estimation of clustering and regression

Matthieu Marbac (1, 2), Mohammed Sedki (3, 4), Christophe Biernacki (5), Vincent Vandewalle (6)
5 MODAL - MOdel for Data Analysis and Learning
LPP - Laboratoire Paul Painlevé - UMR 8524, Université de Lille, Sciences et Technologies, Inria Lille - Nord Europe, METRICS - Evaluation des technologies de santé et des pratiques médicales - ULR 2694, Polytech Lille - École polytechnique universitaire de Lille
Abstract: We investigate parameter estimation for regression models with fixed group effects when the group variable is missing but group-related variables are available. This problem involves clustering, to infer the missing group variable from the group-related variables, and regression, to build a model of the target variable given the group and possibly additional variables. The problem can therefore be formulated as modeling the joint distribution of the target and the group-related variables. The usual parameter estimation strategy for this joint model is a two-step approach: first learn the group variable (clustering step), then plug its estimate into the fit of the regression model (regression step). However, this approach is suboptimal (in particular, it yields biased regression estimates) because it does not use the target variable for clustering. We therefore argue for simultaneous estimation of clustering and regression in a semi-parametric framework. Numerical experiments illustrate the benefits of our proposal across a wide range of distributions and regression models. The relevance of the new method is illustrated on real data concerning high blood pressure prevention.
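To make the two-step baseline described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' method or code: the simulated data, the one-dimensional 2-means clustering, the intercept-only regression, and the names `two_means_1d`, `group_intercepts` and `simulate` are all assumptions chosen for illustration. The clustering step uses only the group-related variable `xs`, and the regression step then treats the estimated groups as if they were observed.

```python
import random
import statistics


def two_means_1d(xs, n_iter=20):
    """Cluster scalar values into two groups with Lloyd's algorithm (hypothetical helper)."""
    centers = [min(xs), max(xs)]  # deterministic initialisation: cluster 0 is the low one
    labels = [0] * len(xs)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1 for x in xs]
        # Update step: recompute each centre (keep the old one if a cluster is empty).
        for k in (0, 1):
            members = [x for x, z in zip(xs, labels) if z == k]
            if members:
                centers[k] = statistics.fmean(members)
    return labels


def group_intercepts(ys, labels):
    """Least-squares fit of y = beta_group + noise, i.e. one mean per group."""
    return {
        k: statistics.fmean([y for y, z in zip(ys, labels) if z == k])
        for k in sorted(set(labels))
    }


def simulate(n=400, seed=0):
    """Toy data (assumed): two groups, one group-related variable, one target."""
    rng = random.Random(seed)
    true_groups = [rng.randrange(2) for _ in range(n)]
    xs = [rng.gauss(4.0 * g, 1.0) for g in true_groups]         # group-related variable
    ys = [rng.gauss(-1.0 + 2.0 * g, 0.5) for g in true_groups]  # target, intercepts -1 and +1
    return true_groups, xs, ys


true_groups, xs, ys = simulate()
estimated_groups = two_means_1d(xs)                # clustering step: ignores ys
beta_hat = group_intercepts(ys, estimated_groups)  # regression step: plugs in the estimated groups
print(beta_hat)
```

In this toy setting, any observation assigned to the wrong cluster contributes a target value from the other group, which tends to pull the two estimated intercepts toward each other. This is one simple way to see why plugging in estimated groups can bias the regression estimates.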
Document type: Journal articles
Contributor: Vincent Vandewalle
Submitted on: Friday, January 7, 2022 - 5:04:07 PM
Last modification on: Monday, July 11, 2022 - 9:51:12 AM

  • HAL Id : hal-03090573, version 2



Matthieu Marbac, Mohammed Sedki, Christophe Biernacki, Vincent Vandewalle. Simultaneous semi-parametric estimation of clustering and regression. Journal of Computational and Graphical Statistics, Taylor & Francis, In press. ⟨hal-03090573v2⟩