Learning Linear Regression Models over Factorized Joins

Abstract: We investigate the problem of building least squares regression models over training datasets defined by arbitrary join queries on database tables. Our key observation is that joins entail a high degree of redundancy in both computation and data representation, which is not required for the end-to-end solution to learning over joins. We propose a new paradigm for computing batch gradient descent that exploits the factorized computation and representation of the training datasets, a rewriting of the regression objective function that decouples the computation of cofactors of model parameters from their convergence, and the commutativity of cofactor computation with relational union and projection. We introduce three flavors of this approach: F/FDB computes the cofactors in one pass over the materialized factorized join; F avoids this materialization and intermixes cofactor and join computation; F/SQL expresses this mixture as one SQL query. Our approach has the complexity of join factorization, which can be exponentially lower than that of standard joins. Experiments with commercial, public, and synthetic datasets show that it outperforms MADlib, Python StatsModels, and R by up to three orders of magnitude.
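To make the decoupling concrete, below is a minimal sketch in Python/NumPy of the objective rewriting described above, assuming a materialized in-memory design matrix X rather than a factorized join (the paper's systems compute the same aggregates directly over the factorized join). The function names and toy data are illustrative, not the paper's code. For least squares, the gradient of J(θ) = 1/(2n) Σᵢ (θᵀxᵢ − yᵢ)² is (Cθ − c)/n with C = Σᵢ xᵢxᵢᵀ and c = Σᵢ yᵢxᵢ, so the data-dependent cofactors C and c can be computed once and reused across all descent iterations.

import numpy as np

def cofactors(X, y):
    # Data-dependent aggregates, computed once:
    #   C = sum_i x_i x_i^T  (cofactor matrix of the model parameters)
    #   c = sum_i y_i x_i
    # The paper pushes these sums into the (factorized) join; this
    # sketch computes them naively over a materialized matrix X.
    return X.T @ X, X.T @ y

def batch_gd(C, c, n, lr=0.1, steps=2000):
    # Batch gradient descent over the aggregates alone: the gradient
    # of 1/(2n) * sum_i (theta . x_i - y_i)^2 is (C @ theta - c) / n,
    # so convergence requires no further passes over the data.
    theta = np.zeros(len(c))
    for _ in range(steps):
        theta -= lr * (C @ theta - c) / n
    return theta

# Toy check against a direct least-squares solve (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)
C, c = cofactors(X, y)
print(batch_gd(C, c, n=X.shape[0]))          # ~ [2.0, -1.0, 0.5]
print(np.linalg.lstsq(X, y, rcond=None)[0])  # same solution

This one-time aggregation is what the factorized variants (F/FDB, F, F/SQL) accelerate: because cofactor computation commutes with relational union and projection, C and c can be computed over the factorized join at the cost of join factorization, which can be exponentially lower than materializing the join itself.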
Document type:
Conference paper
ACM SIGMOD, Jun 2016, San Francisco, United States

https://hal.inria.fr/hal-01330113
Contributor: Radu Ciucanu
Submitted on: Friday, June 10, 2016 - 00:29:47
Last modified on: Friday, June 10, 2016 - 00:29:47

Identifiers

  • HAL Id: hal-01330113, version 1

Citation

Maximilian Schleich, Dan Olteanu, Radu Ciucanu. Learning Linear Regression Models over Factorized Joins. ACM SIGMOD, Jun 2016, San Francisco, United States. 〈hal-01330113〉
