, Estimation of the Warfarin dose with clinical and pharmacogenetic data, New England Journal of Medicine, vol.360, issue.8, pp.753-764, 2009.

A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li et al., Taming the monster: A fast and simple algorithm for contextual bandits, International Conference on Machine Learning (ICML), 2014.

Z. Ahmed, N. L. Roux, M. Norouzi, and D. Schuurmans, Understanding the impact of entropy on policy optimization, International Conference on Machine Learning (ICML), 2019.

C. D. Barnes and L. G. Eltherington, Drug dosage in laboratory animals: a handbook, 1966.

D. Bertsimas and C. Mccord, Optimization over continuous and multi-dimensional decisions with observational data, Advances in Neural Information Processing Systems (NeurIPS), 2018.

L. Bottou, J. Peters, J. Candela, D. X. Charles, D. M. Chickering et al., Counterfactual reasoning and learning systems: The example of computational advertising, Journal of Machine Learning Research, vol.14, issue.1, pp.3207-3260, 2013.

M. Demirer, V. Syrgkanis, G. Lewis, and V. Chernozhukov, Semi-parametric efficient policy learning with continuous actions, Advances in Neural Information Processing Systems (NeurIPS), 2019.

M. Dudik, J. Langford, and L. Li, Doubly robust policy evaluation and learning, International Conference on Machine Learning (ICML), 2011.

D. J. Foster and V. Syrgkanis, Orthogonal statistical learning, 2019.

M. Fukushima and H. Mine, A generalized proximal point algorithm for certain non-convex minimization problems, International Journal of Systems Science, vol.12, issue.8, pp.989-1000, 1981.

K. Hirano and G. W. Imbens, The propensity score with continuous treatments. Applied Bayesian modeling and causal inference from incomplete-data perspectives, vol.226164, pp.73-84, 2004.

D. G. Horvitz and D. J. Thompson, A generalization of sampling without replacement from a finite universe, Journal of the American Statistical Association, vol.47, issue.260, pp.663-685, 1952.

N. Jiang and L. Li, Doubly robust off-policy value evaluation for reinforcement learning, International Conference on Machine Learning (ICML), 2016.

T. Joachims, A. Swaminathan, and M. De-rijke, Deep learning with logged bandit feedback, International Conference on Learning Representations (ICLR), 2018.

S. Kakade and J. Langford, Approximately optimal approximate reinforcement learning, International Conference on Machine Learning (ICML), 2002.

N. Kallus and A. Zhou, Policy evaluation and optimization with continuous treatments, International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

J. Langford and T. Zhang, The epoch-greedy algorithm for multi-armed bandits with side information, Advances in Neural Information Processing Systems (NIPS), 2008.

D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, and M. De-rijke, Large-scale validation of counterfactual learning methods: A test-bed, 2016.

L. Li, W. Chu, J. Langford, T. Moon, and X. Wang, An unbiased offline evaluation of contextual bandit algorithms with generalized linear models, Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, 2012.

D. C. Liu and J. , On the limited memory bfgs method for large scale optimization, Mathematical programming, vol.45, pp.503-528, 1989.

A. Maurer and M. Pontil, Empirical bernstein bounds and sample variance penalization, Conference on Learning Theory (COLT, 2009.

A. B. Owen, Monte Carlo theory, methods and examples, 2013.

C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal, and Z. Harchaoui, Catalyst for gradient-based nonconvex optimization, International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.
URL : https://hal.archives-ouvertes.fr/hal-01773296

J. M. Robins and A. Rotnitzky, Semiparametric efficiency in multivariate regression models with missing data, Journal of the American Statistical Association, vol.90, issue.429, pp.122-129, 1995.

R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM journal on control and optimization, vol.14, issue.5, pp.877-898, 1976.

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, 2017.

Y. Su, L. Wang, M. Santacatterina, and T. Joachims, Cab: Continuous adaptive blending for policy evaluation and learning, International Conference on Machine Learning, pp.6005-6014, 2019.

R. S. Sutton, D. A. Mcallester, S. P. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems (NIPS), 2000.

A. Swaminathan and T. Joachims, Counterfactual risk minimization: Learning from logged bandit feedback, International Conference on Machine Learning (ICML), 2015.

A. Swaminathan and T. Joachims, The self-normalized estimator for counterfactual learning, Advances in Neural Information Processing Systems (NIPS), 2015.

Y. Wang, A. Agarwal, and M. Dudík, Optimal and adaptive off-policy evaluation in contextual bandits, International Conference on Machine Learning (ICML), 2017.

C. K. Williams and M. Seeger, Using the nyström method to speed up kernel machines, Adv. Neural Information Processing Systems (NIPS), 2001.

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine learning, vol.8, issue.3-4, pp.229-256, 1992.