A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess et al., Maximum a posteriori policy optimisation, 2018.

O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew et al., Learning dexterous in-hand manipulation, The International Journal of Robotics Research, vol.39, issue.1, pp.3-20, 2020.

T. Anthony, R. Nishihara, P. Moritz, T. Salimans, and J. Schulman, Policy gradient search: Online planning and expert iteration without search trees, 2019.

P. Auer, Using confidence bounds for exploitation-exploration trade-offs, Journal of Machine Learning Research, vol.3, pp.397-422, 2002.

G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan et al., Distributional policy gradients, International Conference on Learning Representations, 2018.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, The arcade learning environment: An evaluation platform for general agents, Journal of Artificial Intelligence Research, vol.47, pp.253-279, 2013.

S. Boyd and L. Vandenberghe, Convex optimization, 2004.

C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling et al., A survey of Monte Carlo tree search methods, IEEE Transactions on Computational Intelligence and AI in Games, vol.4, issue.1, pp.1-43, 2012.

S. Bubeck and N. Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends in Machine Learning, vol.5, pp.1-122, 2012.

I. Csiszár, Eine informationstheoretische Ungleichung und ihre Anwendung auf Beweis der Ergodizitaet von Markoffschen Ketten, Magyar Tud. Akad. Mat. Kutato Int. Koezl, vol.8, pp.85-108, 1964.

P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert et al., OpenAI Baselines, 2017.

G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap et al., Deep reinforcement learning in large discrete action spaces, 2015.

G. Farquhar, T. Rocktäschel, M. Igl, and S. Whiteson, TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning, 2017.

R. Fox, A. Pakman, and N. Tishby, Taming the noise in reinforcement learning via soft updates, 2015.

M. Geist, B. Scherrer, and O. Pietquin, A theory of regularized Markov decision processes, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02273741

Google, Cloud TPU, Google Cloud, 2020.

J. Grill, O. D. Domingues, P. Ménard, R. Munos, and M. Valko, Planning in entropy-regularized Markov decision processes and games, Neural Information Processing Systems, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02387515

A. Guez, T. Weber, I. Antonoglou, K. Simonyan, O. Vinyals et al., Learning to search with MCTSnets, 2018.

T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, Reinforcement learning with deep energy-based policies, Proceedings of the 34th International Conference on Machine Learning, vol.70, pp.1352-1361, 2017.