, Maximum a posteriori policy optimisation, 2018.
Learning dexterous inhand manipulation, The International Journal of Robotics Research, vol.39, issue.1, pp.3-20, 2020. ,
Policy gradient search: Online planning and expert iteration without search trees, 2019. ,
Using confidence bounds for exploitationexploration trade-offs, Journal of Machine Learning Research, vol.3, pp.397-422, 2002. ,
Distributional policy gradients, International Conference on Learning Representations, 2018. ,
The arcade learning environment: An evaluation platform for general agents, Journal of Artificial Intelligence Research, vol.47, pp.253-279, 2013. ,
, Convex optimization, 2004.
A survey of monte carlo tree search methods, IEEE Transactions on Computational Intelligence and AI in games, vol.4, issue.1, pp.1-43, 2012. ,
Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends R in Machine Learning, vol.5, pp.1-122, 2012. ,
Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten, Magyer Tud. Akad. Mat. Kutato Int. Koezl, vol.8, pp.85-108, 1964. ,
, Openai baselines, 2017.
, Deep reinforcement learning in large discrete action spaces, 2015.
, TreeQN and ATreeC: Differentiable treestructured models for deep reinforcement learning, 2017.
, Taming the noise in reinforcement learning via soft updates, 2015.
A theory of regularized markov decision processes, 2019. ,
URL : https://hal.archives-ouvertes.fr/hal-02273741
, , 2020.
,
Planning in entropy-regularized Markov decision processes and games, Neural Information Processing Systems, 2019. ,
URL : https://hal.archives-ouvertes.fr/hal-02387515
Learning to search with mctsnets, 2018. ,
Reinforcement learning with deep energy-based policies, Proceedings of the 34th International Conference on Machine Learning, vol.70, pp.1352-1361, 2017. ,