R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, vol.8, issue.3-4, pp.229-256, 1992.
DOI : 10.1007/BF00992696

R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems 12, pp.1057-1063, 2000.

J. Baxter and P. L. Bartlett, Infinite-horizon policy-gradient estimation, Journal of Artificial Intelligence Research, vol.15, pp.319-350, 2001.

F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, and J. Schmidhuber, Policy Gradients with Parameter-Based Exploration for Control, Artificial Neural Networks - ICANN 2008, pp.387-396, 2008.
DOI : 10.1007/978-3-540-87536-9_40

J. Kober and J. Peters, Policy search for motor primitives in robotics, Advances in Neural Information Processing Systems 21, pp.849-856, 2008.
DOI : 10.1007/s10994-010-5223-6

URL : https://link.springer.com/content/pdf/10.1007%2Fs10994-010-5223-6.pdf

J. Peters and S. Schaal, Natural Actor-Critic, Neurocomputing, vol.71, issue.7-9, pp.1180-1190, 2008.
DOI : 10.1016/j.neucom.2007.11.026

J. Peters, K. Mülling, and Y. Altun, Relative entropy policy search, Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010.

I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, A survey of actor-critic reinforcement learning: Standard and natural policy gradients, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol.42, issue.6, pp.1291-1307, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00756747

M. P. Deisenroth, G. Neumann, and J. Peters, A Survey on Policy Search for Robotics, Foundations and Trends in Robotics, vol.2, issue.1-2, pp.1-142, 2013.
DOI : 10.1561/2300000021

S. Kakade and J. Langford, Approximately optimal approximate reinforcement learning, International Conference on Machine Learning, pp.267-274, 2002.

D. P. Bertsekas, Approximate policy iteration: A survey and some new methods, Journal of Control Theory and Applications, vol.9, issue.3, pp.310-335, 2011.

B. Scherrer, Approximate policy iteration schemes: A comparison, Proceedings of the 31st International Conference on Machine Learning, pp.1314-1322, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00989982

M. Pirotta, M. Restelli, A. Pecorino, and D. Calandriello, Safe policy iteration, Proceedings of the 30th International Conference on Machine Learning, pp.307-315, 2013.

Y. Abbasi-Yadkori, P. L. Bartlett, and S. J. Wright, A fast and reliable policy improvement algorithm, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp.1338-1346, 2016.

M. Pirotta, M. Restelli, and L. Bascetta, Adaptive step-size for policy gradient methods, Advances in Neural Information Processing Systems 26, pp.1394-1402, 2013.

M. Pirotta, M. Restelli, and L. Bascetta, Policy gradient in Lipschitz Markov Decision Processes, Machine Learning, vol.100, issue.2-3, pp.255-283, 2015.
DOI : 10.1007/s10994-015-5484-1

URL : https://link.springer.com/content/pdf/10.1007%2Fs10994-015-5484-1.pdf

J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, Trust region policy optimization, Proceedings of the 32nd International Conference on Machine Learning, pp.1889-1897, 2015.

P. Thomas, G. Theocharous, and M. Ghavamzadeh, High confidence policy improvement, Proceedings of the 32nd International Conference on Machine Learning, pp.2380-2388, 2015.

M. Ghavamzadeh, M. Petrik, and Y. Chow, Safe policy improvement by minimizing robust baseline regret, Advances in Neural Information Processing Systems 29, pp.2298-2306, 2016.

J. Nutini, M. W. Schmidt, I. H. Laradji, M. P. Friedlander, and H. A. Koepke, Coordinate descent converges faster with the Gauss-Southwell rule than random selection, Proceedings of the 32nd International Conference on Machine Learning, pp.1632-1641, 2015.

T. Zhao, H. Hachiya, G. Niu, and M. Sugiyama, Analysis and improvement of policy gradient estimation, Neural Networks, vol.26, pp.118-129, 2012.
DOI : 10.1016/j.neunet.2011.09.005

V. Mnih, C. Szepesvári, and J. Audibert, Empirical Bernstein stopping, Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp.672-679, 2008.
DOI : 10.1145/1390156.1390241

URL : https://hal.archives-ouvertes.fr/hal-00834983

J. Peters and S. Schaal, Reinforcement learning of motor skills with policy gradients, Neural Networks, vol.21, issue.4, pp.682-697, 2008.
DOI : 10.1016/j.neunet.2008.02.003

M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Izd. Akad. Nauk SSSR, Moscow, 1960.