J. Baxter and P. L. Bartlett, Infinite-horizon gradient-based policy search, Journal of Artificial Intelligence Research, vol.15, pp.319-350, 2001.

D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, 1995.

S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, Incremental natural actor-critic algorithms, Conference on Neural Information Processing Systems (NIPS), 2007.

A. Fern, S. Yoon, and R. Givan, Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes, Journal of Artificial Intelligence Research, vol.25, pp.75-118, 2006.

M. Ghavamzadeh and A. Lazaric, Conservative and Greedy Approaches to Classification-based Policy Iteration, Conference on Artificial Intelligence (AAAI), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00772610

V. Heidrich-Meisner and C. Igel, Evolution Strategies for Direct Policy Search, Proceedings of the 10th International Conference on Parallel Problem Solving from Nature (PPSN X), pp.428-437, 2008.
DOI : 10.1007/978-3-540-87700-4_43

S. Kakade, A Natural Policy Gradient, Neural Information Processing Systems (NIPS), pp.1531-1538, 2001.

S. Kakade and J. Langford, Approximately optimal approximate reinforcement learning, International Conference on Machine Learning, 2002.

J. Kober and J. Peters, Policy Search for Motor Primitives in Robotics, Machine Learning, vol.84, pp.171-203, 2011.

M. G. Lagoudakis and R. Parr, Reinforcement learning as classification: Leveraging modern classifiers, International Conference on Machine Learning, pp.424-431, 2003.

A. Lazaric, M. Ghavamzadeh, and R. Munos, Analysis of a classification-based policy iteration algorithm, International Conference on Machine Learning, pp.607-614, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482065

R. Munos, Error bounds for approximate policy iteration, International Conference on Machine Learning, pp.560-567, 2003.

R. Munos, Performance bounds in Lp norm for approximate value iteration, SIAM Journal on Control and Optimization, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00124685

J. Peters and S. Schaal, Natural Actor-Critic, Neurocomputing, vol.71, issue.7-9, pp.1180-1190, 2008.
DOI : 10.1016/j.neucom.2007.11.026

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 1994.
DOI : 10.1002/9780470316887

B. Scherrer and B. Lesner, On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes, Advances in Neural Information Processing Systems 25, pp.1835-1843, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00758809

B. Scherrer, V. Gabillon, M. Ghavamzadeh, and M. Geist, Approximate Modified Policy Iteration, International Conference on Machine Learning (ICML), 2012.
URL : https://hal.archives-ouvertes.fr/hal-00758882

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, Policy Gradient Methods for Reinforcement Learning with Function Approximation, Neural Information Processing Systems (NIPS), pp.1057-1063, 1999.