D. Aberdeen and J. Baxter, Policy-gradient learning of controllers with internal state, 2001.

V. Aleksandrov, V. Sysoyev, and V. Shemeneva, Stochastic optimization, Engineering Cybernetics, vol.5, pp.11-16, 1968.

J. Bagnell and J. Schneider, Covariant policy search, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 2003.

L. Baird, Advantage updating, Technical Report WL-TR-93-1146, Wright Laboratory, 1993.

A. Barto, R. Sutton, and C. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, vol.13, pp.835-846, 1983.

J. Baxter and P. Bartlett, Infinite-horizon policy-gradient estimation, Journal of Artificial Intelligence Research, vol.15, pp.319-350, 2001.

J. Baxter, P. Bartlett, and L. Weaver, Experiments with infinite-horizon policy-gradient estimation, Journal of Artificial Intelligence Research, vol.15, pp.351-381, 2001.

J. Berger and R. Wolpert, The Likelihood Principle, Institute of Mathematical Statistics, 1984.

D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee, Incremental natural actor-critic algorithms, Proceedings of Advances in Neural Information Processing Systems 20, pp.105-112, 2007.

S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee, Natural actor-critic algorithms, Automatica, vol.45, issue.11, pp.2471-2482, 2009.

L. Csató and M. Opper, Sparse On-Line Gaussian Processes, Neural Computation, vol.14, issue.3, pp.641-668, 2002.

Y. Engel, Algorithms and Representations for Reinforcement Learning, PhD thesis, The Hebrew University of Jerusalem, 2005.

Y. Engel, S. Mannor, and R. Meir, Sparse Online Greedy Support Vector Regression, Proceedings of the Thirteenth European Conference on Machine Learning, pp.84-96, 2002.

Y. Engel, S. Mannor, and R. Meir, Bayes meets Bellman: The Gaussian process approach to temporal difference learning, Proceedings of the Twentieth International Conference on Machine Learning, pp.154-161, 2003.

Y. Engel, S. Mannor, and R. Meir, Reinforcement learning with Gaussian processes, Proceedings of the Twenty-Second International Conference on Machine Learning, pp.201-208, 2005.

M. Ghavamzadeh and Y. Engel, Bayesian policy gradient algorithms, Proceedings of Advances in Neural Information Processing Systems 19, pp.457-464, 2006.

M. Ghavamzadeh and Y. Engel, Bayesian actor-critic algorithms, Proceedings of the Twenty-Fourth International Conference on Machine Learning, pp.297-304, 2007.

P. Glynn, Stochastic approximation for Monte Carlo optimization, Proceedings of the 18th Conference on Winter Simulation, pp.356-365, 1986.

P. Glynn, Likelihood ratio gradient estimation for stochastic systems, Communications of the ACM, vol.33, issue.10, pp.75-84, 1990.

P. Glynn and P. L'Ecuyer, Likelihood ratio gradient estimation for stochastic recursions, Advances in Applied Probability, vol.27, issue.4, pp.1019-1053, 1995.

E. Greensmith, P. Bartlett, and J. Baxter, Variance reduction techniques for gradient estimates in reinforcement learning, Journal of Machine Learning Research, vol.5, pp.1471-1530, 2004.

T. Jaakkola and D. Haussler, Exploiting generative models in discriminative classifiers, Proceedings of Advances in Neural Information Processing Systems 11, 1999.

S. Kakade, A natural policy gradient, Proceedings of Advances in Neural Information Processing Systems 14, 2002.

H. Kimura, M. Yamamura, and S. Kobayashi, Reinforcement learning by stochastic hill climbing on discounted reward, Proceedings of the Twelfth International Conference on Machine Learning, pp.295-303, 1995.

V. Konda and J. Tsitsiklis, Actor-critic algorithms, Proceedings of Advances in Neural Information Processing Systems 12, pp.1008-1014, 2000.

P. Marbach, Simulation-Based Methods for Markov Decision Processes, PhD thesis, Massachusetts Institute of Technology, 1998.

W. Miller, R. Sutton, and P. Werbos, Neural Networks for Control, MIT Press, 1990.

A. O'Hagan, Monte Carlo is fundamentally unsound, The Statistician, vol.36, pp.247-249, 1987.

A. O'Hagan, Bayes-Hermite quadrature, Journal of Statistical Planning and Inference, vol.29, issue.3, pp.245-260, 1991.

J. Peters and S. Schaal, Reinforcement learning of motor skills with policy gradients, Neural Networks, vol.21, issue.4, pp.682-697, 2008.

J. Peters, S. Vijayakumar, and S. Schaal, Reinforcement learning for humanoid robotics, Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots, 2003.

J. Peters, S. Vijayakumar, and S. Schaal, Natural actor-critic, Proceedings of the Sixteenth European Conference on Machine Learning, pp.280-291, 2005.

H. Poincaré, Calcul des Probabilités, Georges Carré, 1896.

M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 1994.

C. Rasmussen and Z. Ghahramani, Bayesian Monte Carlo, Proceedings of Advances in Neural Information Processing Systems 15, pp.489-496, 2003.

C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.

M. Reiman and A. Weiss, Sensitivity analysis via likelihood ratios, Proceedings of the 18th Conference on Winter Simulation, 1986.

M. Reiman and A. Weiss, Sensitivity analysis for simulations via likelihood ratios, Operations Research, vol.37, issue.5, pp.830-844, 1989.

R. Rubinstein, Some Problems in Monte Carlo Optimization, 1969.

G. Rummery and M. Niranjan, On-line Q-learning using connectionist systems, Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.

J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.

R. Sutton, Temporal credit assignment in reinforcement learning, PhD thesis, University of Massachusetts Amherst, 1984.

R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

R. Sutton, D. Mcallester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Proceedings of Advances in Neural Information Processing Systems 12, pp.1057-1063, 2000.

N. Vien, H. Yu, and T. Chung, Hessian matrix distribution for Bayesian policy gradient reinforcement learning, Information Sciences, vol.181, issue.9, pp.1671-1685, 2011.

L. Weaver and N. Tao, The optimal reward baseline for gradient-based reinforcement learning, Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp.538-545, 2001.

R. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, vol.8, pp.229-256, 1992.