M. S. Abdulla and S. Bhatnagar, Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes, Discrete Event Dynamic Systems: Theory and Applications, pp.23-52, 2007.
DOI : 10.1007/s10626-006-0003-y

J. Abounadi, D. Bertsekas, and V. S. Borkar, Learning Algorithms for Markov Decision Processes with Average Cost, SIAM Journal on Control and Optimization, vol.40, issue.3, pp.681-698, 2001.
DOI : 10.1137/S0363012999361974

V. Aleksandrov, V. Sysoyev, and V. Shemeneva, Stochastic Optimization, Engineering Cybernetics, vol.5, pp.11-16, 1968.

M. H. Alrefaei and S. Andradóttir, A Simulated Annealing Algorithm with Constant Temperature for Discrete Stochastic Optimization, Management Science, vol.45, issue.5, pp.748-764, 1999.
DOI : 10.1287/mnsc.45.5.748

S. Amari, Natural Gradient Works Efficiently in Learning, Neural Computation, vol.10, issue.2, pp.251-276, 1998.

T. W. Archibald, K. I. M. McKinnon, and L. C. Thomas, On the Generation of Markov Decision Processes, Journal of the Operational Research Society, vol.46, issue.3, pp.354-361, 1995.
DOI : 10.1057/jors.1995.50

L. C. Baird, Advantage Updating, Technical Report, Wright-Patterson Air Force Base, OH 45433-7301, 1993.

L. C. Baird, Residual Algorithms: Reinforcement Learning with Function Approximation, Proceedings of the Twelfth International Conference on Machine Learning, pp.30-37, 1995.
DOI : 10.1016/B978-1-55860-377-6.50013-X

J. Bagnell and J. Schneider, Covariant policy search, Proceedings of International Joint Conference on Artificial Intelligence, 2003.

A. Barto, R. S. Sutton, and C. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man and Cybernetics, vol.13, pp.834-846, 1983.
DOI : 10.1109/tsmc.1983.6313077

J. Baxter and P. L. Bartlett, Infinite-horizon policy-gradient estimation, Journal of Artificial Intelligence Research, vol.15, pp.319-350, 2001.

J. Baxter, P. L. Bartlett, and L. Weaver, Experiments with infinite-horizon, policy-gradient estimation, Journal of Artificial Intelligence Research, vol.15, pp.351-381, 2001.

J. Baxter, A. Tridgell, and L. Weaver, KnightCap: A Chess Program that Learns by Combining TD(λ) with Game-Tree Search, Proceedings of the Fifteenth International Conference on Machine Learning, pp.28-36, 1998.

R. E. Bellman and S. E. Dreyfus, Functional approximations and dynamic programming, Mathematical Tables and Other Aids to Computation, pp.247-251, 1959.
DOI : 10.2307/2002797

URL : http://www.dtic.mil/get-tr-doc/pdf?AD=AD0606538

A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, 1990.
DOI : 10.1007/978-3-642-75894-2

D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, 1995.

D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation, 1989.

D. P. Bertsekas, V. S. Borkar, and A. Nedic, Improved temporal difference methods with linear function approximation, 2003.

S. Bhatnagar and S. Kumar, A Simultaneous Perturbation Stochastic Approximation-Based Actor-Critic Algorithm for Markov Decision Processes, IEEE Transactions on Automatic Control, vol.49, issue.4, pp.592-598, 2004.
DOI : 10.1109/TAC.2004.825622

S. Bhatnagar, Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization, ACM Transactions on Modeling and Computer Simulation, vol.15, issue.1, pp.74-107, 2005.
DOI : 10.1145/1044322.1044326

S. Bhatnagar, Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization, ACM Transactions on Modeling and Computer Simulation, vol.18, issue.1, Article 2, pp.1-35, 2007.
DOI : 10.1145/1315575.1315577

S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, Incremental Natural Actor-Critic Algorithms, Advances in Neural Information Processing Systems, pp.105-112, 2008.

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.151.2177

V. S. Borkar, Stochastic approximation with two time scales, Systems & Control Letters, vol.29, issue.5, pp.291-294, 1997.
DOI : 10.1016/S0167-6911(97)90015-3

V. S. Borkar, Reinforcement Learning - A Bridge Between Numerical Methods and Monte Carlo, 2008.
DOI : 10.1142/9789814273633_0004

V. S. Borkar and S. P. Meyn, The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning, SIAM Journal on Control and Optimization, vol.38, issue.2, pp.447-469, 2000.
DOI : 10.1137/S0363012997331639

J. A. Boyan, Least-squares temporal difference learning, Proceedings of the Sixteenth International Conference on Machine Learning, pp.49-56, 1999.

J. A. Boyan and A. W. Moore, Generalization in reinforcement learning: Safely approximating the value function, Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pp.369-376, 1995.

O. Brandiere, Some Pathological Traps for Stochastic Approximation, SIAM Journal on Control and Optimization, vol.36, issue.4, pp.1293-1314, 1998.
DOI : 10.1137/S036301299630759X

URL : https://hal.archives-ouvertes.fr/hal-00694262

S. J. Bradtke and A. G. Barto, Linear least-squares algorithms for temporal difference learning, Machine Learning, pp.33-57, 1996.

X. Cao and H. F. Chen, Perturbation realization, potentials and sensitivity analysis of Markov processes, IEEE Transactions on Automatic Control, vol.42, pp.1382-1393, 1997.

C. Chow and J. N. Tsitsiklis, An optimal one-way multigrid algorithm for discrete-time stochastic control, IEEE Transactions on Automatic Control, vol.36, issue.8, pp.898-914, 1991.
DOI : 10.1109/9.133184

R. H. Crites and A. G. Barto, Elevator Group Control using Multiple Reinforcement Learning Agents, Machine Learning, pp.235-262, 1998.

J. W. Daniel, Splines and efficiency in dynamic programming, Journal of Mathematical Analysis and Applications, vol.54, issue.2, pp.402-407, 1976.
DOI : 10.1016/0022-247X(76)90209-2

A. Dukkipati, M. N. Murty, and S. Bhatnagar, Information theoretic justification of Boltzmann selection and its generalization to Tsallis case, 2005 IEEE Congress on Evolutionary Computation, pp.1667-1674, 2005.
DOI : 10.1109/CEC.2005.1554889

M. Ghavamzadeh and Y. Engel, Bayesian Policy Gradient Algorithms, Advances in Neural Information Processing Systems, vol.19, pp.457-464, 2007.
URL : https://hal.archives-ouvertes.fr/hal-00776608

M. Ghavamzadeh and Y. Engel, Bayesian actor-critic algorithms, Proceedings of the 24th international conference on Machine learning, ICML '07, pp.297-304, 2007.
DOI : 10.1145/1273496.1273534

URL : https://hal.archives-ouvertes.fr/hal-00776608

P. Glynn, Likelihood ratio gradient estimation for stochastic systems, Communications of the ACM, vol.33, issue.10, pp.75-84, 1990.
DOI : 10.1145/84537.84552

G. J. Gordon, Stable function approximation in dynamic programming, Proceedings of the Twelfth International Conference on Machine Learning, pp.261-268, 1995 (an expanded version appeared as Technical Report CMU-CS-95-103, Carnegie Mellon University).

E. Greensmith, P. L. Bartlett, and J. Baxter, Variance reduction techniques for gradient estimates in reinforcement learning, Journal of Machine Learning Research, vol.5, pp.1471-1530, 2004.

M. W. Hirsch, Convergent activation dynamics in continuous time networks, Neural Networks, vol.2, issue.5, pp.331-349, 1989.
DOI : 10.1016/0893-6080(89)90018-X

S. Kakade, A Natural Policy Gradient, Advances in Neural Information Processing Systems, vol.14, 2002.

N. Kohl and P. Stone, Policy gradient reinforcement learning for fast quadrupedal locomotion, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '04), pp.2619-2624, 2004.
DOI : 10.1109/ROBOT.2004.1307456

V. R. Konda and V. S. Borkar, Actor-Critic-Type Learning Algorithms for Markov Decision Processes, SIAM Journal on Control and Optimization, vol.38, issue.1, pp.94-123, 1999.
DOI : 10.1137/S036301299731669X

V. R. Konda and J. N. Tsitsiklis, On Actor-Critic Algorithms, SIAM Journal on Control and Optimization, vol.42, issue.4, pp.1143-1166, 2003.
DOI : 10.1137/S0363012901385691

H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, 1978.
DOI : 10.1007/978-1-4684-9352-8

H. J. Kushner and G. G. Yin, Stochastic Approximation Algorithms and Applications, 1997.
DOI : 10.1007/978-1-4899-2696-8

M. G. Lagoudakis and R. Parr, Least-Squares Policy Iteration, Journal of Machine Learning Research, vol.4, pp.1107-1149, 2003.

J. P. LaSalle and S. Lefschetz, Stability by Lyapunov's Direct Method with Applications, 1961.

M. Lee, R. S. Sutton, and M. Ghavamzadeh, Garnet Natural Actor-Critic Project, 2006.

P. Marbach and J. N. Tsitsiklis, Simulation-based optimization of Markov reward processes, IEEE Transactions on Automatic Control, vol.46, issue.2, pp.191-209, 2001.
DOI : 10.1109/9.905687

S. P. Meyn, Control Techniques for Complex Networks, 2007.

R. Pemantle, Nonconvergence to Unstable Points in Urn Models and Stochastic Approximations, The Annals of Probability, vol.18, issue.2, pp.698-712, 1990.
DOI : 10.1214/aop/1176990853

J. Peters, S. Vijayakumar, and S. Schaal, Reinforcement learning for humanoid robotics, Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots, 2003.

J. Peters and S. Schaal, Natural Actor-Critic, Neurocomputing, vol.71, issue.7-9, pp.1180-1190, 2008.
DOI : 10.1016/j.neucom.2007.11.026

J. Peters and S. Schaal, Reinforcement learning of motor skills with policy gradients, Neural Networks, vol.21, issue.4, 2008.
DOI : 10.1016/j.neunet.2008.02.003

M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
DOI : 10.1002/9780470316887

S. Richter, D. Aberdeen, and J. Yu, Natural Actor-Critic for Road Traffic Optimization, Advances in Neural Information Processing Systems, vol.19, pp.1169-1176, 2007.

G. Rummery and M. Niranjan, On-line Q-learning using Connectionist Systems, 1994.

J. Rust, Numerical dynamic programming in economics, Handbook of Computational Economics, pp.614-722, 1996.

S. Singh and P. Dayan, Analytical Mean Squared Error Curves for Temporal Difference Learning, Machine Learning, vol.32, issue.1, pp.5-40, 1998.
DOI : 10.1023/A:1007495401240

R. S. Sutton, Temporal credit assignment in reinforcement learning, Doctoral dissertation, University of Massachusetts, Amherst, 1984.

R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, pp.9-44, 1988.
DOI : 10.1007/BF00115009

R. S. Sutton, Generalization in reinforcement learning: Successful examples using sparse coarse coding, Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp.1038-1044, 1996.

R. S. Sutton, D. Mcallester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems, pp.1057-1063, 2000.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

V. Tadic, On the Convergence of Temporal Difference Learning with Linear Function Approximation, Machine Learning, vol.42, issue.3, pp.241-267, 2001.
DOI : 10.1023/A:1007609817671

G. J. Tesauro, Temporal difference learning and TD-Gammon, Communications of the ACM, vol.38, issue.3, pp.58-68, 1995.
DOI : 10.1145/203330.203343

J. N. Tsitsiklis, Asynchronous Stochastic Approximation and Q-learning, Machine Learning, pp.185-202, 1994.

J. N. Tsitsiklis and B. Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, vol.42, issue.5, pp.674-690, 1997.
DOI : 10.1109/9.580874

J. N. Tsitsiklis and B. Van Roy, Average cost temporal-difference learning, Automatica, vol.35, issue.11, pp.1799-1808, 1999.
DOI : 10.1016/S0005-1098(99)00099-0

D. J. White, A Survey of Applications of Markov Decision Processes, Journal of the Operational Research Society, vol.44, issue.11, pp.1073-1096, 1993.
DOI : 10.1057/jors.1993.181

B. Widrow and S. D. Stearns, Adaptive Signal Processing, 1985.

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, pp.229-256, 1992.