D. Bernstein, S. Zilberstein, R. Washington, and J. Bresina, Planetary rover control as a Markov decision process, AAAI Spring Symposium: Game Theoretic and Decision Theoretic Agents, 2001.

D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, 1996.

W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, vol.58, issue.301, pp.13-30, 1963.

R. Howard, Dynamic programming and Markov processes, 1960.

M. Kearns and D. Koller, Efficient reinforcement learning in factored MDPs, Seventeenth International Joint Conference on Artificial Intelligence, 1999.

M. Kearns and S. Singh, Near-optimal reinforcement learning in polynomial time, Proc. 15th International Conf. on Machine Learning, pp.260-268, 1998.

P. Kumar, A Survey of Some Results in Stochastic Adaptive Control, SIAM Journal on Control and Optimization, vol.23, issue.3, pp.329-367, 1985.
DOI : 10.1137/0323023

P. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control, 1986.
DOI : 10.1137/1.9781611974263

R. Munos, Efficient resources allocation for Markov decision processes, NIPS, pp.1571-1578, 2001.

R. Munos and A. Moore, Rates of convergence for variable resolution schemes in optimal control, International Conference on Machine Learning, 2000.

M. Puterman, Markov Decision Processes, 1994.
DOI : 10.1002/9780470316887

R. Sutton, Dyna, an integrated architecture for learning, planning, and reacting, Working Notes of the AAAI Spring Symposium on Integrated Intelligent Architectures, 1991.
DOI : 10.1145/122344.122377

R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

G. Tesauro, Temporal difference learning and TD-Gammon, Communications of the ACM, vol.38, issue.3, pp.58-68, 1995.
DOI : 10.1145/203330.203343