Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, Improved algorithms for linear stochastic bandits, Advances in Neural Information Processing Systems, 2011.

A. Antos, C. Szepesvári, and R. Munos, Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning, vol.71, issue.1, pp.89-129, 2008.
DOI : 10.1007/s10994-007-5038-2
URL : https://hal.archives-ouvertes.fr/hal-00830201

L. Baird, Residual Algorithms: Reinforcement Learning with Function Approximation, Proceedings of the Twelfth International Conference on Machine Learning, pp.30-37, 1995.
DOI : 10.1016/B978-1-55860-377-6.50013-X

D. Bertsekas, Dynamic Programming and Optimal Control, volume II, Athena Scientific, 2007.

D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
DOI : 10.1007/0-306-48332-7_333

J. Boyan, Least-squares temporal difference learning, Proceedings of the 16th International Conference on Machine Learning, pp.49-56, 1999.

S. Bradtke and A. Barto, Linear least-squares algorithms for temporal difference learning, Machine Learning, pp.33-57, 1996.
DOI : 10.1007/978-0-585-33656-5_4
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.143.857

V. de la Peña and G. Pang, Exponential inequalities for self-normalized processes with applications, Electronic Communications in Probability, vol.14, pp.372-381, 2009.
DOI : 10.1214/ECP.v14-1490

V. de la Peña, M. Klass, and T. Lai, Pseudo-maximization and self-normalized processes, Probability Surveys, vol.4, pp.172-192, 2007.
DOI : 10.1214/07-PS119

S. Delattre and S. Gaïffas, Nonparametric regression with martingale increment errors, Stochastic Processes and their Applications, pp.2899-2924, 2011.
DOI : 10.1016/j.spa.2011.08.002
URL : https://hal.archives-ouvertes.fr/hal-00530581

A. M. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor, Regularized policy iteration, Advances in Neural Information Processing Systems 21, pp.441-448, 2008.

A. M. Farahmand, R. Munos, and C. Szepesvári, Error propagation for approximate policy and value iteration, Advances in Neural Information Processing Systems, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00830154

V. Gabillon, A. Lazaric, M. Ghavamzadeh, and B. Scherrer, Classification-based policy iteration with a critic, Proceedings of the Twenty-Eighth International Conference on Machine Learning, pp.1049-1056, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00590972

M. Ghavamzadeh, A. Lazaric, O. Maillard, and R. Munos, LSTD with random projections, Advances in Neural Information Processing Systems, pp.721-729, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00943120

M. Ghavamzadeh, A. Lazaric, R. Munos, and M. Hoffman, Finite-sample analysis of Lasso-TD, Proceedings of the 28th International Conference on Machine Learning, pp.1177-1184, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00830149

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk, A Distribution-free Theory of Nonparametric Regression, 2002.
DOI : 10.1007/b97848

D. Hsu, S. Kakade, and T. Zhang, Random Design Analysis of Ridge Regression, Proceedings of the 25th Conference on Learning Theory, 2012.
DOI : 10.1007/s10208-014-9192-1

M. Lagoudakis and R. Parr, Least-squares policy iteration, Journal of Machine Learning Research, vol.4, pp.1107-1149, 2003.

A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-sample analysis of LSTD, Proceedings of the 27th International Conference on Machine Learning, pp.615-622, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482189

R. Meir, Nonparametric time series prediction through adaptive model selection, Machine Learning, pp.5-34, 2000.

S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, 1993.

B. Ávila Pires and C. Szepesvári, Statistical linear estimation with penalized estimators: an application to reinforcement learning, Proceedings of the 29th International Conference on Machine Learning, 2012.

B. Scherrer, Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view, Proceedings of the 27th International Conference on Machine Learning, pp.959-966, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00537403

P. Schweitzer and A. Seidmann, Generalized polynomial approximations in Markovian decision processes, Journal of Mathematical Analysis and Applications, vol.110, issue.2, pp.568-582, 1985.
DOI : 10.1016/0022-247X(85)90317-8

R. Sutton and A. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

M. Talagrand, The Generic Chaining: Upper and Lower Bounds of Stochastic Processes, 2005.

J. Tsitsiklis and B. Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, vol.42, issue.5, pp.674-690, 1997.
DOI : 10.1109/9.580874

B. Yu, Rates of convergence for empirical processes of stationary mixing sequences, The Annals of Probability, pp.94-116, 1994.

H. Yu, Convergence of least squares temporal difference methods under general conditions, Proceedings of the 27th International Conference on Machine Learning, pp.1207-1214, 2010.

H. Yu and D. Bertsekas, Error Bounds for Approximations from Projected Linear Equations, Mathematics of Operations Research, vol.35, issue.2, pp.306-329, 2010.
DOI : 10.1287/moor.1100.0441