In the countable case, a Markov chain is ergodic if and only if it is irreducible and aperiodic, that is, if and only if: ∀(x, y) ∈ X², ∃n₀, ∀n ≥ n₀, Pⁿ(x, y) > 0.
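To make this definition concrete in the finite case, here is a minimal sketch (Python/NumPy; the function name `is_ergodic` and the use of Wielandt's bound are our own illustration choices, not part of the original text). It decides ergodicity by testing whether some power of the transition matrix is entrywise positive, which for a finite stochastic matrix is equivalent to irreducibility plus aperiodicity; Wielandt's bound guarantees that if such a power exists, one appears by exponent (n-1)² + 1.

```python
import numpy as np

def is_ergodic(P, tol=1e-12):
    """Decide whether the finite Markov chain with transition matrix P is ergodic.

    For a finite chain, "irreducible and aperiodic" is equivalent to the existence
    of some n0 such that P^n has all entries > 0 for every n >= n0; it suffices to
    find a single power with strictly positive entries, and Wielandt's bound says
    that, if such a power exists, one appears by exponent (n - 1)^2 + 1.
    """
    n = P.shape[0]
    Q = np.eye(n)
    for _ in range((n - 1) ** 2 + 1):
        Q = Q @ P
        if np.all(Q > tol):
            return True
    return False

# The 2-state chain that deterministically alternates is irreducible but periodic:
P_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])
# A chain with positive self-loop probability mixes and is ergodic:
P_ergodic = np.array([[0.9, 0.1],
                      [0.5, 0.5]])
print(is_ergodic(P_periodic))  # False
print(is_ergodic(P_ergodic))   # True
```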

It is much slower than CPI+, which always converges in a small number of iterations (most of the time in fewer than 10 iterations, and always in fewer than 20). API(α), the naive conservative variation of API, which is also simpler than CPI(α), is empirically close to CPI(α), although on average slightly worse. CPI+, CPI(α) and PSDP∞ have similar average performance, but the variability of PSDP∞ is significantly lower; PSDP∞ thus appears to give the best empirical results. NSPI(m) is competitive with CPI+ and CPI(α), and is close to PSDP∞. Additional experiments showed that all of these observations are stable with respect to the parameters n_S and n_A. Interestingly, the differences between the algorithms tend to fade as the dynamics of the problem become more and more stochastic (as the branching factor increases). This is consistent with our analysis based on concentrability constants: these constants are all finite when the dynamics are strongly "mixing".
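The random problems mentioned above (with parameters n_S, n_A and a branching factor) are Garnet-like MDPs in the spirit of Archibald et al. (cited below). As an illustrative sketch only, and not the exact generator used in these experiments, the assumed function `garnet(n_S, n_A, n_B)` below draws, for each state-action pair, n_B successor states with random transition probabilities, so that a larger branching factor n_B yields more stochastic, more "mixing" dynamics.

```python
import numpy as np

def garnet(n_S, n_A, n_B, seed=0):
    """Draw a random Garnet-like MDP (transition kernel P and reward R).

    For each (state, action) pair, n_B distinct successor states are chosen
    uniformly at random and receive random transition probabilities; the
    branching factor n_B controls how stochastic the dynamics are
    (n_B = 1 gives deterministic transitions, larger n_B gives more mixing).
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_S, n_A, n_S))      # P[s, a, s'] = transition probability
    R = rng.uniform(size=(n_S, n_A))   # R[s, a] = reward, drawn uniformly in [0, 1]
    for s in range(n_S):
        for a in range(n_A):
            successors = rng.choice(n_S, size=n_B, replace=False)
            # n_B probabilities summing to 1, obtained from sorted uniform cut points.
            cuts = np.sort(np.concatenate(([0.0, 1.0], rng.uniform(size=n_B - 1))))
            P[s, a, successors] = np.diff(cuts)
    return P, R

P, R = garnet(n_S=30, n_A=4, n_B=2)
assert np.allclose(P.sum(axis=2), 1.0)  # every (s, a) row is a probability distribution
```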

M. Akian and S. Gaubert, Policy iteration for perfect information stochastic mean payoff games with bounded first return times is strongly polynomial, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00881207

A. Antos, R. Munos, and Cs. Szepesvári, Fitted Q-iteration in continuous action-space MDPs, Neural Information Processing Systems, pp.9-16, 2007.

A. Antos, C. Szepesvári, and R. Munos, Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, Machine Learning, pp.89-129, 2008.
DOI : 10.1007/s10994-007-5038-2

URL : https://hal.archives-ouvertes.fr/hal-00830201

T. Archibald, K. Mckinnon, and L. Thomas, On the Generation of Markov Decision Processes, Journal of the Operational Research Society, vol.46, issue.3, pp.354-361, 1995.
DOI : 10.1057/jors.1995.50

M. G. Azar, V. Gómez, and H. J. Kappen, Dynamic Policy Programming with Function Approximation, 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

J. A. Bagnell, S. M. Kakade, A. Ng, and J. Schneider, Policy search by dynamic programming, Neural Information Processing Systems, 2003.

L. Baird, Residual Algorithms: Reinforcement Learning with Function Approximation, International Conference on Machine Learning, pp.30-37, 1995.
DOI : 10.1016/B978-1-55860-377-6.50013-X

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.114.5034

D. P. Bertsekas, Approximate policy iteration: a survey and some new methods, Journal of Control Theory and Applications, vol.27, issue.3, pp.310-335, 2011.
DOI : 10.1007/s11768-011-1005-3

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.294.8653

D. P. Bertsekas and S. Ioffe, Temporal differences-based policy iteration and applications in neuro-dynamic programming, 1996.

D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

A. Boumaza and B. Scherrer, Optimal control subsumes harmonic control, IEEE International Conference on Robotics and Automation, pp.2841-2846, 2007.
DOI : 10.1109/robot.2007.363902

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.9611

A. Boumaza and B. Scherrer, Analyse d'un algorithme d'intelligence en essaim pour le fourragement, Revue d'intelligence artificielle, vol.22, issue.6, pp.791-816, 2008.
DOI : 10.3166/ria.22.791-816

URL : https://hal.archives-ouvertes.fr/inria-00172200

J. A. Boyan, Technical update : Least-squares temporal difference learning, Machine Learning, pp.233-246, 2002.

L. Busoniu, A. Lazaric, M. Ghavamzadeh, R. Munos, R. Babuska, and B. De Schutter, Least-squares methods for Policy Iteration, in M. Wiering and M. van Otterlo, editors, Reinforcement Learning : State of the Art, 2011.

P. Canbolat and U. Rothblum, (Approximate) iterated successive approximations algorithm for sequential decision processes, Annals of Operations Research, pp.1-12
DOI : 10.1007/s10479-012-1073-x

D. Choi and B. Van Roy, A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning, Discrete Event Dynamic Systems, pp.207-239, 2006.
DOI : 10.1007/s10626-006-8134-8

Y. Engel, Algorithms and Representations for Reinforcement Learning, PhD thesis, 2005.

D. Ernst, P. Geurts, and L. Wehenkel, Tree-based batch mode reinforcement learning, Journal of Machine Learning Research, vol.6, pp.503-556, 2005.

A. M. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor, Regularized policy iteration, Neural Information Processing Systems, 2008.

A. M. Farahmand, R. Munos, and Cs. Szepesvári, Error propagation for approximate policy and value iteration, Neural Information Processing Systems, pp.568-576, 2010.

J. Fearnley, Exponential lower bounds for policy iteration, in 37th International Colloquium on Automata, Languages and Programming (ICALP), Part II, pp.551-562, 2010.

E. A. Feinberg, J. Huang, and B. Scherrer, Modified policy iteration algorithms are not strongly polynomial for discounted dynamic programming, Operations Research Letters, vol.42, issue.6-7, pp.429-431, 2014.
DOI : 10.1016/j.orl.2014.07.006

URL : https://hal.archives-ouvertes.fr/hal-01091370

A. Fern, S. Yoon, and R. Givan, Approximate Policy Iteration with a Policy Language Bias : Solving Relational Markov Decision Processes, Journal of Artificial Intelligence Research, vol.25, pp.75-118, 2006.

V. Gabillon, A. Lazaric, M. Ghavamzadeh, and B. Scherrer, Classification-based policy iteration with a critic, International Conference on Machine Learning, pp.1049-1056, 2011.

M. Geist and O. Pietquin, Algorithmic Survey of Parametric Value Function Approximation, IEEE Transactions on Neural Networks and Learning Systems, vol.24, issue.6, pp.845-867, 2013.
DOI : 10.1109/TNNLS.2013.2247418

URL : https://hal.archives-ouvertes.fr/hal-00869725

G. J. Gordon, Stable Function Approximation in Dynamic Programming, International Conference on Machine Learning, pp.261-268, 1995.
DOI : 10.1016/B978-1-55860-377-6.50040-2

C. Guestrin, D. Koller, and R. Parr, Max-norm projections for factored MDPs, IJCAI, 2001.

C. Guestrin, D. Koller, R. Parr, and S. Venkataraman, Efficient Solution Algorithms for Factored MDPs, Journal of Artificial Intelligence Research (JAIR), vol.19, pp.399-468, 2003.

L. Györfi, M. Kohler, A. Krzyzak, and H. Walk, A Distribution-Free Theory of Nonparametric Regression, 2002.

T. D. Hansen, P. B. Miltersen, and U. Zwick, Strategy Iteration Is Strongly Polynomial for 2-Player Turn-Based Stochastic Games with a Constant Discount Factor, Journal of the ACM, vol.60, issue.1, 2013.
DOI : 10.1145/2432622.2432623

T. D. Hansen and U. Zwick, Lower bounds for Howard's algorithm for finding minimum mean-cost cycles, pp.415-426, 2010.
DOI : 10.1007/978-3-642-17517-6_37

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.174.7896

R. Hollanders, J. C. Delvenne, and R. Jungers, The complexity of Policy Iteration is exponential for discounted Markov Decision Processes, 51st IEEE Conference on Decision and Control (CDC), 2012.
DOI : 10.1109/CDC.2012.6426485

S. M. Kakade and J. Langford, Approximately Optimal Approximate Reinforcement Learning, International Conference on Machine Learning, pp.267-274, 2002.

M. J. Kearns and S. P. Singh, Bias-variance error bounds for temporal difference updates, pp.142-147, 2000.

M. Lagoudakis and R. Parr, Reinforcement Learning as Classification : Leveraging Modern Classifiers, International Conference on Machine Learning, pp.424-431, 2003.

M. G. Lagoudakis and R. Parr, Least-squares policy iteration, Journal of Machine Learning Research, vol.4, pp.1107-1149, 2003.

A. Lazaric, M. Ghavamzadeh, and R. Munos, Analysis of a Classification-based Policy Iteration Algorithm, International Conference on Machine Learning, pp.607-614, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482065

A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-sample analysis of least-squares policy iteration, Journal of Machine Learning Research, vol.13, pp.3041-3074, 2012.
URL : https://hal.archives-ouvertes.fr/inria-00528596

B. Lesner and B. Scherrer, Non-Stationary Approximate Modified Policy Iteration, 2015.

O. A. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh, Finite Sample Analysis of Bellman Residual Minimization, in Masashi Sugiyama and Qiang Yang, editors, Asian Conference on Machine Learning, JMLR : Workshop and Conference Proceedings, pp.309-324, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00830212

Y. Mansour and S. P. Singh, On the complexity of policy iteration, in Uncertainty in Artificial Intelligence, pp.401-408, 1999.

M. Melekopoglou and A. Condon, On the Complexity of the Policy Improvement Algorithm for Markov Decision Processes, ORSA Journal on Computing, vol.6, issue.2, pp.188-192, 1994.
DOI : 10.1287/ijoc.6.2.188

R. Munos, Error bounds for approximate policy iteration, International Conference on Machine Learning, 2003.

R. Munos, Performance Bounds in $L_p$-norm for Approximate Value Iteration, SIAM Journal on Control and Optimization, vol.46, issue.2, pp.541-561, 2007.
DOI : 10.1137/040614384

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.466.4205

R. Munos and Cs. Szepesvári, Finite-time bounds for fitted value iteration, Journal of Machine Learning Research, vol.9, pp.815-857, 2008.

A. Nedic and D. P. Bertsekas, Least squares policy evaluation algorithms with linear function approximation, Discrete Event Dynamic Systems: Theory and Applications, vol.13, pp.79-110, 2002.

M. Petrik and B. Scherrer, Biasing Approximate Dynamic Programming with a Lower Discount Factor, Neural Information Processing Systems, 2008.
URL : https://hal.archives-ouvertes.fr/inria-00337652

J. Pineau, G. J. Gordon, and S. Thrun, Point-based value iteration : An anytime algorithm for POMDPs, International Joint Conference on Artificial Intelligence, pp.1025-1032, 2003.

B. A. Pires and Cs. Szepesvári, Statistical linear estimation with penalized estimators : an application to reinforcement learning, pp.1535-1542, 2012.

I. Post and Y. Ye, The simplex method is strongly polynomial for deterministic Markov decision processes, 2012.
DOI : 10.1137/1.9781611973105.105

URL : http://arxiv.org/abs/1208.5083

M. Puterman and M. Shin, Modified Policy Iteration Algorithms for Discounted Markov Decision Problems, Management Science, vol.24, issue.11, 1978.
DOI : 10.1287/mnsc.24.11.1127

L. C. Rogers, Pathwise Stochastic Optimal Control, SIAM Journal on Control and Optimization, vol.46, issue.3, pp.1116-1132, 2007.
DOI : 10.1137/050642885

R. Rubinstein and D. Kroese, The cross-entropy method : A unified approach to combinatorial optimization, Monte-Carlo simulation, and machine learning, 2004.

B. Scherrer, Performance bounds for lambda policy iteration. CoRR, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00185271

B. Scherrer, Should one compute the temporal difference fix point or minimize the Bellman residual ? The unified oblique projection view, International Conference on Machine Learning, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00537403

B. Scherrer, Improved and Generalized Upper Bounds on the Complexity of Policy Iteration, Neural Information Processing Systems, 2013.
DOI : 10.1287/moor.2015.0753

URL : https://hal.archives-ouvertes.fr/hal-00829532

B. Scherrer, Performance Bounds for λ-Policy Iteration and Application to the Game of Tetris, Journal of Machine Learning Research, vol.14, pp.1175-1221, 2013.
URL : https://hal.archives-ouvertes.fr/inria-00185271

B. Scherrer, Approximate Policy Iteration Schemes : A Comparison, ICML, 31st International Conference on Machine Learning, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00989982

B. Scherrer, Improved and Generalized Upper Bounds on the Complexity of Policy Iteration, Mathematics of Operations Research, vol.41, issue.3, 2016.
DOI : 10.1287/moor.2015.0753

URL : https://hal.archives-ouvertes.fr/hal-00829532

B. Scherrer, M. Ghavamzadeh, V. Gabillon, and M. Geist, Approximate Modified Policy Iteration, 29th International Conference on Machine Learning -ICML 2012, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00758882

B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner, and M. Geist, Approximate Modified Policy Iteration and its Application to the Game of Tetris, Journal of Machine Learning Research, p.47, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01091341

B. Scherrer and B. Lesner, On the use of non-stationary policies for stationary infinite-horizon Markov decision processes, Neural Information Processing Systems, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00758809

B. Scherrer and C. Thiéry, Performance bound for approximate optimistic policy iteration, 2010.

N. Schmitz, How good is Howard's policy improvement algorithm ?, Zeitschrift für Operations Research, vol.29, issue.7, pp.315-316, 1985.
DOI : 10.1007/bf01918764

R. Schoknecht, Optimality of reinforcement learning algorithms with linear function approximation, Neural Information Processing Systems, pp.1555-1562, 2002.

P. J. Schweitzer and A. Seidmann, Generalized polynomial approximations in Markovian decision processes, Journal of Mathematical Analysis and Applications, vol.110, issue.2, pp.568-582, 1985.
DOI : 10.1016/0022-247X(85)90317-8

S. Singh and R. Yee, An upper bound on the loss from approximate optimal-value functions, Machine Learning, vol.16, issue.3, pp.227-233, 1994.
DOI : 10.1007/BF00993308

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.8149

D. A. Spielman and S. H. Teng, Smoothed analysis of algorithms, Journal of the ACM, vol.51, issue.3, pp.385-463, 2004.
DOI : 10.1145/990308.990310

URL : http://arxiv.org/abs/math/0212413

R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver et al., Fast gradient-descent methods for temporal-difference learning with linear function approximation, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.
DOI : 10.1145/1553374.1553501

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.149.5674

R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning, pp.9-44, 1988.
DOI : 10.1007/BF00115009

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, IEEE Transactions on Neural Networks, vol.9, issue.5, 1998.
DOI : 10.1109/TNN.1998.712192

Cs. Szepesvári, Algorithms for Reinforcement Learning, 2010.

I. Szita and A. Lőrincz, Learning Tetris Using the Noisy Cross-Entropy Method, Neural Computation, vol.18, issue.12, pp.2936-2941, 2006.
DOI : 10.1007/s10479-005-5732-z

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.80.6681

M. Tagorti, Sur les abstractions et projections des processus décisionnels de Markov de grande taille, PhD thesis, 2015.

M. Tagorti and B. Scherrer, On the Rate of Convergence and Error Bounds for LSTD(λ)
URL : https://hal.archives-ouvertes.fr/hal-01186667

C. Thiéry and B. Scherrer, Building Controllers for Tetris, ICGA Journal, vol.32, issue.1, pp.3-11, 2009.
DOI : 10.3233/ICG-2009-32102

C. Thiéry and B. Scherrer, Improvements on Learning Tetris with Cross Entropy, International Computer Games Association Journal, vol.32, 2009.

C. Thiéry and B. Scherrer, Least-squares λ-policy iteration : Bias-variance trade-off in control problems, International Conference on Machine Learning, pp.1071-1078, 2010.

A. C. Thompson, Minkowski Geometry, 1996.
DOI : 10.1017/CBO9781107325845

J. N. Tsitsiklis and B. Van-roy, Feature-Based Methods for Large Scale Dynamic Programming, Machine Learning, pp.59-94, 1996.
DOI : 10.1109/cdc.1995.478954

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.15.4433

J. N. Tsitsiklis and B. Van-roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, vol.42, issue.5, pp.674-690, 1997.
DOI : 10.1109/9.580874

R. J. Williams and L. C. Baird, Tight performance bounds on greedy policies based on imperfect value functions, 1993.

Y. Ye, The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate, Mathematics of Operations Research, vol.36, issue.4, pp.593-603, 2011.
DOI : 10.1287/moor.1110.0516

B. Yu, Rates of convergence for empirical processes of stationary mixing sequences, The Annals of Probability, vol.22, issue.1, pp.94-116, 1994.

H. Yu, Convergence of least-squares temporal difference methods under general conditions, International Conference on Machine Learning, 2010.
DOI : 10.1137/100807879

URL : https://helda.helsinki.fi/bitstream/10138/17799/1/lstd_rev_Y.pdf

H. Yu and D. P. Bertsekas, Error Bounds for Approximations from Projected Linear Equations, Mathematics of Operations Research, vol.35, issue.2, pp.306-329, 2010.
DOI : 10.1287/moor.1100.0441

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.294.7789

H. Yu and D. P. Bertsekas, Q-learning and policy iteration algorithms for stochastic shortest path problems, Annals of Operations Research, vol.40, issue.1, pp.95-132, 2013.
DOI : 10.1007/s10479-012-1128-z

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.228.8129