A Markov chain is ergodic if and only if it is irreducible and aperiodic, that is, if and only if: ∀(x, y) ∈ X², ∃n₀, ∀n ≥ n₀, P^n(x, y) > 0.
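As a minimal numerical illustration of this characterization (a sketch, not from the source: the small transition matrices below are hypothetical examples), one can test whether some power P^n is entrywise positive:

```python
import numpy as np

def is_ergodic(P, n_max=64):
    """Check the characterization above: the chain is ergodic iff
    some power P^n (and then every higher power) is entrywise positive."""
    Q = np.eye(len(P))
    for _ in range(n_max):
        Q = Q @ P
        if np.all(Q > 0):
            return True
    return False

# Hypothetical 3-state chain: irreducible (all states communicate)
# and aperiodic (cycles of length 2 and 3 coexist, so gcd = 1).
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
print(is_ergodic(P))           # True (P^4 is already entrywise positive)

# A 2-state chain of period 2: irreducible but not aperiodic,
# so no power of P is entrywise positive.
P_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])
print(is_ergodic(P_periodic))  # False
```

The cutoff `n_max` is only a practical bound for the search; for an irreducible aperiodic chain on a finite state space, positivity is guaranteed to occur at some finite power.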
It is much slower than CPI+, which always converges in a small number of iterations (most of the time in fewer than 10, and always in fewer than 20). API(α), the naïve conservative variant of API, which is also simpler than CPI(α), is empirically close to CPI(α), though slightly worse on average. CPI+, CPI(α) and PSDP∞ have similar average performance, but the variability of PSDP∞ is significantly lower; PSDP∞ thus appears to give the best empirical results. NSPI(m) [...] CPI+ and CPI(α), and is close to PSDP∞. Complementary experiments showed that all of these observations are stable with respect to the parameters n_S and n_A. Interestingly, the differences between the algorithms tend to fade as the dynamics of the problem become more stochastic (as the branching factor increases). This is consistent with our analysis based on concentrability coefficients: these are all finite when the dynamics are strongly "mixing".