Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Bruno Scherrer 1, 2, *
* Corresponding author
1 Probability and Statistics
IECL - Institut Élie Cartan de Lorraine
2 BIGS - Biology, genetics and statistics
Inria Nancy - Grand Est, IECL - Institut Élie Cartan de Lorraine
Abstract: Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O \left(\frac{m}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Hansen et al., while Simplex-PI terminates after at most $O\left( \frac{nm}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Ye. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor~$\gamma$: the quantities of interest are bounds $\tau_t$ and $\tau_r$---uniform over all states and policies---respectively on the \emph{expected time spent in transient states} and on \emph{the inverse of the frequency of visits to recurrent states}, given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most $\tilde O \left( n^3 m^2 \tau_t \tau_r \right)$ iterations. This extends a recent result for deterministic MDPs by Post \& Ye, in which $\tau_t \le 1$ and $\tau_r \le n$; in particular, it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets, namely states that are transient for all policies and states that are recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most $\tilde O(m(n^2\tau_t+n\tau_r))$ iterations.
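To make the two update rules concrete, the following is a minimal, hypothetical Python sketch (not taken from the paper) of one improvement step of Howard's PI and of Simplex-PI on a finite $\gamma$-discounted MDP. The data layout (P[a] as the n x n transition matrix of action a, R[a] as its n-vector of rewards) and the numerical tolerance are assumptions made only for this illustration.

    import numpy as np

    def policy_value(policy, P, R, gamma):
        # Solve v = R_pi + gamma * P_pi v, where P_pi and R_pi follow the deterministic policy.
        n = len(policy)
        P_pi = np.array([P[policy[s]][s] for s in range(n)])
        R_pi = np.array([R[policy[s]][s] for s in range(n)])
        return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

    def advantages(policy, P, R, gamma):
        # Advantage A_pi(s, a) = Q_pi(s, a) - v_pi(s) for every state-action pair; shape (A, n).
        v = policy_value(policy, P, R, gamma)
        q = np.array([R[a] + gamma * P[a] @ v for a in range(len(P))])
        return q - v

    def howard_step(policy, P, R, gamma, tol=1e-12):
        # Howard's PI: switch the action in every state that has a positive advantage.
        adv = advantages(policy, P, R, gamma)
        new_policy = policy.copy()
        for s in range(len(policy)):
            if adv[:, s].max() > tol:
                new_policy[s] = int(adv[:, s].argmax())
        return new_policy

    def simplex_step(policy, P, R, gamma, tol=1e-12):
        # Simplex-PI: switch only the action in the single state with maximal advantage.
        adv = advantages(policy, P, R, gamma)
        a_star, s_star = np.unravel_index(adv.argmax(), adv.shape)
        new_policy = policy.copy()
        if adv[a_star, s_star] > tol:
            new_policy[s_star] = int(a_star)
        return new_policy

Either step is repeated until no state has a positive advantage, at which point the current policy is optimal; the bounds stated in the abstract count how many such iterations can occur.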

https://hal.inria.fr/hal-00829532
Contributor: Bruno Scherrer
Submitted on: Wednesday, February 10, 2016 - 09:57:21
Last modified on: Thursday, January 11, 2018 - 06:26:22
Document(s) archived on: Saturday, November 12, 2016 - 16:02:32

Files

reportv2.pdf
Files produced by the author(s)

License

Distributed under a Creative Commons Attribution 4.0 International License

Citation

Bruno Scherrer. Improved and Generalized Upper Bounds on the Complexity of Policy Iteration. Mathematics of Operations Research, INFORMS, 2016. doi:10.1287/moor.2015.0753. hal-00829532v4.

