Improved and generalized upper bounds on the complexity of policy iteration

Bruno Scherrer

doi:10.1287/moor.2015.0753

Journal article, Mathematics of Operations Research, Year: 2016


Bruno Scherrer

Role: Corresponding author
PersonId : 1406
IdHAL : bruno-scherrer
IdRef : 073360708


Biology, genetics and statistics

Institut Élie Cartan de Lorraine

Abstract

Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O \left(\frac{m}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving a result by Hansen et al. by a factor of $O(\log n)$, while Simplex-PI terminates after at most $O\left( \frac{nm}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving a result by Ye by a factor of $O(\log n)$. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor $\gamma$: the quantities of interest are bounds $\tau_t$ and $\tau_r$ (uniform on all states and policies), respectively on the expected time spent in transient states and on the inverse of the frequency of visits in recurrent states, given that the process starts from the uniform distribution. We show that Simplex-PI terminates after at most $\tilde O \left( n^3 m^2 \tau_t \tau_r \right)$ iterations. This extends a recent result for deterministic MDPs by Post and Ye, in which $\tau_t \le 1$ and $\tau_r \le n$; in particular, it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets, respectively states that are transient and states that are recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most $\tilde O(m(n^2\tau_t+n\tau_r))$ iterations.
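The two update rules described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the data layout (`P[i][a][j]` for transition probabilities, `r[i][a]` for rewards) and the toy two-state MDP are assumptions made for illustration only. Exact rational arithmetic is used so that the test "advantage is positive" is not affected by rounding.

```python
from fractions import Fraction

def evaluate(P, r, gamma, policy):
    """Solve (I - gamma * P_pi) v = r_pi exactly by Gauss-Jordan elimination."""
    n = len(policy)
    # Augmented matrix [I - gamma * P_pi | r_pi].
    M = [[(1 if i == j else 0) - gamma * P[i][policy[i]][j] for j in range(n)]
         + [r[i][policy[i]]] for i in range(n)]
    for c in range(n):
        piv = next(k for k in range(c, n) if M[k][c] != 0)
        M[c], M[piv] = M[piv], M[c]
        p = M[c][c]
        M[c] = [x / p for x in M[c]]
        for k in range(n):
            if k != c and M[k][c] != 0:
                f = M[k][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][n] for i in range(n)]

def advantages(P, r, gamma, policy, v):
    """adv[i][a] = r(i, a) + gamma * sum_j P(j | i, a) * v(j) - v(i)."""
    n = len(policy)
    return [[r[i][a] + gamma * sum(P[i][a][j] * v[j] for j in range(n)) - v[i]
             for a in range(len(P[i]))] for i in range(n)]

def policy_iteration(P, r, gamma, policy, variant="howard"):
    """Return (policy, value, number of policy switches performed)."""
    policy = list(policy)
    iterations = 0
    while True:
        v = evaluate(P, r, gamma, policy)
        adv = advantages(P, r, gamma, policy, v)
        # Action with maximal advantage in each state.
        best = [max(range(len(row)), key=row.__getitem__) for row in adv]
        improvable = [i for i in range(len(policy)) if adv[i][best[i]] > 0]
        if not improvable:
            return policy, v, iterations
        if variant == "howard":
            # Switch in every state with a positive advantage.
            for i in improvable:
                policy[i] = best[i]
        else:
            # Simplex-style rule: switch only in the state of maximal advantage.
            i = max(improvable, key=lambda s: adv[s][best[s]])
            policy[i] = best[i]
        iterations += 1

# Hypothetical toy MDP: two states, two self-loop actions per state.
gamma = Fraction(1, 2)
P = [[[1, 0], [1, 0]],   # state 0 stays in state 0 under both actions
     [[0, 1], [0, 1]]]   # state 1 stays in state 1 under both actions
r = [[0, 1], [0, 2]]     # action 1 pays more in both states

for variant in ("howard", "simplex"):
    print(variant, policy_iteration(P, r, gamma, [0, 0], variant))
```

On this toy instance both variants reach the same policy, but the Howard-style rule does so in one switch step while the Simplex-style rule needs two, since it changes a single state per iteration. This says nothing about the worst-case bounds stated in the abstract; it only shows how the two rules differ mechanically.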

Domains

Operations research [math.OC], Complexity [cs.CC], Artificial intelligence [cs.AI], Discrete mathematics [cs.DM]

Main file

reportv2.pdf (305.3 KB)

Origin: Files produced by the author(s)


https://inria.hal.science/hal-00829532

Submitted on: Wednesday, February 10, 2016, 09:57:21

Last modified on: Monday, September 11, 2023, 17:22:02

Long-term archiving on: Saturday, November 12, 2016, 16:02:32

Dates and versions

hal-00829532 , version 1 (03-06-2013)

hal-00829532 , version 2 (06-06-2013)

hal-00829532 , version 3 (24-06-2013)

hal-00829532 , version 4 (10-02-2016)

License

Attribution

Identifiers

HAL Id : hal-00829532 , version 4
ARXIV : 1306.0386
DOI : 10.1287/moor.2015.0753

Cite

Bruno Scherrer. Improved and generalized upper bounds on the complexity of policy iteration. Mathematics of Operations Research, 2016, 41 (3), pp.758-774. ⟨10.1287/moor.2015.0753⟩. ⟨hal-00829532v4⟩


Collections

CNRS INRIA IECN UNIV-LORRAINE INRIA2 TDS-MACS IECLPS

390 Views

688 Downloads
