Improved and generalized upper bounds on the complexity of policy iteration

Bruno Scherrer 1, 2, *
* Corresponding author
1 BIGS - Biology, genetics and statistics
2 Inria Nancy - Grand Est, IECL - Institut Élie Cartan de Lorraine
Abstract: Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O \left(\frac{m}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Hansen et al., while Simplex-PI terminates after at most $O\left( \frac{nm}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Ye. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor $\gamma$: the quantities of interest are bounds $\tau_t$ and $\tau_r$, uniform over all states and policies, respectively on the \emph{expected time spent in transient states} and on \emph{the inverse of the frequency of visits to recurrent states}, given that the process starts from the uniform distribution. We show that Simplex-PI terminates after at most $\tilde O \left( n^3 m^2 \tau_t \tau_r \right)$ iterations. This extends a recent result for deterministic MDPs by Post \& Ye, in which $\tau_t \le 1$ and $\tau_r \le n$; in particular, it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned into two sets of states that are respectively transient and recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most $\tilde O(m(n^2\tau_t+n\tau_r))$ iterations.
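As a concrete illustration of the two variants (not taken from the paper), below is a minimal NumPy sketch of exact policy iteration on a tabular MDP. The array layout P[a, s, s'], the tolerance 1e-12, and the assumption that every state shares the same set of $m$ actions are illustrative choices; in the paper, $m$ denotes the total number of actions summed over all states.

```python
import numpy as np

def policy_iteration(P, R, gamma, variant="howard", max_iter=100_000):
    """Exact PI on a tabular MDP (illustrative sketch, not the paper's code).

    P: (m, n, n) array, P[a, s, t] = probability of moving from s to t under a.
    R: (m, n) array, R[a, s] = expected immediate reward of action a in state s.
    """
    m, n, _ = P.shape
    pi = np.zeros(n, dtype=int)                 # arbitrary initial policy
    for it in range(max_iter):
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[pi, np.arange(n), :]           # (n, n) transitions under pi
        r_pi = R[pi, np.arange(n)]              # (n,)  rewards under pi
        v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

        # Advantage of every action in every state with respect to v.
        adv = R + gamma * (P @ v) - v[None, :]  # (m, n)
        if np.all(adv <= 1e-12):                # no positive advantage: optimal
            return pi, v, it

        if variant == "howard":
            # Howard's PI: switch every state that has a positive advantage.
            improvable = adv.max(axis=0) > 1e-12
            pi = np.where(improvable, adv.argmax(axis=0), pi)
        else:
            # Simplex-PI: switch only the state with the maximal advantage.
            a_star, s_star = np.unravel_index(adv.argmax(), adv.shape)
            pi = pi.copy()
            pi[s_star] = a_star
    raise RuntimeError("did not converge within max_iter iterations")

# Tiny demo on a random MDP (m = 3 actions per state, n = 5 states).
rng = np.random.default_rng(0)
m, n = 3, 5
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)               # rows become distributions
R = rng.random((m, n))
for variant in ("howard", "simplex"):
    pi, v, iters = policy_iteration(P, R, gamma=0.9, variant=variant)
    print(variant, "converged in", iters, "iterations; policy:", pi)
```

On such small random instances both variants converge in a handful of iterations; the bounds quoted in the abstract concern the worst case over all MDPs with given $n$, $m$, and $\gamma$. The sketch also makes the structural difference visible: Howard's PI may update many states per iteration, while Simplex-PI changes exactly one, mirroring a simplex pivot on the state-action pair with the largest advantage.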

https://hal.inria.fr/hal-00829532
Contributor: Bruno Scherrer
Submitted on: Wednesday, February 10, 2016 - 9:57:21 AM
Last modification on: Friday, October 5, 2018 - 3:03:51 PM
Long-term archiving on: Saturday, November 12, 2016 - 4:02:32 PM

Files

reportv2.pdf (produced by the author(s))

Licence

Distributed under a Creative Commons Attribution 4.0 International License

Citation

Bruno Scherrer. Improved and generalized upper bounds on the complexity of policy iteration. Mathematics of Operations Research, INFORMS, 2016, 41 (3), pp. 758-774. ⟨10.1287/moor.2015.0753⟩. ⟨hal-00829532v4⟩
