Revisiting Peng's Q(λ) for modern reinforcement learning

Abstract : Off-policy multi-step reinforcement learning algorithms fall into two classes, conservative and non-conservative: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have limited or no theoretical guarantees. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng's Q(λ), a representative example of non-conservative algorithms. We prove that it also converges to an optimal policy provided that the behavior policy slowly tracks a greedy policy, in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng's Q(λ) in complex continuous control tasks, confirming that Peng's Q(λ) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng's Q(λ), which was thought to be unsafe, is a theoretically sound and practically effective algorithm.
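To make the "non-conservative" distinction concrete, here is a minimal illustrative sketch (not the authors' implementation) of the standard Peng's Q(λ) return, computed by the backward recursion G_t = r_t + γ[(1−λ)·max_a Q(s_{t+1}, a) + λ·G_{t+1}]. The function name and argument layout are hypothetical; the key point is that, unlike conservative algorithms such as Retrace, the trace coefficient λ never depends on the behavior policy, so traces are never cut.

```python
def peng_q_lambda_targets(rewards, bootstrap_values, gamma, lam):
    """Illustrative sketch: tabular Peng's Q(lambda) targets for one trajectory.

    rewards[t]          : reward r_t received at step t
    bootstrap_values[t] : max_a Q(s_{t+1}, a); pass 0.0 when s_{t+1} is terminal
    gamma, lam          : discount factor and trace-decay parameter

    Backward recursion: G_t = r_t + gamma * ((1-lam) * v_t + lam * G_{t+1}),
    so the trace is never cut, regardless of how off-policy the data is
    (the non-conservative property discussed in the abstract).
    """
    G = bootstrap_values[-1]  # at the last step the recursion reduces to r + gamma * v
    targets = []
    for r, v in zip(reversed(rewards), reversed(bootstrap_values)):
        G = r + gamma * ((1 - lam) * v + lam * G)
        targets.append(G)
    return targets[::-1]
```

With lam=0 this reduces to one-step Q-learning targets, and with lam=1 it becomes a Monte-Carlo-style return bootstrapped only at the trajectory's end; intermediate values interpolate between the two.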
Document type :
Conference papers
Contributor : Michal Valko
Submitted on : Friday, July 16, 2021 - 5:53:08 PM
Last modification on : Tuesday, February 15, 2022 - 11:02:04 AM
Long-term archiving on : Sunday, October 17, 2021 - 7:47:19 PM




  • HAL Id : hal-03289292, version 1


Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, et al.. Revisiting Peng's Q(λ) for modern reinforcement learning. International Conference on Machine Learning, Jul 2021, Vienna / Virtual, Austria. ⟨hal-03289292⟩


