Bayesian Policy Gradient and Actor-Critic Algorithms

Mohammad Ghavamzadeh 1 Yaakov Engel 2 Michal Valko 1
1 SEQUEL - Sequential Learning
Inria Lille - Nord Europe, CRIStAL - Centre de Recherche en Informatique, Signal et Automatique de Lille (CRIStAL) - UMR 9189
2 Machine Learning
Rafael Inc.
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters in the direction of the gradient estimate. Since Monte-Carlo methods tend to have high variance, a large number of samples is required to attain accurate estimates, resulting in slow convergence. In this paper, we first propose a Bayesian framework for policy gradient, based on modeling the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates, namely, the gradient covariance, are provided at little extra cost. Since the proposed Bayesian framework considers system trajectories as its basic observable unit, it does not require the dynamics within trajectories to be of any particular form, and thus, can be easily extended to partially observable problems. On the downside, it cannot take advantage of the Markov property when the system is Markovian. To address this issue, we proceed to supplement our Bayesian policy gradient framework with a new actor-critic learning model in which a Bayesian class of non-parametric critics, based on Gaussian process temporal difference learning, is used. Such critics model the action-value function as a Gaussian process, allowing Bayes’ rule to be used in computing the posterior distribution over action-value functions, conditioned on the observed data. Appropriate choices of the policy parameterization and of the prior covariance (kernel) between action-values allow us to obtain closed-form expressions for the posterior distribution of the gradient of the expected return with respect to the policy parameters. We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems.
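
As a brief illustration of the quantities the abstract refers to (a sketch in standard policy gradient notation, chosen here for exposition and not taken verbatim from the paper): the performance of a policy with parameters $\theta$ is the expected return over complete trajectories $\xi$,

$$\eta(\theta) = \int R(\xi)\, p(\xi;\theta)\, d\xi, \qquad \nabla_\theta \eta(\theta) = \int R(\xi)\, \nabla_\theta \log p(\xi;\theta)\, p(\xi;\theta)\, d\xi .$$

Conventional Monte-Carlo policy gradient methods estimate this integral from $M$ sampled trajectories,

$$\widehat{\nabla_\theta \eta}(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} R(\xi_i)\, \nabla_\theta \log p(\xi_i;\theta),$$

which is unbiased but can have high variance. The Bayesian framework described above instead places a Gaussian process prior over (part of) the integrand and conditions on the observed trajectories, so the gradient is obtained as a posterior distribution: its mean serves as the gradient estimate and its covariance quantifies the uncertainty in that estimate.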
Document type:
Journal article
Journal of Machine Learning Research, 2016, 17 (66), pp. 1-53

Cited literature: 48 references

https://hal.inria.fr/hal-00776608
Contributor: Mohammad Ghavamzadeh
Submitted on: Tuesday, September 15, 2015 - 23:58:48
Last modified on: Thursday, January 11, 2018 - 06:27:32
Long-term archiving on: Wednesday, April 26, 2017 - 19:03:45

File

jmlr-BPG-BAC.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00776608, version 2

Citation

Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko. Bayesian Policy Gradient and Actor-Critic Algorithms. Journal of Machine Learning Research, 2016, 17 (66), pp. 1-53. 〈hal-00776608v2〉

Metrics

Record views: 429
File downloads: 217