Samples Are Useful? Not Always: denoising policy gradient updates using variance explained

Yannis Flet-Berliac; Philippe Preux

Preprint, Working Paper. Year: 2019

Yannis Flet-Berliac

Role: Author
PersonId: 174111
IdHAL: yannis-flet-berliac
ORCID: 0000-0002-1191-0048

Sequential Learning

Université de Lille

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Philippe Preux

Role: Author
PersonId: 5488
IdHAL: preux-philippe
IdRef: 059896353

Sequential Learning

Université de Lille

Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189

Abstract

Policy gradient algorithms in reinforcement learning optimize the policy directly and rely on efficiently sampling an environment. However, while most sampling procedures are based solely on sampling the agent's policy, other measures directly accessible through these algorithms could be used to improve sampling before each policy update. Following this line of thought, we propose SAUNA, a method in which transitions are rejected from the gradient updates if they do not meet a particular criterion, and kept otherwise. This criterion, the fraction of variance explained $\mathcal{V}^{ex}$, is a measure of the discrepancy between a model and actual samples. In this work, $\mathcal{V}^{ex}$ is used to evaluate the impact each transition will have on learning: this criterion refines sampling and improves the policy gradient algorithm. In this paper: (a) we introduce and explore $\mathcal{V}^{ex}$, the criterion used for denoising policy gradient updates; (b) we conduct experiments across a variety of benchmark environments, including standard continuous control problems, and our results show better performance with SAUNA; (c) we investigate why $\mathcal{V}^{ex}$ provides a reliable assessment for the selection of samples that will positively impact learning; (d) we show how this criterion can work as a dynamic tool to adjust the ratio between exploration and exploitation.
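As a rough illustration of the idea in the abstract, the sketch below computes the standard batch-level fraction of variance explained, $1 - \mathrm{Var}(y - \hat{y}) / \mathrm{Var}(y)$, and then filters transitions with a per-sample score. The abstract does not give the exact formula or rejection rule used by SAUNA, so the function names (`explained_variance`, `keep_mask`), the per-sample score, and the `threshold` parameter are all assumptions made for illustration and may differ from the authors' method.

```python
import numpy as np


def explained_variance(returns, value_preds):
    """Standard fraction of variance explained: 1 - Var(y - y_hat) / Var(y)."""
    returns = np.asarray(returns, dtype=float)
    value_preds = np.asarray(value_preds, dtype=float)
    var_returns = returns.var()
    if var_returns == 0.0:
        # Undefined when the targets have no variance.
        return float("nan")
    return 1.0 - (returns - value_preds).var() / var_returns


def keep_mask(returns, value_preds, threshold=0.0):
    """Boolean mask of transitions to keep for the gradient update.

    ASSUMPTION: the per-sample score 1 - (y_i - y_hat_i)^2 / Var(y) and the
    fixed threshold are hypothetical stand-ins, not the rule from the paper.
    """
    returns = np.asarray(returns, dtype=float)
    value_preds = np.asarray(value_preds, dtype=float)
    var_returns = returns.var()
    if var_returns == 0.0:
        # Nothing to discriminate on: keep every transition.
        return np.ones(returns.shape, dtype=bool)
    per_sample = 1.0 - (returns - value_preds) ** 2 / var_returns
    return per_sample >= threshold


if __name__ == "__main__":
    # Toy batch: empirical returns and value-function predictions.
    returns = [1.0, 2.0, 3.0, 4.0]
    value_preds = [1.0, 2.0, 3.0, 10.0]
    print(explained_variance(returns, value_preds))
    print(keep_mask(returns, value_preds))
```

In a policy gradient loop, such a mask would be applied to the batch before computing the loss, so that only the retained transitions contribute to the update.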

Keywords

policy gradient; reinforcement learning; sampling

Domains

Machine Learning [cs.LG]; Machine Learning [stat.ML]; Artificial Intelligence [cs.AI]

Main file

main.pdf (8.64 MB)

Gym/HalfCheetah-v2-nofilter.pdf (43.32 KB)

Gym/HalfCheetah-v2.pdf (32.36 KB)

Gym/Hopper-v2-nofilter.pdf (45.81 KB)

Gym/Hopper-v2.pdf (32.82 KB)

Gym/InvertedDoublePendulum-v2-nofilter.pdf (48.34 KB)

Gym/InvertedDoublePendulum-v2.pdf (35.88 KB)

Gym/InvertedPendulum-v2-nofilter.pdf (46.88 KB)

Gym/InvertedPendulum-v2.pdf (33.93 KB)

Gym/Reacher-v2-nofilter.pdf (46.32 KB)

Gym/Reacher-v2.pdf (34.41 KB)

Gym/Swimmer-v2-nofilter.pdf (42.32 KB)

Gym/Swimmer-v2.pdf (30.86 KB)

Gym/Walker2d-v2-nofilter.pdf (46.03 KB)

Gym/Walker2d-v2.pdf (33.08 KB)

Gym/frames.pdf (2.84 MB)

Roboschool/RoboschoolHumanoidFlagrunHarder-v1-128actors.pdf (1.05 MB)

Roboschool/RoboschoolHumanoidFlagrunHarder-v1-512-256-128.pdf (1.15 MB)

Roboschool/RoboschoolHumanoidFlagrunHarder-v1-64-64.pdf (992.75 KB)

figures/diagram.pdf (61.32 KB)

figures/ev.pdf (1.29 MB)

figures/vexp.pdf (58.28 KB)

Origin: Files produced by the author(s)

Contributor: Yannis Flet-Berliac

https://inria.hal.science/hal-02091547

Submitted on: Wednesday, September 25, 2019, 16:15:32

Last modified on: Friday, February 16, 2024, 11:12:07

Dates and versions

hal-02091547, version 1 (08-04-2019)

hal-02091547, version 2 (10-04-2019)

hal-02091547, version 3 (24-09-2019)

hal-02091547, version 4 (25-09-2019)

hal-02091547, version 5 (13-05-2020)

hal-02091547, version 6 (20-11-2020)

Identifiers

HAL Id: hal-02091547, version 4

Cite

Yannis Flet-Berliac, Philippe Preux. Samples Are Useful? Not Always: denoising policy gradient updates using variance explained. 2019. ⟨hal-02091547v4⟩
