Direct Value Learning: Reinforcement Learning and Anti-Imitation

Riad Akrour 1, 2 Basile Mayeur 2, 1 Michele Sebag 1, 2
2 TAO - Machine Learning and Optimisation
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LRI - Laboratoire de Recherche en Informatique
Abstract: The value function, at the core of the Bellmanian Reinforcement Learning framework, associates with each state the expected discounted cumulative reward that can be gathered after visiting this state. Given an (optimal) value function, an (optimal) policy is most simply derived by "greedification", heading at each time step toward the neighbor state with maximal value. Following the Bellman equations, the value function can be built by (approximate) dynamic programming, although this faces severe scalability limitations in large state and action spaces. An alternative, inspired by the energy-based learning framework (LeCun et al. 2006), is investigated in this paper: searching for a pseudo-value function that induces the same local order on the state space as a (nearly) optimal value function. By construction, the greedification of such a pseudo-value induces the same policy as the value function itself. The presented Direct Value Learning (DiVa) approach proceeds by directly learning the pseudo-value, taking some inspiration from the Inverse Reinforcement Learning (IRL) approach. In IRL, expert demonstrations are used to infer the reward function. By contrast, DiVa uses bad demonstrations to infer the pseudo-value. Bad demonstrations are notoriously easier to generate than expert ones; typically, applying a random policy from a good initial state (e.g., a bicycle in equilibrium) will on average lead to visiting states with decreasing values (the bicycle ultimately falls down). DiVa thus uses bad demonstrations, generated from weak prior knowledge, to learn a pseudo-value with a standard learning-to-rank approach. In the model-based RL framework, where the transition function is known, the derived pseudo-value directly induces a policy. In the model-free RL setting, the state pseudo-value is exploited through off-policy learning to infer a state-action pseudo-value, which in turn induces a policy.
To the best of our knowledge, the proposed DiVa approach and its use of bad demonstrations to achieve direct value learning are original. The loss of optimality of the pseudo-value-based policy is analyzed and shown to be bounded under mild assumptions. Finally, the experimental validation of DiVa on the mountain car, bicycle, and swing-up pendulum problems comparatively demonstrates the simplicity and the merits of the approach.
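The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional unstable dynamics, the two hand-picked features, the hinge-loss ranking update and all names (`step`, `features`, `bad_demonstration`, `learn_pseudo_value`, `greedy_action`) are assumptions made for illustration only. A random policy started from the good state x = 0 produces bad demonstrations, a linear pseudo-value is fitted so that earlier states outrank later ones, and the policy is obtained by greedification with the known transition function (the model-based case).

```python
import random

ACTIONS = (-0.1, 0.0, 0.1)  # hypothetical corrective pushes


def step(x, a):
    """Toy transition function (an assumption): deviations from x = 0 double."""
    return 2.0 * x + a


def features(x):
    """Hypothetical hand-picked features for a linear pseudo-value."""
    return (abs(x), x * x)


def pseudo_value(w, x):
    """Linear pseudo-value: dot product of the weights and the features."""
    return sum(wi * fi for wi, fi in zip(w, features(x)))


def bad_demonstration(rng, horizon=6):
    """Apply a uniformly random policy from the good state x = 0."""
    traj = [0.0]
    for _ in range(horizon):
        traj.append(step(traj[-1], rng.choice(ACTIONS)))
    return traj


def learn_pseudo_value(trajs, epochs=20, lr=0.01, margin=1.0):
    """Pairwise hinge-loss ranking: each state should outrank its successor."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for traj in trajs:
            for earlier, later in zip(traj, traj[1:]):
                if pseudo_value(w, earlier) - pseudo_value(w, later) < margin:
                    # Violated ranking constraint: move w along phi(earlier) - phi(later).
                    diff = [fe - fl for fe, fl in zip(features(earlier), features(later))]
                    w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def greedy_action(w, x):
    """Greedification: pick the action whose next state has maximal pseudo-value."""
    return max(ACTIONS, key=lambda a: pseudo_value(w, step(x, a)))


rng = random.Random(0)
demos = [bad_demonstration(rng) for _ in range(10)]
w = learn_pseudo_value(demos)
print("weights:", w, "greedy action at x = 0.1:", greedy_action(w, 0.1))
```

In this toy setting the learned weights only need to order states correctly; their scale is irrelevant, which mirrors the abstract's point that a pseudo-value inducing the right local order yields the same greedy policy as the value function.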
Document type:
[Research Report] RR-8836, INRIA; CNRS; Université Paris-Sud 11. 2015, pp.18

Cited literature: 31 references
Contributor: Marc Schoenauer
Submitted on: Monday, 4 January 2016 - 15:57:25
Last modified on: Thursday, 11 January 2018 - 06:22:14
Document(s) archived on: Thursday, 7 April 2016 - 16:50:18


Files produced by the author(s)


  • HAL Id : hal-01249377, version 1


Riad Akrour, Basile Mayeur, Michele Sebag. Direct Value Learning: Reinforcement Learning and Anti-Imitation. [Research Report] RR-8836, INRIA; CNRS; Université Paris-Sud 11. 2015, pp.18. 〈hal-01249377〉


