
Direct Value Learning: Reinforcement Learning and Anti-Imitation

Riad Akrour (1, 2), Basile Mayeur (2, 1), Michèle Sebag (1, 2)
2 TAO - Machine Learning and Optimisation
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: The value function, at the core of the Bellmanian Reinforcement Learning framework, associates with each state the expected discounted cumulative reward that can be gathered after visiting this state. Given an (optimal) value function, an (optimal) policy is most simply derived by "greedification", heading at each time step toward the neighbor state with maximal value. Following the Bellman equations, the value function can be built by (approximate) dynamic programming, albeit facing severe scalability limitations in large state and action spaces.

An alternative, inspired by the Energy-based learning framework (LeCun et al. 2006), is investigated in this paper: searching for a pseudo-value function that induces the same local order on the state space as a (nearly) optimal value function. By construction, the greedification of such a pseudo-value induces the same policy as the value function itself. The presented Direct Value Learning (DiVa) approach proceeds by directly learning the pseudo-value, taking some inspiration from the Inverse Reinforcement Learning (IRL) approach. In IRL, expert demonstrations are used to infer the reward function. Quite the contrary, DiVa uses bad demonstrations to infer the pseudo-value. Bad demonstrations are notoriously easier to generate than expert ones; typically, applying a random policy from a good initial state (e.g., a bicycle in equilibrium) will on average lead to visiting states of decreasing value (the bicycle ultimately falls down). DiVa thus uses bad demonstrations, generated from weak prior knowledge, to learn a pseudo-value via a standard learning-to-rank approach. The derived pseudo-value directly induces a policy in the model-based RL framework, where the transition function is known. In the model-free RL setting, the state pseudo-value is exploited using off-policy learning to infer a state-action pseudo-value and induce a policy.
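The greedification step can be illustrated on a toy model-based setting. The sketch below is illustrative only (the 1-D chain, names `greedy_step`, `successors`, and the cubic transform are not from the report); it shows that any pseudo-value inducing the same local order on states yields the same greedy policy as the value function itself.

```python
# Toy greedification: with a known transition model, the greedy policy
# heads toward the successor state of maximal (pseudo-)value.

def greedy_step(state, successors, value):
    """Move to the neighbor state with maximal (pseudo-)value."""
    return max(successors(state), key=value)

# 1-D chain with goal at state 5; from s one can move to s-1 or s+1 (clipped).
def successors(s):
    return [max(0, s - 1), min(5, s + 1)]

V = {s: -abs(5 - s) for s in range(6)}               # a value: -distance to goal
pseudo_V = {s: -abs(5 - s) ** 3 for s in range(6)}   # monotone transform: same local order

# Same local order on the state space => same greedy policy.
policy_V = {s: greedy_step(s, successors, V.get) for s in range(6)}
policy_pV = {s: greedy_step(s, successors, pseudo_V.get) for s in range(6)}
assert policy_V == policy_pV
```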
To the best of our knowledge, the proposed DiVa approach and the use of bad demonstrations to achieve direct value learning are original. The loss of optimality of the pseudo-value-based policy is analyzed and shown to be bounded under mild assumptions. Finally, the experimental validation of DiVa on the mountain car, bicycle and swing-up pendulum problems comparatively demonstrates the simplicity and the merits of the approach.
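The anti-imitation idea amounts to a pairwise ranking problem: along each bad demonstration, earlier states should receive a higher pseudo-value than later ones. The following is a minimal sketch under assumptions not stated in the report (a linear pseudo-value, a 1-D state read as distance from the good initial state, and a logistic pairwise loss in the RankNet style); the feature map, step size, and trajectories are all illustrative.

```python
import numpy as np

def features(s):
    # Hypothetical feature map: raw state plus a bias term.
    return np.array([s, 1.0])

def train_pseudo_value(trajectories, lr=0.1, epochs=200):
    """Pairwise logistic ranking: for each consecutive pair (s_t, s_{t+1})
    of a bad demonstration, push V_w(s_t) > V_w(s_{t+1})."""
    w = np.zeros(2)
    for _ in range(epochs):
        for traj in trajectories:
            for s_early, s_late in zip(traj, traj[1:]):
                d = features(s_early) - features(s_late)
                # Gradient ascent on log-sigmoid(w . d).
                w += lr * d / (1.0 + np.exp(w @ d))
    return w

# Toy bad demonstrations: state = distance from the good initial state,
# drifting upward under a random policy (the true value decreases along time).
trajs = [[0.0, 0.5, 1.2, 2.0, 3.5], [0.1, 0.6, 1.5, 3.0]]
w = train_pseudo_value(trajs)
# The learned pseudo-value ranks states nearer the good state higher.
assert w @ features(0.2) > w @ features(2.5)
```

Greedifying the learned pseudo-value then yields a policy that steers back toward the good region, without any expert demonstration.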
Document type: Reports (Research report)

Cited literature: 31 references
Contributor: Marc Schoenauer
Submitted on: Monday, January 4, 2016 - 3:57:25 PM
Last modification on: Thursday, October 27, 2022 - 3:38:33 AM
Long-term archiving on: Thursday, April 7, 2016 - 4:50:18 PM




  • HAL Id: hal-01249377, version 1


Riad Akrour, Basile Mayeur, Michèle Sebag. Direct Value Learning: Reinforcement Learning and Anti-Imitation. [Research Report] RR-8836, INRIA; CNRS; Université Paris-Sud 11. 2015, pp.18. ⟨hal-01249377⟩


