Anti Imitation-Based Policy Learning

Michèle Sebag; Riad Akrour; Basile Mayeur; Marc Schoenauer

doi:10.1007/978-3-319-46227-1_35

Conference paper, Year: 2016


Michèle Sebag

Role: Author
PersonId: 836537

Laboratoire de Recherche en Informatique

Machine Learning and Optimisation

Riad Akrour

Role: Author

Laboratoire de Recherche en Informatique

Machine Learning and Optimisation

Basile Mayeur

Role: Author

Laboratoire de Recherche en Informatique

Machine Learning and Optimisation

Marc Schoenauer

Role: Author
PersonId: 739309
IdHAL: evomarc
ORCID: 0000-0003-1450-6830
IdRef: 057775575

Laboratoire de Recherche en Informatique

Machine Learning and Optimisation

Abstract

The Anti Imitation-based Policy Learning (AIPoL) approach, taking inspiration from the energy-based learning framework (LeCun et al. 2006), aims at learning a pseudo-value function that induces the same order on the state space as a (nearly optimal) value function. By construction, the greedification of such a pseudo-value induces the same policy as the value function itself. The approach assumes that, thanks to prior knowledge, not-to-be-imitated demonstrations can easily be generated. For instance, applying a random policy from a good initial state (e.g., a bicycle in equilibrium) will on average lead to visiting states with decreasing values (the bicycle ultimately falls down). Such a demonstration, that is, a sequence of states with decreasing values, is used along with a standard learning-to-rank approach to define a pseudo-value function. If the model of the environment is known, this pseudo-value directly induces a policy by greedification. Otherwise, the bad demonstrations are exploited together with off-policy learning to learn a pseudo-Q-value function, from which a policy is likewise derived by greedification. To the best of our knowledge, the use of bad demonstrations to achieve policy learning is original. The theoretical analysis shows that the loss of optimality of the pseudo-value-based policy is bounded under mild assumptions, and the empirical validation of AIPoL on the mountain car, the bicycle and the swing-up pendulum problems demonstrates the simplicity and the merits of the approach.
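The known-model case described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the one-dimensional "tilt" dynamics in `step`, the action set, the hand-picked `features`, and the perceptron-style pairwise update (swapped in here for a standard learning-to-rank solver) are all assumptions chosen to keep the example short. The sketch generates bad demonstrations with a random policy, learns a linear pseudo-value that ranks each state above its successor, and derives a policy by greedification.

```python
import random

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete controls


def step(theta, action):
    """Toy unstable dynamics (an assumption): the tilt grows unless corrected."""
    return 1.2 * theta + 0.02 * action


def features(theta):
    """Hand-picked features of the state, symmetric in the sign of the tilt."""
    return (abs(theta), theta * theta)


def pseudo_value(w, theta):
    """Linear pseudo-value: only the order it induces on states matters."""
    return sum(wi * fi for wi, fi in zip(w, features(theta)))


def bad_demonstration(rng, length=10):
    """Random policy from a nearly balanced state; |theta| grows at every step."""
    theta = rng.choice((-1.0, 1.0)) * rng.uniform(0.15, 0.2)
    states = [theta]
    for _ in range(length):
        theta = step(theta, rng.choice(ACTIONS))
        states.append(theta)
    return states


def learn_pseudo_value(demos, epochs=50, lr=0.1, margin=1.0):
    """Pairwise ranking: each state should outrank its successor by a margin."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for states in demos:
            for earlier, later in zip(states, states[1:]):
                if pseudo_value(w, earlier) - pseudo_value(w, later) < margin:
                    # Perceptron-style step on the violated ranking constraint.
                    diff = [a - b for a, b in zip(features(earlier), features(later))]
                    w = [wi + lr * d for wi, d in zip(w, diff)]
    return w


def greedy_action(w, theta):
    """Greedification with a known model: pick the action with the best next state."""
    return max(ACTIONS, key=lambda a: pseudo_value(w, step(theta, a)))


def rollout(w, theta, steps=50):
    """Follow the greedy policy and return the final tilt."""
    for _ in range(steps):
        theta = step(theta, greedy_action(w, theta))
    return theta


rng = random.Random(0)
demos = [bad_demonstration(rng) for _ in range(20)]
w = learn_pseudo_value(demos)
print("weights:", w)
print("greedy action at theta=0.3:", greedy_action(w, 0.3))
print("tilt after 50 greedy steps from 0.05:", rollout(w, 0.05))
```

In this toy setting the demonstrations decrease in value at every step by construction, whereas the abstract only assumes a decrease on average; handling noisy orderings would call for a more robust ranking loss than the simple update used here.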

Domains

Artificial intelligence [cs.AI]

Main file

mainDIVA.pdf (1.12 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-01387652

Submitted on: Wednesday, October 26, 2016, 18:41:57

Last modified on: Monday, February 12, 2024, 09:48:04

Dates and versions

hal-01387652, version 1 (26-10-2016)

Identifiers

HAL Id: hal-01387652, version 1
DOI: 10.1007/978-3-319-46227-1_35

Cite

Michèle Sebag, Riad Akrour, Basile Mayeur, Marc Schoenauer. Anti Imitation-Based Policy Learning. Machine Learning and Knowledge Discovery in Databases - European Conference, ECML-PKDD 2016, Sep 2016, Riva del Garda, Italy. pp. 559-575, ⟨10.1007/978-3-319-46227-1_35⟩. ⟨hal-01387652⟩

Collections

CNRS INRIA UMR8623 CENTRALESUPELEC INRIA2 LRI-AO UNIV-PARIS-SACLAY GS-COMPUTER-SCIENCE

228 views

308 downloads
