Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

Jean Tarbouriech; Runlong Zhou; Simon S Du; Matteo Pirotta; Michal Valko; Alessandro Lazaric

Conference paper, Year: 2021

Jean Tarbouriech

Role: Author
PersonId : 1105513

Facebook AI Research [Paris]

Scool

Runlong Zhou

Role: Author
PersonId : 1120365

Tsinghua University [Beijing]

Simon S Du

Role: Author
PersonId : 1120366

Paul G. Allen School of Computer Science and Engineering [Seattle]

Matteo Pirotta

Role: Author
PersonId : 1105514

Facebook AI Research [Paris]

Michal Valko

Role: Author
PersonId : 1120367

DeepMind [Paris]

Alessandro Lazaric

Role: Author
PersonId : 1105515

Facebook AI Research [Paris]

Abstract

We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EB-SSP achieves the minimax regret rate Õ(B* √(SAK)), where K is the number of episodes, S is the number of states, A is the number of actions, and B* bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of B*, nor of T*, which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of T* is available) where the regret only contains a logarithmic dependence on T*, thus yielding the first (nearly) horizon-free regret bound beyond the finite-horizon MDP setting.
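To make the quantities B* and T* from the abstract concrete, the following is a minimal sketch of plain value iteration on a tiny SSP whose model is fully known. It is not EB-SSP, which must learn the transitions and costs from samples and adds an exploration bonus. The three-state toy problem, its numbers, and all function names are hypothetical and chosen only for illustration.

```python
"""Value iteration on a tiny, fully known SSP (illustrative toy, not EB-SSP)."""

GOAL = 2  # absorbing, zero-cost goal state

# Hypothetical toy model (an assumption for illustration only):
# P[s][a] maps next_state -> probability, C[s][a] is the cost of action a in s.
P = {
    0: [{GOAL: 0.5, 0: 0.5},   # action 0: reaches the goal half of the time
        {GOAL: 0.1, 0: 0.9}],  # action 1: cheaper per step, rarely reaches the goal
    1: [{0: 1.0}],             # single action: move to state 0
}
C = {
    0: [0.5, 0.2],
    1: [0.5],
}


def backup(s, a, values, cost):
    """One-step lookahead: immediate cost plus expected value of the next state."""
    return cost + sum(p * values[s2] for s2, p in P[s][a].items())


def value_iteration(tol=1e-12, max_iter=100_000):
    """Iterate the Bellman optimality operator until the values stop changing."""
    V = {0: 0.0, 1: 0.0, GOAL: 0.0}
    for _ in range(max_iter):
        new_V = dict(V)
        for s in P:  # the goal state keeps value 0
            new_V[s] = min(backup(s, a, V, C[s][a]) for a in range(len(P[s])))
        gap = max(abs(new_V[s] - V[s]) for s in V)
        V = new_V
        if gap < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {
        s: min(range(len(P[s])), key=lambda a: backup(s, a, V, C[s][a]))
        for s in P
    }
    return V, policy


def expected_time_to_goal(policy, tol=1e-12, max_iter=100_000):
    """Evaluate a fixed policy with unit costs, giving expected steps to the goal."""
    T = {0: 0.0, 1: 0.0, GOAL: 0.0}
    for _ in range(max_iter):
        new_T = dict(T)
        for s in P:
            new_T[s] = backup(s, policy[s], T, 1.0)
        gap = max(abs(new_T[s] - T[s]) for s in T)
        T = new_T
        if gap < tol:
            break
    return T


V_star, pi_star = value_iteration()
T_pi = expected_time_to_goal(pi_star)

B_star = max(V_star.values())  # bound on the optimal expected cumulative cost
T_star = max(T_pi.values())    # bound on the optimal policy's expected time-to-goal

print("optimal values:", V_star)
print("optimal policy:", pi_star)
print("B* =", B_star, " T* =", T_star)
```

In this toy problem the two bounds differ because each step costs less than 1, which illustrates why a regret bound that depends on T* only logarithmically is stronger than one that scales with it.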

Domains

Machine Learning [stat.ML], Machine Learning [cs.LG]

Main file

Stochastic Shortest Path.pdf (676.49 KB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-03479782

Submitted on: Tuesday, December 14, 2021, 14:58:00

Last modified on: Wednesday, January 24, 2024, 09:54:24

Long-term archiving on: Tuesday, March 15, 2022, 19:13:05

Dates and versions

hal-03479782 , version 1 (14-12-2021)

Identifiers

HAL Id : hal-03479782 , version 1

Cite

Jean Tarbouriech, Runlong Zhou, Simon S Du, Matteo Pirotta, Michal Valko, et al. Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret. Neural Information Processing Systems (NeurIPS), Dec 2021, Virtual/Sydney, Australia. ⟨hal-03479782⟩

Collections

CNRS INRIA CRISTAL INRIA2 UNIV-LILLE CRISTAL-SCOOL
