Online Stochastic Optimization under Correlated Bandit Feedback

Mohammad Gheshlaghi Azar 1, Alessandro Lazaric 2, Emma Brunskill 3
2 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract : In this paper we consider the problem of online stochastic optimization of a locally smooth function under bandit feedback. We introduce the high-confidence tree (HCT) algorithm, a novel anytime X-armed bandit algorithm, and derive regret bounds matching the performance of state-of-the-art algorithms in terms of the dependency on the number of steps and the near-optimality dimension. The main advantage of HCT is that it handles the challenging case of correlated bandit feedback (reward), whereas existing methods require rewards to be conditionally independent. HCT also improves on the state-of-the-art in terms of the memory requirement, and it requires a weaker smoothness assumption on the mean-reward function than the existing anytime algorithms. Finally, we discuss how HCT can be applied to the problem of policy search in reinforcement learning and we report preliminary empirical results.
Document type :
Conference papers

https://hal.inria.fr/hal-01080138
Contributor : Alessandro Lazaric
Submitted on : Tuesday, November 4, 2014 - 3:26:14 PM
Last modification on : Tuesday, March 26, 2019 - 3:42:25 PM
Long-term archiving on : Thursday, February 5, 2015 - 11:05:37 AM

Identifiers

  • HAL Id : hal-01080138, version 1

Citation

Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill. Online Stochastic Optimization under Correlated Bandit Feedback. 31st International Conference on Machine Learning, Jun 2014, Beijing, China. ⟨hal-01080138⟩
