Use of variance estimation in the multi-armed bandit problem

Jean-Yves Audibert 1 Rémi Munos 2 Csaba Szepesvari 3
2 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract : An important aspect of most decision making problems concerns the appropriate balance between exploitation (acting optimally according to the partial knowledge acquired so far) and exploration of the environment (acting sub-optimally in order to refine the current knowledge and improve future decisions). A typical example of this so-called exploration versus exploitation dilemma is the multi-armed bandit problem, for which many strategies have been developed. Here we investigate policies based the choice of the arm having the highest upper-confidence bound, where the bound takes into account the empirical variance of the different arms. Such an algorithm was found earlier to outperform its peers in a series of numerical experiments. The main contribution of this paper is the theoretical investigation of this algorithm. Our contribution here is twofold. First, we prove that with probability at least $1-\beta$, the regret after $n$ plays of a variant of the UCB algorithm (called $\beta$-UCB) is upper-bounded by a constant, that scales linearly with $\log(1/\beta)$, but which is independent from $n$. We also analyse a variant which is closer to the algorithm suggested earlier. We prove a logarithmic bound on the expected regret of this algorithm and argue that the bound scales favourably with the variance of the suboptimal arms.
Document type :
Other publications
Complete list of metadatas

Cited literature [3 references]  Display  Hide  Download

https://hal.inria.fr/inria-00203496
Contributor : Rémi Munos <>
Submitted on : Thursday, January 10, 2008 - 12:11:52 PM
Last modification on : Thursday, February 21, 2019 - 10:52:49 AM
Long-term archiving on : Tuesday, April 13, 2010 - 4:56:00 PM

File

ucbtuned.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00203496, version 1

Collections

Citation

Jean-Yves Audibert, Rémi Munos, Csaba Szepesvari. Use of variance estimation in the multi-armed bandit problem. 2006. ⟨inria-00203496⟩

Share

Metrics

Record views

484

Files downloads

299