Use of variance estimation in the multi-armed bandit problem

Jean-Yves Audibert 1 Rémi Munos 2 Csaba Szepesvari 3
2 SEQUEL - Sequential Learning
LIFL - Laboratoire d'Informatique Fondamentale de Lille, Inria Lille - Nord Europe, LAGIS - Laboratoire d'Automatique, Génie Informatique et Signal
Abstract : An important aspect of most decision making problems concerns the appropriate balance between exploitation (acting optimally according to the partial knowledge acquired so far) and exploration of the environment (acting sub-optimally in order to refine the current knowledge and improve future decisions). A typical example of this so-called exploration versus exploitation dilemma is the multi-armed bandit problem, for which many strategies have been developed. Here we investigate policies based the choice of the arm having the highest upper-confidence bound, where the bound takes into account the empirical variance of the different arms. Such an algorithm was found earlier to outperform its peers in a series of numerical experiments. The main contribution of this paper is the theoretical investigation of this algorithm. Our contribution here is twofold. First, we prove that with probability at least $1-\beta$, the regret after $n$ plays of a variant of the UCB algorithm (called $\beta$-UCB) is upper-bounded by a constant, that scales linearly with $\log(1/\beta)$, but which is independent from $n$. We also analyse a variant which is closer to the algorithm suggested earlier. We prove a logarithmic bound on the expected regret of this algorithm and argue that the bound scales favourably with the variance of the suboptimal arms.
Type de document :
Autre publication
NIPS Workshop on On-line Trading of Exploration and ExploitationWorkshop. 2006
Liste complète des métadonnées

Littérature citée [3 références]  Voir  Masquer  Télécharger
Contributeur : Rémi Munos <>
Soumis le : jeudi 10 janvier 2008 - 12:11:52
Dernière modification le : jeudi 11 janvier 2018 - 06:22:13
Document(s) archivé(s) le : mardi 13 avril 2010 - 16:56:00


Fichiers produits par l'(les) auteur(s)


  • HAL Id : inria-00203496, version 1



Jean-Yves Audibert, Rémi Munos, Csaba Szepesvari. Use of variance estimation in the multi-armed bandit problem. NIPS Workshop on On-line Trading of Exploration and ExploitationWorkshop. 2006. 〈inria-00203496〉



Consultations de la notice


Téléchargements de fichiers