Algorithms for Markovian bandits : Indexability and Learning

Kimang Khun

Résumé

Markovian bandits are a subclass of multi-armed bandit problems where one has to activate a set of arms at each decision time. The activated arms evolve in an active Markovian manner. Those that are not activated (i.e., are passive) either remain frozen – then one falls into the category of rested Markovian bandits – or evolve in a passive Markovian manner – the setting is then called restless Markovian bandit. Such problems suffer from the curse of dimensionality that often makes the exact solution computationally prohibitive. So, one has to resort to heuristics such as index policy. Two celebrated indices are Gittins index for rested bandits and Whittle index for restless bandits.This thesis focuses on two setups: (1) index computation when all model parameters are known and (2) algorithm design when the parameters are unknown.For index computation, we first cover the ambiguities in the classical condition that guarantees the existence of the Whittle index in restless bandits. We then refine this condition to assure the uniqueness of the index when it exists. We then develop an algorithm to test the refined condition and compute the Whittle indices of any restless bandit arm. We prove that the theoretical complexity of this algorithm is O(S2.5286), where S is the number of arm’s states. We also develop an efficient implementation of this algorithm in python programming language. It can compute indices of restless arms with several thousands of states in less than a few seconds.For learning in rested bandits, we show that MB-PSRL and MB-UCBVI, respectively the modified versions of PSRL and UCBVI algorithms, can leverage Gittins index policy to have a regret bound and a runtime scalable in the number of arms. Fur- thermore, we show that MB-UCRL2, a modified version of UCRL2, also has a regret bound scalable in the number of arms. Yet, we show that MB-UCRL2 and any modification of UCRL2’s variants to rested bandit likely have a runtime exponential in the number of arms. When learning in restless bandits, the regret of algorithms depends heavily on the structure of the bandit. So, we study how the structure of arms translates into the structure of the bandit. We show that no learning algorithms can perform uniformly well over the general class of restless bandits. We also show that defining a subclass of restless bandits with a desirable learning structure by relying on the assumption of arms is difficult.

Un bandit markovien est un problème de décision séquentielle dans lequel un sous-ensemble de bras doiventêtre activés à chaque instant, et les bras évoluent de manière markovienne. Il y a deux catégories de banditsmarkoviens. Si les bras qui ne sont pas activés restent figés, on entre alors dans la catégorie des banditsmarkoviens avec repos. S’ils évoluent de manière markovienne, on parle alors de bandit markovien sans repos.En général, les bandits markoviens souffrent de la malédiction de la dimension qui rend souvent la solutionexacte prohibitive en terme de calculs. Il faut donc recourir à des heuristiques telles que les politiques d’indice.Deux indices célèbres sont l’indice de Gittins pour les bandits avec repos et l’indice de Whittle pour les banditssans repos.Cette thèse se concentre sur deux questions : (1) le calcul d’indices lorsque tous les paramètres du modèle sontconnus et (2) les algorithmes d’apprentissage lorsque les paramètres sont inconnus.Pour le calcul de l’indice, nous relevons les ambiguïtés de la définition classique de l’indexabilité et proposonsune définition qui assure l’unicité de l’indice de Whittle quand ce dernier existe. Nous développons ensuiteun algorithme testant l’indexabilité et calculant les indices de Whittle. La complexité théorique de notrealgorithme est O(S2.5286), où S est le nombre d’états du bras.Pour l’apprentissage dans les bandits avec repos, nous montrons que MB-PSRL et MB-UCBVI, des versionsmodifiées des algorithmes PSRL et UCBVI, peuvent tirer parti de la politique d’indice de Gittins pour avoirune garantie de regret et un temps d’exécution qui passent à l’échelle avec le nombre de bras. De plus, nousmontrons que MB-UCRL2, une version modifiée de UCRL2, possède également une garantie de regret quipasse à l’échelle. Cependant, MB-UCRL2 a un temps d’exécution exponentiel dans le nombre de bras. Lors del’apprentissage dans les bandits sans repos, la garantie de regret dépend fortement de la structure du bandit.Ainsi, nous étudions comment la structure des bras se traduit dans la structure du bandit. Nous exposons unesous-classe de bandits sans repos qui ne sont pas apprenables. Nous montrons également qu’il est difficile deconstruire des hypothèses sur les bras qui rendent les bandits sans repos apprenables efficacement.

Algorithms for Markovian bandits : Indexability and Learning

Des algorithmes pour les bandits markoviens : indexabilité et apprentissage

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Partager