S. Agrawal and N. Goyal, Analysis of thompson sampling for the multi-armed bandit problem, Conference on Learning Theory, pp.39-40, 2012.

P. Auer, Using upper confidence bounds for online learning, Proceedings. 41st Annual Symposium on, pp.270-279, 2000.

P. Auer, Using confidence bounds for exploitationexploration trade-offs, Journal of Machine Learning Research, vol.3, pp.397-422, 2002.

R. , Sleeping-EXP3G) ? 2 KT log K + K 2 K+3 T log T + K2 K+3 (log T )

T. ?-k-2-k+4-t-log and . K+3,

, The latter can also be performed with a computational cost of O(tK). Yet, the algorithm needs to keep in memory the empirical distribution of S 1, As for the complexity, the only difference with Alg. 1 comes from the computation ofq t (i)