Analysis of thompson sampling for the multi-armed bandit problem, Conference on Learning Theory, pp.39-40, 2012. ,
Using upper confidence bounds for online learning, Proceedings. 41st Annual Symposium on, pp.270-279, 2000. ,
Using confidence bounds for exploitationexploration trade-offs, Journal of Machine Learning Research, vol.3, pp.397-422, 2002. ,
, Sleeping-EXP3G) ? 2 KT log K + K 2 K+3 T log T + K2 K+3 (log T )
,
, The latter can also be performed with a computational cost of O(tK). Yet, the algorithm needs to keep in memory the empirical distribution of S 1, As for the complexity, the only difference with Alg. 1 comes from the computation ofq t (i)