Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, Improved algorithms for linear stochastic bandits, Advances in Neural Information Processing Systems, vol.24, pp.2312-2320, 2011.

P. Auer, Using confidence bounds for exploitation-exploration trade-offs, J. Mach. Learn. Res., vol.3, pp.397-422, 2003.

P. Auer, T. Jaksch, and R. Ortner, Near-optimal regret bounds for reinforcement learning, Advances in Neural Information Processing Systems, vol.21, pp.89-96, 2009.

S. Bubeck and N. Cesa-Bianchi, Regret analysis of stochastic and nonstochastic multi-armed bandit problems, Foundations and Trends in Machine Learning, vol.5, pp.1-122, 2012.

A. Dasdan, S. S. Irani, and R. K. Gupta, Efficient algorithms for optimum cycle mean and optimum cost to time ratio problems, Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, DAC '99, pp.37-42, 1999.

S. Filippi, O. Cappé, and A. Garivier, Optimally sensing a single channel without prior information: The tiling algorithm and regret bounds, IEEE Journal of Selected Topics in Signal Processing, vol.5, issue.1, pp.68-76, 2011.
URL: https://hal.archives-ouvertes.fr/hal-00408867

H. Heidari, M. Kearns, and A. Roth, Tight policy regret bounds for improving and decaying bandits, Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pp.1562-1570, 2016.

J. Herlocker, J. Konstan, A. Borchers, and J. Riedl, An algorithmic framework for performing collaborative filtering, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, 1999.

T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, J. Mach. Learn. Res., vol.11, pp.1563-1600, 2010.

K. G. Jamieson and A. Talwalkar, Non-stochastic best arm identification and hyperparameter optimization, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS, 2016.

K. Kapoor, K. Subbian, J. Srivastava, and P. Schrater, Just in time recommendations: Modeling the dynamics of boredom in activity streams, Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15, pp.233-242, 2015.

R. M. Karp, A characterization of the minimum cycle mean in a digraph, Discrete Mathematics, vol.23, pp.309-311, 1978.

J. Komiyama and T. Qin, Time-decaying bandits for non-stationary systems, Web and Internet Economics, WINE 2014, pp.460-466, 2014.

Y. Koren, Collaborative filtering with temporal dynamics, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pp.447-456, 2009.

Y. Koren, R. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems, Computer, vol.42, issue.8, pp.30-37, 2009.

L. Li, W. Chu, J. Langford, and R. E. Schapire, A contextual-bandit approach to personalized news article recommendation, Proceedings of the 19th International Conference on World Wide Web, WWW '10, pp.661-670, 2010.

L. Li, W. Chu, J. Langford, and X. Wang, Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pp.297-306, 2011.

R. Ortner, Online regret bounds for Markov decision processes with deterministic transitions, Algorithmic Learning Theory, pp.123-137, 2008.

R. Ortner, D. Ryabko, P. Auer, and R. Munos, Regret bounds for restless Markov bandits, Theor. Comput. Sci., vol.558, pp.62-76, 2014.
URL: https://hal.archives-ouvertes.fr/hal-00765450

G. Shani, D. Heckerman, and R. I. Brafman, An MDP-based recommender system, J. Mach. Learn. Res., vol.6, pp.1265-1295, 2005.

M. Soare, A. Lazaric, and R. Munos, Best-arm identification in linear bandits, Advances in Neural Information Processing Systems, vol.27, 2014.
URL: https://hal.archives-ouvertes.fr/hal-01075701

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

A. Swaminathan and T. Joachims, Batch learning from logged bandit feedback through counterfactual risk minimization, J. Mach. Learn. Res., vol.16, issue.1, pp.1731-1755, 2015.

C. Tekin and M. Liu, Online learning of rested and restless bandits, IEEE Transactions on Information Theory, vol.58, issue.8, 2012.