J. Abernethy, C. Lee, A. Sinha, and A. Tewari, Online linear optimization via smoothing, Proceedings of The 27th Conference on Learning Theory, vol.35, pp.807-823, 2014.

S. Agrawal and N. Goyal, Further optimal regret bounds for thompson sampling, AISTATS, pp.99-107, 2013.

S. Arora, E. Hazan, and S. Kale, The multiplicative weights update method: A meta-algorithm and applications, Theory of Computing, vol.8, pp.121-164, 2012.

J. Audibert and S. Bubeck, Minimax policies for bandits games, Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

P. Auer and R. Ortner, UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem, Periodica Mathematica Hungarica, vol.61, pp.31-5303, 2010.

P. Auer, N. Cesa-bianchi, Y. Freund, and R. E. Schapire, Gambling in a rigged casino: The adversarial multi-armed bandit problem, Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pp.322-331, 1995.

P. Auer, N. Cesa-bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Mach. Learn, vol.47, issue.2-3, pp.235-256, 2002.

P. Auer, N. Cesa-bianchi, Y. Freund, and R. E. Schapire, The nonstochastic multiarmed bandit problem, SIAM J. Comput, vol.32, issue.1, pp.97-5397, 2002.

S. Bubeck and N. Cesa-bianchi, Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, 2012.

O. Cappé, A. Garivier, O. Maillard, R. Munos, and G. Stoltz, Kullback-leibler upper confidence bounds for optimal sequential allocation, The Annals of Statistics, vol.41, issue.3, pp.1516-1541, 2013.

O. Catoni, Challenging the empirical mean and empirical variance: A deviation study, Probabilités et Statistiques, vol.48, issue.4, p.2012
URL : https://hal.archives-ouvertes.fr/hal-00517206

N. Cesa-bianchi and P. Fischer, Finite-time regret bounds for the multiarmed bandit problem, ICML, pp.100-108, 1998.

A. Garivier, E. Kaufmann, and T. Lattimore, On explore-then-commit strategies, NIPS, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01322906

L. P. Kaelbling, M. L. Littman, and A. W. Moore, Reinforcement learning: A survey, Journal of artificial intelligence research, vol.4, pp.237-285, 1996.

E. Kaufmann, N. Korda, and R. Munos, Thompson sampling: An asymptotically optimal finite-time analysis, ALT'12, pp.199-213, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00830033

V. Kuleshov and D. Precup, Algorithms for multi-armed bandit problems, 2014.

T. L. Lai and H. Robbins, Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, vol.6, pp.4-22, 1985.

I. Osband, B. Van-roy, and Z. Wen, Generalization and exploration via randomized value functions, 2016.

T. Perkins and D. Precup, A convergent form of approximate policy iteration, Advances in Neural Information Processing Systems 15, pp.1595-1602, 2003.

Y. Seldin and A. Slivkins, One practical algorithm for both stochastic and adversarial bandits, Proceedings of the 30th International Conference on Machine Learning (ICML 2014), pp.1287-1295, 2014.

S. P. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, Convergence results for single-step on-policy reinforcement-learning algorithms, Machine Learning, vol.38, pp.287-308, 2000.

R. Sutton, Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, Proceedings of the Seventh International Conference on Machine Learning, pp.216-224, 1990.
DOI : 10.1016/b978-1-55860-141-3.50030-4
URL : http://papersdb.cs.ualberta.ca/~papersdb/uploaded_files/paper_p160-sutton.pdf.stjohn

R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 1998.

R. S. Sutton, D. A. Mcallester, S. P. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems, vol.12, pp.1057-1063, 1999.

. Cs and . Szepesvári, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, 2010.

J. Vermorel and M. Mohri, Multi-armed bandit algorithms and empirical evaluation, European conference on machine learning, pp.437-448, 2005.
DOI : 10.1007/11564096_42
URL : https://link.springer.com/content/pdf/10.1007%2F11564096_42.pdf