J. Y. Audibert and S. Bubeck, Minimax policies for adversarial and stochastic bandits, 22nd Annual Conference on Learning Theory, 2009.
URL : https://hal.archives-ouvertes.fr/hal-00834882

J. Y. Audibert, R. Munos, and C. Szepesvári, Exploration-exploitation trade-off using variance estimates in multi-armed bandits, Theoretical Computer Science, 2008.

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, issue.2/3, pp.235-256, 2002.
DOI : 10.1023/A:1013689704352

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, The Nonstochastic Multiarmed Bandit Problem, SIAM Journal on Computing, vol.32, issue.1, 2002.
DOI : 10.1137/S0097539701398375

A. Blum and Y. Mansour, From External to Internal Regret, COLT, pp.621-636, 2005.
DOI : 10.1007/11503415_42
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.138.5182

S. Bubeck, Bandits Games and Clustering Foundations, 2010.
URL : https://hal.archives-ouvertes.fr/tel-00845565

N. Cesa-Bianchi and G. Lugosi, Potential-based algorithms in on-line prediction and game theory, Machine Learning, vol.51, issue.3, pp.239-261, 2003.
DOI : 10.1023/A:1022901500417

D. Ernst, G. Stan, J. Goncalves, and L. Wehenkel, Clinical data based optimal STI strategies for HIV: a reinforcement learning approach, Proceedings of the 45th IEEE Conference on Decision and Control, pp.65-72, 2006.
DOI : 10.1109/CDC.2006.377527
URL : https://hal.archives-ouvertes.fr/hal-00121732

D. Foster and R. Vohra, Asymptotic calibration, Biometrika, vol.85, issue.2, pp.379-390, 1998.
DOI : 10.1093/biomet/85.2.379
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.133.8037

D. P. Foster and R. Vohra, Regret in the on-line decision problem, Games and Economic Behavior, vol.29, issue.1-2, pp.7-35, 1999.

Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, EuroCOLT '95: Proceedings of the Second European Conference on Computational Learning Theory, pp.23-37, 1995.

S. Hart and A. Mas-Colell, A Simple Adaptive Procedure Leading to Correlated Equilibrium, Econometrica, vol.68, issue.5, pp.1127-1150, 2000.
DOI : 10.1111/1468-0262.00153

M. Hutter, Feature reinforcement learning: Part I: Unstructured MDPs, Journal of Artificial General Intelligence, vol.1, pp.3-24, 2009.

V. Kanade, H. B. McMahan, and B. Bryan, Sleeping experts and bandits with stochastic action availability and adversarial rewards, AISTATS, 2009.

R. D. Kleinberg, A. Niculescu-Mizil, and Y. Sharma, Regret bounds for sleeping experts and bandits, Conference on Learning Theory, 2008.
DOI : 10.1007/s10994-010-5178-7
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.143.6257

R. Ortner, Online regret bounds for Markov decision processes with deterministic transitions, ALT '08: Proceedings of the 19th International Conference on Algorithmic Learning Theory, pp.123-137, 2008.
DOI : 10.1016/j.tcs.2010.04.005
URL : http://doi.org/10.1016/j.tcs.2010.04.005

H. Robbins, Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, vol.58, issue.5, pp.527-535, 1952.
DOI : 10.1090/S0002-9904-1952-09620-8

D. Ryabko and M. Hutter, On the possibility of learning in reactive environments with arbitrary dependence, Theoretical Computer Science, vol.405, issue.3, pp.274-284, 2008.
DOI : 10.1016/j.tcs.2008.06.039
URL : https://hal.archives-ouvertes.fr/hal-00639569

G. Stoltz, Incomplete information and internal regret in prediction of individual sequences, 2005.
URL : https://hal.archives-ouvertes.fr/tel-00009759