Online linear optimization via smoothing, Proceedings of the 27th Conference on Learning Theory, vol.35, pp.807-823, 2014.
Further optimal regret bounds for Thompson sampling, AISTATS, pp.99-107, 2013.
The multiplicative weights update method: A meta-algorithm and applications, Theory of Computing, vol.8, pp.121-164, 2012.
Minimax policies for bandits games, Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem, Periodica Mathematica Hungarica, vol.61, pp.55-65, 2010.
Gambling in a rigged casino: The adversarial multi-armed bandit problem, Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pp.322-331, 1995.
Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, issue.2-3, pp.235-256, 2002.
The nonstochastic multiarmed bandit problem, SIAM Journal on Computing, vol.32, issue.1, pp.48-77, 2002.
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, 2012.
Kullback-Leibler upper confidence bounds for optimal sequential allocation, The Annals of Statistics, vol.41, issue.3, pp.1516-1541, 2013.
Challenging the empirical mean and empirical variance: A deviation study, Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, vol.48, issue.4, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00517206
Finite-time regret bounds for the multiarmed bandit problem, ICML, pp.100-108, 1998.
On explore-then-commit strategies, NIPS, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01322906
Reinforcement learning: A survey, Journal of Artificial Intelligence Research, vol.4, pp.237-285, 1996.
Thompson sampling: An asymptotically optimal finite-time analysis, ALT'12, pp.199-213, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00830033
Algorithms for multi-armed bandit problems, 2014.
Asymptotically efficient adaptive allocation rules, Advances in Applied Mathematics, vol.6, pp.4-22, 1985.
Generalization and exploration via randomized value functions, 2016.
A convergent form of approximate policy iteration, Advances in Neural Information Processing Systems 15, pp.1595-1602, 2003.
One practical algorithm for both stochastic and adversarial bandits, Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp.1287-1295, 2014.
Convergence results for single-step on-policy reinforcement-learning algorithms, Machine Learning, vol.38, pp.287-308, 2000.
Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, Proceedings of the Seventh International Conference on Machine Learning, pp.216-224, 1990.
DOI : 10.1016/b978-1-55860-141-3.50030-4
URL : http://papersdb.cs.ualberta.ca/~papersdb/uploaded_files/paper_p160-sutton.pdf.stjohn
Reinforcement Learning: An Introduction, 1998.
Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems, vol.12, pp.1057-1063, 1999.
Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning, 2010.
Multi-armed bandit algorithms and empirical evaluation, European Conference on Machine Learning, pp.437-448, 2005.
DOI : 10.1007/11564096_42
URL : https://link.springer.com/content/pdf/10.1007%2F11564096_42.pdf