... or perhaps other myopic techniques could produce better results. In particular, using optimistic techniques on the value function, such as BOSS (Asmuth et al., 2009), does not have the same impact on belief-dependent rewards, because these rewards evolve during execution.

M. Araya-López, O. Buffet, V. Thomas, and F. Charpillet, A POMDP extension with belief-dependent rewards, Advances in Neural Information Processing Systems, 2010.

J. Asmuth, L. Li, M. Littman, A. Nouri, and D. Wingate, A Bayesian sampling approach to exploration in reinforcement learning, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI'09), 2009.

R. Bellman, The theory of dynamic programming, Bulletin of the American Mathematical Society, vol.60, issue.6, pp.503-516, 1954.
DOI : 10.1090/S0002-9904-1954-09848-8

R. Brafman and M. Tennenholtz, R-max - a general polynomial time algorithm for near-optimal reinforcement learning, Journal of Machine Learning Research, vol.3, pp.213-231, 2003.

Ö. Şimşek and A. G. Barto, An intrinsic reward mechanism for efficient exploration, Proceedings of the 23rd international conference on Machine learning, ICML '06, pp.833-840, 2006.
DOI : 10.1145/1143844.1143949

C. Dimitrakakis, Tree Exploration for Bayesian RL Exploration, 2008 International Conference on Computational Intelligence for Modelling Control & Automation, pp.1029-1034, 2008.
DOI : 10.1109/CIMCA.2008.32
URL : http://arxiv.org/abs/0902.0392

M. Duff, Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes, 2002.

J. C. Gittins, Bandit processes and dynamic allocation indices, Journal of the Royal Statistical Society, vol.41, issue.2, pp.148-177, 1979.
DOI : 10.1002/9780470980033

A. Jonsson and A. Barto, Active Learning of Dynamic Bayesian Networks in Markov Decision Processes, Proceedings of the 7th International Conference on Abstraction, Reformulation, and Approximation, SARA'07, pp.273-284, 2007.
DOI : 10.1007/978-3-540-73580-9_22

J. Kolter and A. Ng, Near-Bayesian exploration in polynomial time, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, 2009.
DOI : 10.1145/1553374.1553441

A. Y. Ng, D. Harada, and S. Russell, Policy invariance under reward transformations: Theory and application to reward shaping, Proceedings of the Sixteenth International Conference on Machine Learning, pp.278-287, 1999.

P. Poupart, N. Vlassis, J. Hoey, and K. Regan, An analytic solution to discrete Bayesian reinforcement learning, Proceedings of the 23rd international conference on Machine learning, ICML '06, 2006.
DOI : 10.1145/1143844.1143932

M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
DOI : 10.1002/9780470316887

T. Rauber, T. Braun, and K. Berns, Probabilistic distance measures of the Dirichlet and Beta distributions, Pattern Recognition, vol.41, issue.2, pp.637-645, 2008.
DOI : 10.1016/j.patcog.2007.06.023

N. Roy and S. Thrun, Coastal navigation with mobile robots, Advances in Neural Information Processing Systems 12, pp.1043-1049, 1999.

J. Sorg, S. Singh, and R. Lewis, Variance-based rewards for approximate Bayesian reinforcement learning, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, 2010.

M. J. Strens, A Bayesian framework for reinforcement learning, Proceedings of the International Conference on Machine Learning (ICML'00), pp.943-950, 2000.

R. Sutton and A. Barto, Reinforcement Learning: An Introduction, IEEE Transactions on Neural Networks, vol.9, issue.5, 1998.
DOI : 10.1109/TNN.1998.712192

C. Szepesvári, Reinforcement Learning Algorithms for MDPs - A Survey, 2009.