J. C. Gittins, Bandit processes and dynamic allocation indices, Journal of the Royal Statistical Society. Series B (Methodological), vol.41, issue.2, pp.148-177, 1979.

J. C. Gittins and D. M. Jones, A dynamic allocation index for the discounted multiarmed bandit problem, Biometrika, vol.66, issue.3, pp.561-565, 1979.
DOI : 10.1093/biomet/66.3.561

P. Whittle, Multi-armed bandits and the Gittins index, Journal of the Royal Statistical Society. Series B (Methodological), vol.42, issue.2, pp.143-149, 1980.

P. Varaiya, J. Walrand, and C. Buyukkoc, Extensions of the multiarmed bandit problem: The discounted case, IEEE Transactions on Automatic Control, vol.30, issue.5, pp.426-439, 1985.
DOI : 10.1109/TAC.1985.1103989

M. Katehakis and A. Veinott, The Multi-Armed Bandit Problem: Decomposition and Computation, Mathematics of Operations Research, vol.12, issue.2, pp.262-268, 1987.
DOI : 10.1287/moor.12.2.262

I. Sonin, A generalized Gittins index for a Markov chain and its recursive calculation, Statistics & Probability Letters, vol.78, issue.12, pp.1526-1533, 2008.
DOI : 10.1016/j.spl.2008.01.049

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.99.1058

J. Niño-Mora, Fast-Pivoting Algorithm for the Gittins Index and Optimal Stopping of a Markov Chain, INFORMS Journal on Computing, vol.19, issue.4, pp.596-606, 2007.
DOI : 10.1287/ijoc.1060.0206

C. J. Watkins, Learning from Delayed Rewards, Ph.D. dissertation, King's College, University of Cambridge, 1989.

N. Cesa-Bianchi and P. Fischer, Finite-time regret bounds for the multiarmed bandit problem, Proceedings of the 15th International Conference on Machine Learning (ICML), pp.100-108, 1998.

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, issue.2/3, pp.235-256, 2002.
DOI : 10.1023/A:1013689704352

M. Tokic, Adaptive ε-Greedy Exploration in Reinforcement Learning Based on Value Differences, KI'10: Proceedings of the 33rd Annual German Conference on Advances in Artificial Intelligence, 2010.
DOI : 10.1007/978-3-642-16111-7_23

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.458.464

J. Vermorel and M. Mohri, Multi-armed Bandit Algorithms and Empirical Evaluation, Proceedings of the 16th European Conference on Machine Learning (ECML 2005), pp.437-448, 2005.
DOI : 10.1007/11564096_42

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.109.4518

L. P. Kaelbling, Learning in Embedded Systems, MIT Press, 1993.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, Gambling in a rigged casino: The adversarial multi-armed bandit problem, Proceedings of IEEE 36th Annual Foundations of Computer Science, pp.322-331, 1995.
DOI : 10.1109/SFCS.1995.492488

K. S. Narendra and M. A. Thathachar, Learning Automata: An Introduction, Prentice-Hall, 1989.

M. Thathachar and P. Sastry, Estimator algorithms for learning automata, Proceedings of the Platinum Jubilee Conference on Systems and Signal Processing, pp.29-32, 1986.

B. Oommen and M. Agache, Continuous and discretized pursuit learning schemes: various algorithms and their comparison, IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), vol.31, issue.3, pp.277-287, 2001.
DOI : 10.1109/3477.931507

URL : http://ce.sharif.edu/courses/83-84/1/ce717/resources/root/oa01.pdf

T. Norheim, T. Bradland, O. C. Granmo, and B. J. Oommen, A generic solution to multi-armed Bernoulli bandit problems based on random sampling from sibling conjugate priors, 2010.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

J. Wyatt, Exploration and Inference in Learning from Reinforcement, Ph.D. dissertation, University of Edinburgh, 1997.

R. Dearden, N. Friedman, and S. Russell, Bayesian Q-learning, Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), pp.761-768, 1998.

W. R. Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, Biometrika, vol.25, issue.3-4, pp.285-294, 1933.
DOI : 10.1093/biomet/25.3-4.285

O. Granmo, Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton, International Journal of Intelligent Computing and Cybernetics, vol.3, issue.2, pp.207-234, 2010.

O. C. Granmo and S. Berg, Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters, IEA-AIE 2010, 2010.
DOI : 10.1007/978-3-642-13033-5_21

M. Agache and B. J. Oommen, Generalized pursuit learning schemes: new families of continuous and discretized learning automata, IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), vol.32, issue.6, pp.738-749, 2002.
DOI : 10.1109/TSMCB.2002.1049608

URL : http://ce.sharif.edu/courses/84-85/2/ce717/resources/root/ao02.pdf

T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.