T. Archibald, K. McKinnon, and L. Thomas, On the generation of Markov decision processes, Journal of the Operational Research Society, vol.46, pp.354-361, 1995.

D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, 1996.

A. Lazaric, M. Ghavamzadeh, and R. Munos, Finite-sample analysis of least-squares policy iteration, Journal of Machine Learning Research, vol.13, pp.3041-3074, 2012.

A. Nedić and D. Bertsekas, Least squares policy evaluation algorithms with linear function approximation, Discrete Event Dynamic Systems: Theory and Applications, vol.13, pp.79-110, 2002.

B. Scherrer and B. Lesner, On the use of non-stationary policies for stationary infinite-horizon Markov decision processes, Advances in Neural Information Processing Systems (NIPS), 2012.

J. Tsitsiklis and B. Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Transactions on Automatic Control, 1997.