Y. Abbasi-Yadkori and C. Szepesvári, Regret bounds for the adaptive control of linear quadratic systems, COLT, pp.1-26, 2011.

Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, Improved algorithms for linear stochastic bandits, Advances in Neural Information Processing Systems 24 (NIPS), pp.2312-2320, 2011.

A. Anandkumar, D. Hsu, and S. M. Kakade, A method of moments for mixture models and hidden Markov models, 2012.

A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky, Tensor decompositions for learning latent variable models, The Journal of Machine Learning Research, vol.15, issue.1, pp.2773-2832, 2014.

A. Atrash and J. Pineau, Efficient planning and tracking in POMDPs with large observation spaces, AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems, 2006.

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, vol.47, issue.2/3, pp.235-256, 2002.
DOI : 10.1023/A:1013689704352

P. Auer, T. Jaksch, and R. Ortner, Near-optimal regret bounds for reinforcement learning, Advances in Neural Information Processing Systems, pp.89-96, 2009.

J. A. Bagnell, S. M. Kakade, J. G. Schneider, and A. Y. Ng, Policy search by dynamic programming, Advances in Neural Information Processing Systems 16, pp.831-838, 2004.

P. L. Bartlett and A. Tewari, REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs, Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence, 2009.

A. G. Barto, R. S. Sutton, and C. W. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, vol.13, issue.5, pp.834-846, 1983.
DOI : 10.1109/TSMC.1983.6313077

J. Baxter and P. L. Bartlett, Infinite-horizon policy-gradient estimation, Journal of Artificial Intelligence Research, vol.15, pp.319-350, 2001.

D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

B. Boots, S. M. Siddiqi, and G. J. Gordon, Closing the learning-planning loop with predictive state representations, The International Journal of Robotics Research, vol.30, issue.7, pp.954-966, 2011.
DOI : 10.1177/0278364911404092

R. I. Brafman and M. Tennenholtz, R-max - a general polynomial time algorithm for near-optimal reinforcement learning, The Journal of Machine Learning Research, vol.3, pp.213-231, 2003.

Y. Li, B. Yin, and H. Xi, Finding optimal memoryless policies of POMDPs under the expected average reward criterion, European Journal of Operational Research, vol.211, issue.3, pp.556-567, 2011.
DOI : 10.1016/j.ejor.2010.12.014

M. L. Littman, Memoryless policies: Theoretical limitations and practical results, Proceedings of the Third International Conference on Simulation of Adaptive Behavior: From Animals to Animats 3 (SAB94), pp.238-245, 1994.

M. L. Littman, R. S. Sutton, and S. Singh, Predictive representations of state, Advances in Neural Information Processing Systems 14, pp.1555-1561, 2001.

J. Loch and S. P. Singh, Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes, ICML, pp.323-331, 1998.

O. Madani, On the computability of infinite-horizon partially observable Markov decision processes, AAAI-98 Fall Symposium on Planning with POMDPs, 1998.

L. Meng and B. Zheng, The optimal perturbation bounds of the Moore-Penrose inverse under the Frobenius norm, Linear Algebra and its Applications, vol.432, issue.4, pp.956-963, 2010.
DOI : 10.1016/j.laa.2009.10.009

A. Y. Ng and M. Jordan, PEGASUS: A policy search method for large MDPs and POMDPs, Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI'00), pp.406-415, 2000.

P. Auer and R. Ortner, Logarithmic online regret bounds for undiscounted reinforcement learning, Advances in Neural Information Processing Systems, vol.19, p.49, 2007.

R. Ortner, O. Maillard, and D. Ryabko, Selecting near-optimal approximate state representations in reinforcement learning, Algorithmic Learning Theory, pp.140-154, 2014.
DOI : 10.1007/978-3-319-11662-4_11

URL : https://hal.archives-ouvertes.fr/hal-01057562

C. Papadimitriou and J. N. Tsitsiklis, The complexity of Markov decision processes, Mathematics of Operations Research, vol.12, issue.3, pp.441-450, 1987.
DOI : 10.1287/moor.12.3.441

T. J. Perkins, Reinforcement learning for POMDPs based on action values and stochastic optimization, Proceedings of the Eighteenth National Conference on Artificial Intelligence and Fourteenth Conference on Innovative Applications of Artificial Intelligence (AAAI/IAAI 2002), pp.199-204, 2002.

S. Png, J. Pineau, and B. Chaib-draa, Building adaptive dialogue systems via Bayes-adaptive POMDPs, IEEE Journal of Selected Topics in Signal Processing, vol.6, issue.8, pp.917-927, 2012.

P. Poupart and N. Vlassis, Model-based Bayesian reinforcement learning in partially observable domains, International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2008.