Abstract: We consider continuous-state, continuous-action batch reinforcement learning, where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration in which greedy action selection is replaced by a search over a restricted set of candidate policies, choosing the policy that maximizes the average of the action values. We provide a rigorous analysis of this algorithm, proving what we believe to be the first finite-time bound for value-function-based algorithms on continuous state and action problems.
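To make the algorithmic idea concrete, below is a minimal sketch of the variant described above: instead of maximizing Q over actions pointwise, each iteration searches a restricted (here, finite) set of candidate policies for the one with the highest average action value over the batch states. This is an illustrative reading of the abstract, not the paper's implementation; the regressor choice (`ExtraTreesRegressor`), the function name, and all parameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # assumed regressor; any fitted-value regressor works


def fitted_q_iteration_policy_search(batch, candidate_policies, gamma=0.99, n_iters=50):
    """Sketch of fitted Q-iteration with greedy action selection replaced
    by a search over a restricted set of candidate policies.

    batch: list of (s, a, r, s_next) transitions; s and a are 1-D arrays.
    candidate_policies: finite list of functions pi(s) -> action, standing
        in for the restricted policy class of the abstract.
    """
    states = np.array([t[0] for t in batch])
    actions = np.array([t[1] for t in batch])
    rewards = np.array([t[2] for t in batch])
    next_states = np.array([t[3] for t in batch])
    sa = np.hstack([states, actions])  # regression inputs: (state, action) pairs

    q_model = None
    best_pi = candidate_policies[0]
    for _ in range(n_iters):
        if q_model is None:
            targets = rewards  # first iteration: Q_0 approximates the immediate reward
        else:
            # Evaluate the current best candidate policy at the next states.
            next_actions = np.array([best_pi(s) for s in next_states])
            q_next = q_model.predict(np.hstack([next_states, next_actions]))
            targets = rewards + gamma * q_next
        q_model = ExtraTreesRegressor(n_estimators=50).fit(sa, targets)

        # Policy-search step: pick the candidate policy that maximizes the
        # average action value over the observed states (in place of the
        # pointwise max over actions used in standard fitted Q-iteration).
        def avg_q(pi):
            acts = np.array([pi(s) for s in states])
            return q_model.predict(np.hstack([states, acts])).mean()

        best_pi = max(candidate_policies, key=avg_q)
    return q_model, best_pi
```

Averaging over the batch states rather than maximizing per state is what makes the policy-improvement step tractable with continuous actions: it requires only evaluations of Q along each candidate policy, not a pointwise optimization over the action space.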