hal-00590972, version 1
Classification-based Policy Iteration with a Critic
Abstract: In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use the critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates and, as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on LSTD and BRM methods. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI on two benchmark reinforcement learning problems.
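The truncation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the rollout estimate is the discounted sum of the first H rewards plus the discounted critic value at the state where the rollout is cut. The names truncated_rollout_estimate, toy_step and toy_policy are hypothetical, and the toy environment (reward 1 at every step) is invented purely for illustration.

```python
def truncated_rollout_estimate(step, policy, critic, state, action, horizon, gamma):
    """Estimate Q(state, action) from one rollout cut after `horizon` steps.

    Assumed form (a sketch, not taken verbatim from the paper):
        sum_{t < H} gamma^t * r_t  +  gamma^H * critic(s_H)
    `step(s, a)` returns (next_state, reward); `policy(s)` returns an action;
    `critic(s)` returns an approximate state value.
    """
    ret = 0.0
    discount = 1.0
    s, a = state, action
    for _ in range(horizon):
        s, r = step(s, a)          # simulate one transition
        ret += discount * r        # accumulate the discounted reward
        discount *= gamma
        a = policy(s)              # follow the rollout policy afterwards
    # The critic stands in for the return that the truncation discards.
    return ret + discount * critic(s)


def toy_step(state, action):
    """Hypothetical deterministic chain: move right, always receive reward 1."""
    return state + 1, 1.0


def toy_policy(state):
    """Hypothetical policy with a single action."""
    return 0
```

In this toy chain with gamma = 0.5 the true value of every state is 1 / (1 - 0.5) = 2. A critic that returns 2 makes the estimate exact at any horizon, whereas a critic that returns 0 (plain truncation) leaves a bias that shrinks as the horizon grows. A short horizon reduces the variance accumulated along the trajectory but leans more heavily on the critic's accuracy, which is the trade-off the abstract refers to.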
- 1: INRIA – CNRS : UMR8146 – Université Lille I - Sciences et technologies – Université Lille III - Sciences humaines et sociales – Ecole Centrale de Lille
- 2: INRIA – CNRS : UMR7503 – Université Henri Poincaré - Nancy I – Université Nancy II – Institut National Polytechnique de Lorraine (INPL)
- Domain : Statistics/Other Statistics
- http://hal.archives-ouvertes.fr/hal-00590972
- oai:hal.archives-ouvertes.fr:hal-00590972
- Submitted on: Thursday, 5 May 2011 20:00:05
- Updated on: Saturday, 7 May 2011 13:00:08

