We would like to thank the reviewers for their useful comments. Just after the submission, we fixed a small error in Theorem 2 (our final result needs to be stated with p=q'=1), slightly reorganized the paper, and improved the discussion of the influence of m. Although we plan to continue working on the paper, we provide this updated version for information.

Rev. 1: Appendix G in the supplementary material contains some empirical evaluations of CBMPI. Following Rev. 3's advice, we may focus more on CBMPI and include these results in the final version of the paper.

Rev. 2:
1. and 2.: We were not aware of the work by Canbolat and Rothblum, and we would like to thank the reviewer for bringing it to our attention. Our error propagation analysis is indeed based on that of Thiery and Scherrer (2010), i.e., the appendix of the article by Thiery and Scherrer that we already cite. With respect to these two works, which we will reference more clearly, the following differences can be underlined:
- While Canbolat and Rothblum only consider the error in the greedy step and Thiery and Scherrer only consider the error in the value update, our work considers both sources of error (this is required for the analysis of CBMPI).
- Compared to the proof of Thiery and Scherrer, we deal here with an extra error term that needs to be propagated carefully.
- Both works provide bounds when the error is controlled in max-norm. Here we consider the more general Lp-norm setting, which is better suited to regression-based approximation. This extension involves a few complications (see Munos (2003, 2007) and Farahmand et al.) that we took care of in our work. On this topic, we consider Definition 1 and Lemma 3 to be contributions of our work, because we believe these results may facilitate deriving performance bounds in Lp-norm.
- At a more technical level, the approximation bound of Canbolat and Rothblum (Th. 2 there) is on the distance ||v*-v_k||, while ours is on the loss v*-v(pi_k).
If we derive a bound on the loss from their results (using, e.g., Th. 1 in their paper), this leads to a bound that is looser than ours. In particular, it does not allow one to recover the standard bounds for AVI/API, as we managed to do.

3. We will address this (along with the minor comments) in the final version of the paper.

4. We are not sure we understand this question. It is standard, and stronger, to state the final performance w.r.t. the optimal value function. Stating the result w.r.t. the value function of the best policy in the policy space amounts to ignoring the approximation error, which reflects the approximation power of the selected function space (policy space).

line 630: The concentrability coefficient involves the Radon-Nikodym derivative, which measures the point-wise mismatch between distributions. We will make this clearer in the final version of the paper.

Rev. 3:
- Regarding the role of m: this was indeed unclear in the version you reviewed, and we apologize. We tried to make it clearer in the updated version of the paper (in particular in Remarks 4 and 5). In Remark 5, assuming a fixed budget at each iteration, we describe the influence of m quantitatively: the bigger m, the smaller the influence of the overall (approximation + estimation) value error, but the bigger the influence of the estimation error of the classifier.
- Regarding the presentation: see our response to Rev. 1.
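For the reviewer's convenience, a concentrability coefficient of the kind alluded to at line 630 can be sketched as follows; this is a generic Munos (2007)-style form, written here only as an illustration of how the Radon-Nikodym derivative captures the point-wise distribution mismatch (the notation, with sampling distribution mu, evaluation distribution rho, and transition kernels P^pi, is assumed and need not match the paper's exactly):

```latex
% Sketch, assuming standard notation (Munos, 2007):
% \mu = sampling distribution, \rho = evaluation distribution,
% P^{\pi} = transition kernel of policy \pi, \gamma = discount factor.
c(j) \;=\; \sup_{\pi_1,\dots,\pi_j}
  \left\| \frac{d\!\left(\rho\, P^{\pi_1} \cdots P^{\pi_j}\right)}{d\mu} \right\|_{\infty},
\qquad
C \;=\; (1-\gamma)^2 \sum_{j \ge 1} j\,\gamma^{\,j-1}\, c(j).
```

Here c(j) measures, point-wise via the Radon-Nikodym derivative, how much the j-step future state distribution under the worst sequence of policies can deviate from the sampling distribution.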
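The trade-off in m described for Remark 5 can be illustrated with a toy computation. The error shapes below are purely hypothetical stand-ins (a geometric gamma^m decay for the value-error term and a sqrt(m/B) growth for the classifier's estimation error under a fixed budget B); the constants and functional forms are illustrative assumptions, not the paper's actual bounds:

```python
import math

GAMMA = 0.95     # hypothetical discount factor
BUDGET = 10_000  # hypothetical fixed per-iteration sample budget B

def value_error(m: int) -> float:
    # Hypothetical shape: the influence of the overall
    # (approximation + estimation) value error shrinks with m.
    return GAMMA ** m

def classifier_error(m: int) -> float:
    # Hypothetical shape: with a fixed budget, longer rollouts leave
    # fewer samples per state, so the classifier's estimation
    # error grows with m.
    return math.sqrt(m / BUDGET)

def total_error(m: int) -> float:
    return value_error(m) + classifier_error(m)

# The two opposing terms yield an intermediate optimal m.
best_m = min(range(1, 101), key=total_error)
```

Running this shows the qualitative picture from Remark 5: the total error is minimized at an intermediate rollout length, since increasing m trades value error for classifier estimation error.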