P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), 2004.
DOI : 10.1145/1015330.1015430

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.92

R. Akrour, M. Schoenauer, and M. Sebag. Preference-Based Policy Learning. In ECML PKDD 2011, volume 6911 of LNCS, pp. 12-27, 2011.
DOI : 10.1007/978-3-642-23780-5_11

URL : https://hal.archives-ouvertes.fr/inria-00625001

R. Akrour, M. Schoenauer, and M. Sebag. APRIL: Active Preference Learning-Based Reinforcement Learning. In ECML/PKDD, pp. 116-131, 2012.
DOI : 10.1007/978-3-642-33486-3_8

URL : https://hal.archives-ouvertes.fr/hal-00722744

R. Akrour et al. Programming by Feedback. In Int. Conf. on Machine Learning (ICML), ACM Int. Conf. Proc. Series, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00980839

András Antos, Rémi Munos, and Csaba Szepesvári. Fitted Q-iteration in continuous action-space MDPs. In NIPS, 2007.

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89-129, 2008.

Christopher G. Atkeson and Stefan Schaal. Learning tasks from a single demonstration. In Proceedings of the International Conference on Robotics and Automation, pp. 1706-1712, 1997.
DOI : 10.1109/ROBOT.1997.614389

Christopher G. Atkeson and Stefan Schaal. Robot Learning From Demonstration. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), pp. 12-20, 1997.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47:235-256, 2002.

A. Auger. Convergence Results for the (1,λ)-SA-ES using the Theory of φ-irreducible Markov Chains. Theoretical Computer Science, 334(1-3):35-69, 2005.

M. Bain and C. Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pp. 103-129, 1995.

G. Bakir et al. Machine learning with structured outputs, 2006.

R. Bellman and S. E. Dreyfus. Applied Dynamic Programming. Princeton University Press, 1962.
DOI : 10.1515/9781400874651

R. Bellman. Dynamic Programming. Princeton University Press, 1957.

C. Bergeron et al. Multiple instance ranking. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 48-55, 2008.
DOI : 10.1145/1390156.1390163

A. Billard and D. Grollman. Robot learning by demonstration. Scholarpedia, 8(12):3824, 2013.

H. Bou-Ammar et al. Automatically Mapped Transfer between Reinforcement Learning Tasks via Three-Way Restricted Boltzmann Machines. In ECML/PKDD, volume 8189 of Lecture Notes in Computer Science, pp. 449-464, 2013.

S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33-57, 1996.

Eric Brochu, Nando de Freitas, and Abhijeet Ghosh. Active Preference Learning with Discrete Choice Data. In Proc. NIPS, pp. 409-416, 2008.

Eric Brochu, Tyson Brochu, and Nando de Freitas. A Bayesian Interactive Optimization Approach to Procedural Animation Design. In Z. Popovic and M. A. Otaduy, editors, Symposium on Computer Animation, pp. 103-112. Eurographics Association, 2010.

T. C. Brown and G. L. Peterson. An Enquiry Into the Method of Paired Comparison: Reliability, Scaling, and Thurstone's Law of Comparative Judgment. Gen. Tech. Rep., United States Department of Agriculture, 2009.

E. Brunskill and L. Li. PAC-inspired Option Discovery in Lifelong Reinforcement Learning. In Proc. ICML 2014, JMLR Proceedings. JMLR.org, 2014.

C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013.

M. Cakmak and A. L. Thomaz. Designing robot learners that ask good questions. In Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction (HRI '12), pp. 17-24, 2012.
DOI : 10.1145/2157689.2157693

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.650.7240

W. Chu and Z. Ghahramani. Extensions of Gaussian processes for ranking: semi-supervised and active learning. In NIPS Workshop on Learning to Rank, 2005.

W. Chu and Z. Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pp. 137-144, 2005.
DOI : 10.1145/1102351.1102369

Adam Coates, Pieter Abbeel, and Andrew Y. Ng. Learning for control from multiple demonstrations. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 144-151, 2008.
DOI : 10.1145/1390156.1390175

Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20(3):273-297, 1995.

Ofer Dekel, Shai Shalev-Shwartz, and Yoram Singer. The Forgetron: A Kernel-Based Perceptron on a Budget. SIAM J. Comput., 37:1342-1372, 2008.

Pierre Delarboulas, Marc Schoenauer, and Michèle Sebag. Open-Ended Evolutionary Robotics: An Information Theoretic Approach. In Parallel Problem Solving from Nature (PPSN XI), volume 6238 of Lecture Notes in Computer Science, pp. 334-343, 2010.

R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

M. D. Erbas et al. Towards imitation-enhanced Reinforcement Learning in multi-agent systems. In 2011 IEEE Symposium on Artificial Life (ALIFE), pp. 6-13, 2011.
DOI : 10.1109/ALIFE.2011.5954652

M. D. Erbas et al. Embodied imitation-enhanced reinforcement learning in multi-agent systems. Adaptive Behavior.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Iteratively Extending Time Horizon Reinforcement Learning. In Proceedings of the 14th European Conference on Machine Learning, pp. 96-107, 2003.
DOI : 10.1007/978-3-540-39857-8_11

URL : http://orbi.ulg.ac.be/jspui/handle/2268/9361

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6:503-556, 2005.

A. M. Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized Policy Iteration. In NIPS, pp. 441-448, 2008.

A. Fern, S. Yoon, and R. Givan. Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes. J. Artif. Intell. Res., 25:75-118, 2006.

R. Fonteneau, S. A. Murphy, L. Wehenkel, and D. Ernst. A Cautious Approach to Generalization in Reinforcement Learning. In Proc. ICAART, 2010.

Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective Sampling Using the Query by Committee Algorithm. Machine Learning, 28:133-168, 1997.

Sylvain Gelly and David Silver. Combining Online and Offline Knowledge in UCT. In International Conference on Machine Learning, 2007.

Alborz Geramifard, Michael Bowling, Martin Zinkevich, and Richard S. Sutton. iLSTD: Eligibility Traces and Convergence Analysis. In Advances in Neural Information Processing Systems 19, pp. 441-448, 2007.

Mohammad Ghavamzadeh, Alessandro Lazaric, Odalric-Ambrym Maillard, and Rémi Munos. LSTD with Random Projections. In Advances in Neural Information Processing Systems 23, pp. 721-729, 2010.

S. Griffith, K. Subramanian, J. Scholz, C. L. Isbell, and A. L. Thomaz. Policy Shaping: Integrating Human Feedback with Reinforcement Learning. In Burges et al. [Burges et al. 2013].

R. Groß, M. Bonani, F. Mondada, and M. Dorigo. Autonomous Self-Assembly in Swarm-Bots. IEEE Transactions on Robotics, 22(6):1115-1130, 2006.
DOI : 10.1109/TRO.2006.882919

N. Hansen and A. Ostermeier. Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation, 9(2):159-195, 2001.

Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes Point Machines. Journal of Machine Learning Research, 1:245-279, 2001.

M. Herdy. Evolution strategies with subjective selection. In Parallel Problem Solving from Nature (PPSN IV), 1996.
DOI : 10.1007/3-540-61723-X_966

T. Hester, M. Quinlan, and P. Stone. RTMBA: A Real-Time Model-Based Reinforcement Learning Architecture for robot control. In 2012 IEEE International Conference on Robotics and Automation (ICRA), pp. 85-90, 2012.
DOI : 10.1109/ICRA.2012.6225072

W. E. Hockley. Analysis of response time distributions in the study of cognitive processes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10(4):598-615, 1984.
DOI : 10.1037/0278-7393.10.4.598

H. H. Hoos. Programming by optimization. Communications of the ACM, 55(2):70-80, 2012.
DOI : 10.1145/2076450.2076469

E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label Ranking by Learning Pairwise Preferences. Artificial Intelligence, 172(16-17):1897-1916, 2008.

A. Jain, B. Wojcik, T. Joachims, and A. Saxena. Learning Trajectory Preferences for Manipulators via Iterative Improvement. In Burges et al. [Burges et al. 2013], pp. 575-583.

T. Jaksch, R. Ortner, and P. Auer. Near-optimal Regret Bounds for Reinforcement Learning. J. Mach. Learn. Res., 11:1563-1600, 2010.

D. R. Jones, M. Schonlau, and W. J. Welch. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13(4):455-492, 1998.

M. Kääriäinen. Lower Bounds for Reductions. In Atomic Learning Workshop, 2006.

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

Sham Kakade and Ambuj Tewari. On the Generalization Ability of Online Strongly Convex Programming Algorithms. In NIPS, pp. 801-808, 2008.

K. Kaneko and I. Tsuda. Complex Systems: Chaos and Beyond, 2000.
DOI : 10.1007/978-3-642-56861-9

M. Khamassi, L. Lachèze, B. Girard, A. Berthoz, and A. Guillot. Actor-Critic models of reinforcement learning in the basal ganglia: from natural to artificial rats. Adaptive Behavior, 13(2):131-148, 2005.

B. Kim, A.-m. Farahmand, J. Pineau, and D. Precup. Learning from Limited Demonstrations. In Burges et al. [Burges et al. 2013], pp. 2859-2867.

W. B. Knox and P. Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture (K-CAP '09), pp. 9-16, 2009.
DOI : 10.1145/1597735.1597738

W. B. Knox and P. Stone. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In W. van der Hoek, G. A. Kaminka, Y. Lespérance, et al., editors, Proc. AAMAS, pp. 5-12, 2010.

W. B. Knox and P. Stone. Reinforcement learning from simultaneous human and MDP reward. In Proc. AAMAS, IFAAMAS, pp. 475-482, 2012.

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.294.1705

W. B. Knox, P. Stone, and C. Breazeal. Training a Robot via Human Feedback: A Case Study. In Int. Conf. on Social Robotics, pp. 460-470, 2013.
DOI : 10.1007/978-3-319-02675-6_46

Levente Kocsis and Csaba Szepesvári. Bandit Based Monte-Carlo Planning. In ECML, 2006.

D. Koller and R. Parr. Policy Iteration for Factored MDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-00), pp. 326-334, 2000.

J. Z. Kolter, P. Abbeel, and A. Y. Ng. Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion. In NIPS, 2007.

Michail G. Lagoudakis and Ronald Parr. Least-Squares Policy Iteration. Journal of Machine Learning Research (JMLR), 4:1107-1149, 2003.

Michail G. Lagoudakis and Ronald Parr. Reinforcement Learning as Classification: Leveraging Modern Classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, pp. 424-431, 2003.

S. Lange, M. Riedmiller, and A. Voigtländer. Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1-8, 2012.
DOI : 10.1109/IJCNN.2012.6252823

Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp. Off-Road Obstacle Avoidance through End-to-End Learning. In Advances in Neural Information Processing Systems 18 (NIPS), 2006.

J. Lehman and K. O. Stanley. Exploiting Open-Endedness to Solve Problems through the Search for Novelty. In Proc. of the Eleventh International Conference on Artificial Life (ALIFE XI), pp. 329-336, 2008.

S. Levine, Z. Popovic, and V. Koltun. Feature Construction for Inverse Reinforcement Learning. In NIPS 23, pp. 1342-1350, 2010.

Shiau Hong Lim and Peter Auer. Autonomous Exploration For Navigating In MDPs. In S. Mannor et al., editors, Proc. COLT, 2012.

H. Lipson et al. Evolutionary Robotics for Legged Machines: From Simulation to Physical Reality. In IAS, pp. 11-18, 2006.

M. L. Littman, R. S. Sutton, and S. Singh. Predictive Representations of State. In Neural Information Processing Systems, pp. 1555-1561, 2002.

Wenguo Liu and Alan F. T. Winfield. Modeling and Optimization of Adaptive Foraging in Swarm Robotic Systems. The International Journal of Robotics Research, 29(14):1743-1760, 2010.
DOI : 10.1177/0278364910375139

C. Liu et al. Locomotion control of quadruped robots based on CPG-inspired workspace trajectory generation, 2011.

D. J. Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, 2008.

A. Lörincz et al. Mind Model Seems Necessary for the Emergence of Communication. Neural Information Processing - Letters and Reviews, 11(4-6):109-121, 2007.

R. D. Luce. Individual Choice Behavior, 1959.

H. H. Lund, O. Miglino, L. Pagliarini, A. Billard, and A. Ijspeert. Evolutionary Robotics: A Children's Game. In Proceedings of the IEEE 5th International Conference on Evolutionary Computation, pp. 154-158, 1998.

H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton. Toward Off-Policy Learning Control with Function Approximation. In ICML, pp. 719-726. Omnipress, 2010.

Francis Maes, Ludovic Denoyer, and Patrick Gallinari. Structured prediction with reinforcement learning. Machine Learning, 77(2-3):271-301, 2009.
DOI : 10.1007/s10994-009-5140-8

URL : https://hal.archives-ouvertes.fr/hal-01172474

M. Milani Fard, Y. Grinberg, J. Pineau, and D. Precup. Bellman Error Based Feature Generation using Random Projections on Sparse Spaces. In Advances in Neural Information Processing Systems 26, pp. 3030-3038, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602, 2013.

R. Munos. Error Bounds for Approximate Policy Iteration. In ICML, pp. 560-567, 2003.

A. Y. Ng and S. Russell. Algorithms for Inverse Reinforcement Learning. In Proc. 17th ICML, pp. 663-670, 2000.

A. Y. Ng, D. Harada, and S. Russell. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. In Ivan Bratko and Saso Dzeroski, editors, Proc. ICML, pp. 278-287, 1999.

P. O'Dowd, A. F. T. Winfield, and M. Studley. The distributed co-evolution of an embodied simulator and controller for swarm robot behaviours. In Proc. IROS, pp. 4995-5000, 2011.

J. K. O'Regan and A. Noë. A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24(5):939-973, 2001.

P.-Y. Oudeyer, A. Baranes, and F. Kaplan. Intrinsically Motivated Exploration for Developmental and Active Sensorimotor Learning. In From Motor Learning to Interaction Learning in Robots, volume 264 of Studies in Computational Intelligence. Springer, 2010.
DOI : 10.1007/978-3-642-05181-4_6

Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008.
DOI : 10.1016/j.neunet.2008.02.003

J. Pineau, A. Guez, R. Vincent, G. Panuccio, and M. Avoli. Treating epilepsy via adaptive neurostimulation: a reinforcement learning approach. International Journal of Neural Systems, 19(4):227-240, 2009.
DOI : 10.1142/S0129065709001987

T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481-1497, 1990.
DOI : 10.1109/5.58326

D. A. Pomerleau. ALVINN: An Autonomous Land Vehicle In a Neural Network. In Advances in Neural Information Processing Systems 1, 1989.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.

Jette Randløv and Preben Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proc. 15th Intl. Conf. on Machine Learning, pp. 463-471, 1998.

M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient Learning of Sparse Representations with an Energy-Based Model. In NIPS, pp. 1137-1144, 2006.

N. Ratliff, J. A. Bagnell, and M. Zinkevich. Maximum margin planning. In ICML, pp. 729-736, 2006.

Nathan Ratliff. Learning to Search: Structured Prediction Techniques for Imitation Learning. PhD thesis, Carnegie Mellon University, 2009.

Byron Reeves and Clifford Nass. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places, 1996.

A. Saxena, J. Driemeyer, and A. Y. Ng. Robotic Grasping of Novel Objects using Vision. The International Journal of Robotics Research, 27(2), 2008.
DOI : 10.1177/0278364907087172

J. Secretan et al. Picbreeder: A Case Study in Collaborative Evolutionary Exploration of Design Space. Evolutionary Computation, 19(3):373-403, 2011.

S. Seo, M. Wallat, T. Graepel, and K. Obermayer. Gaussian Process Regression: Active Data Selection and Test Point Rejection. In IJCNN (3), pp. 241-246, 2000.
DOI : 10.1007/978-3-642-59802-9_4

R. N. Shepard. Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika, 22(4):325-345, 1957.
DOI : 10.1007/BF02288967

Sidney Siegel and N. John Castellan. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, second edition, 1988.

S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning, pp. 287-308, 1998.

B. F. Skinner. How to Teach Animals. Scientific American, 185:26-29, 1951.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In NIPS, pp. 2960-2968, 2012.

E. J. Sondik. The Optimal Control of Partially Observable Markov Decision Processes. PhD thesis, Stanford University, 1971.

K. O. Stanley, D. B. D'Ambrosio, and J. Gauci. A Hypercube-Based Encoding for Evolving Large-Scale Neural Networks. Artificial Life, 15(2):185-212, 2009.

T. Stirling, S. Wischmann, and D. Floreano. Energy-efficient indoor search by swarms of simulated flying robots without global information. Swarm Intelligence, 4(2):117-143, 2010.

A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pp. 881-888, 2006.
DOI : 10.1145/1143844.1143955

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.326

Freek Stulp and Olivier Sigaud. Robot Skill Learning: From Reinforcement Learning to Evolution Strategies. Paladyn, Journal of Behavioral Robotics, 4(1):49-61, 2013.

Y. Suga, D. Nagao, S. Sugano, and T. Ogata. Interactive evolution of human-robot communication in real world. In Proc. IEEE/RSJ IROS'05, p. 1438, 2005.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211, 1999.
DOI : 10.1016/S0004-3702(99)00052-1

R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In A. P. Danyluk, L. Bottou, and M. L. Littman, editors, ICML, 2009.

U. Syed and R. E. Schapire. A Game-Theoretic Approach to Apprenticeship Learning. In NIPS, 2007.

H. Takagi. Interactive evolutionary computation: fusion of the capabilities of EC optimization and human evaluation. Proceedings of the IEEE, 89(9):1275-1296, 2001.
DOI : 10.1109/5.949485

Andrea L. Thomaz and Cynthia Breazeal. Reinforcement Learning with Human Teachers: Evidence of Feedback and Guidance with Implications for Learning Performance. In Proceedings of the 21st National Conference on Artificial Intelligence, pp. 1000-1005, 2006.

E. L. Thorndike. Animal Intelligence, 1911.

V. Trianni and M. Dorigo. Cooperative hole avoidance in a swarm-bot. Robotics and Autonomous Systems, 54(2):97-103, 2006.
DOI : 10.1016/j.robot.2005.09.018

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research, 6:1453-1484, 2005.

P. Vamplew et al. Constructing Stochastic Mixture Policies for Episodic Multiobjective Reinforcement Learning Tasks. In Australasian Conference on Artificial Intelligence, pp. 340-349, 2009.
DOI : 10.1007/978-3-642-10439-8_35

P. Viappiani and C. Boutilier. Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets. In NIPS, pp. 2352-2360, 2010.

E. Wasserstrom. Numerical Solutions by the Continuation Method. SIAM Review, 15(1):89-119, 1973.
DOI : 10.1137/1015003

S. Whiteson and P. Stone. Evolutionary Function Approximation for Reinforcement Learning. Journal of Machine Learning Research, 7:877-917, 2006.

S. Whiteson, M. E. Taylor, and P. Stone. Critical Factors in the Empirical Performance of Temporal Difference and Evolutionary Methods for Reinforcement Learning. Autonomous Agents and Multi-Agent Systems, 21(1), 2010.

M. Wiering and M. van Otterlo, editors. Reinforcement Learning: State-of-the-Art. Adaptation, Learning, and Optimization. Springer, 2012.
DOI : 10.1007/978-3-642-27645-3

A. Wilson, A. Fern, and P. Tadepalli. A Bayesian Approach for Policy Learning from Trajectory Preference Queries. In NIPS, pp. 1142-1150, 2012.

Y. LeCun et al. Handwritten Digit Recognition with a Back-Propagation Network. In NIPS, pp. 396-404, 1989.

Yisong Yue and Thorsten Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), 2009.
DOI : 10.1145/1553374.1553527

Y. Zhao, M. R. Kosorok, and D. Zeng. Reinforcement learning design for cancer clinical trials. Statistics in Medicine, 28:3294-3315, 2009.