Contents (partial)

Influence-balancing
4.6.2 Character-level Penn Treebank language model

Can Recurrent Neural Networks Warp Time?
  From time warping invariance to gating
  Experiments

Framework
6.1.2 Return, state value function and goal formulation
Reinforcement learning
6.2.1 Policy improvement, Q-learning and SARSA
Contributions

Reproducing Recurrent World Models Facilitate Policy Evolution
  Methods
  7.1.1 Reproducibility of the original results
  7.1.4 Mixture Density Recurrent Neural Network (MDN-RNN) training
  7.1.5 Controller training with CMA-ES
  Results
  Conclusion

There is No Q-Function in Continuous Time
  Reinforcement Learning with a Continuous-Time Limit
  Experiments

Bibliography

T. Cooijmans and J. Martens, On the variance of unbiased online recurrent optimization, 2019.

K. Doya, Reinforcement learning in continuous time and space, Neural computation, vol.12, issue.1, pp.219-245, 2000.

J. Duchi et al., Adaptive subgradient methods for online learning and stochastic optimization, 2010.

S. El Hihi and Y. Bengio, Hierarchical recurrent neural networks for long-term dependencies, Proceedings of the 8th International Conference on Neural Information Processing Systems, NIPS'95, pp.493-499, 1995.

L. Espeholt et al., IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, 2018.

O. Ganea et al., Advances in Neural Information Processing Systems, vol.31, pp.5345-5355, 2018.

F. A. Gers and J. Schmidhuber, Recurrent nets that time and count, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, vol.3, pp.189-194, 2000.

F. A. Gers and J. Schmidhuber, Long short-term memory learns context free and context sensitive languages, Artificial Neural Nets and Genetic Algorithms, pp.134-137, 2001.

F. A. Gers et al., Learning to forget: Continual prediction with LSTM, 1999.

F. A. Gers et al., Learning to forget: Continual prediction with LSTM, Neural Comput, vol.12, issue.10, pp.2451-2471, 2000.

A. Graves, Generating sequences with recurrent neural networks, 2013.

A. Graves et al., Speech recognition with deep recurrent neural networks, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.6645-6649, 2013.

A. Graves et al., Neural Turing machines, 2014.

A. Graves et al., Hybrid computing using a neural network with dynamic external memory, Nature, vol.538, issue.7626, p.471, 2016.

A. Gruslys et al., Memory-efficient backpropagation through time, CoRR, abs/1606.03401, pp.4125-4133, 2016.

D. Ha and J. Schmidhuber, Recurrent world models facilitate policy evolution, 2018.

N. Hansen et al., CMA-ES/pycma on GitHub, 2019.

N. Hansen and A. Auger, Evolution strategies and CMA-ES (covariance matrix adaptation), Genetic and Evolutionary Computation Conference, GECCO '14, pp.513-534, 2014.

K. Helfrich et al., Orthogonal recurrent neural networks with scaled Cayley transform, International Conference on Machine Learning, pp.2034-2042, 2016.

P. Henderson et al., Deep reinforcement learning that matters, 2017.

M. Hessel et al., Rainbow: Combining improvements in deep reinforcement learning, CoRR, abs/1710.02298, 2017.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Comput, vol.9, issue.8, pp.1735-1780, 1997.

M. Jaderberg et al., Decoupled neural interfaces using synthetic gradients, 2016.

H. Jaeger, Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach, 2002.

H. Jaeger et al., Optimization and Applications of Echo State Networks with Leaky-Integrator Neurons, Neural Networks, vol.20, issue.3, pp.335-352, 2007.

L. Jing et al., Gated orthogonal recurrent units: On learning to forget, Neural computation, vol.31, issue.4, pp.765-783, 2019.

R. Jozefowicz et al., An empirical exploration of recurrent network architectures, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp.2342-2350, 2015.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

D. P. Kingma and M. Welling, Auto-encoding variational Bayes, 2013.

J. Koutnik et al., A clockwork RNN, 2014.

G. Lample et al., Neural architectures for named entity recognition, 2016.

Y. LeCun et al., Object recognition with gradient-based learning, Shape, contour and grouping in computer vision, pp.823-823, 1999.

T. Lillicrap et al., Continuous control with deep reinforcement learning, CoRR, 2015.

T. Lucas et al., Mixed batches and symmetric discriminators for GAN training, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01791126

M.-T. Luong et al., Effective approaches to attention-based neural machine translation, 2015.

W. Maass et al., Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Comput, vol.14, issue.11, pp.2531-2560, 2002.

M. Mahoney, Large text compression benchmark, 2011.

M. Marcus et al., Building a large annotated corpus of English: The Penn Treebank, Computational Linguistics, vol.19, issue.2, pp.313-330, 1993.

J. Martens and I. Sutskever, Learning recurrent neural networks with Hessian-free optimization, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp.1033-1040, 2011.

P. Massé, Autour de l'usage des gradients en apprentissage statistique [On the use of gradients in statistical learning], 2017.

Z. Mhammedi et al., Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections, Proceedings of the 34th International Conference on Machine Learning, vol.70, pp.2401-2409, 2017.

T. Mikolov et al., Subword language modeling with neural networks, 2012.

V. Mnih et al., Asynchronous methods for deep reinforcement learning, International conference on machine learning, pp.1928-1937, 2016.

V. Mnih et al., Human-level control through deep reinforcement learning, Nature, vol.518, issue.7540, p.529, 2015.

J. R. Movellan et al., A Monte Carlo EM approach for partially observable diffusion processes: Theory and applications to neural networks, Neural Comput, vol.14, issue.7, pp.1507-1544, 2002.

M. C. Mozer, Induction of multiscale temporal structure, Advances in neural information processing systems, pp.275-282, 1992.

A. Mujika et al., Approximating real-time recurrent learning with random Kronecker factors, Advances in Neural Information Processing Systems, pp.6594-6603, 2018.

R. Munos and P. Bourgine, Reinforcement learning for continuous stochastic control problems, Advances in neural information processing systems, pp.1029-1035, 1998.

Y. Ollivier et al., Training recurrent networks online without backtracking, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01228954

OpenAI, Learning dexterous in-hand manipulation, 2018.

OpenAI, OpenAI Five, 2018.

R. Pascanu et al., Understanding the exploding gradient problem, 2012.

B. A. Pearlmutter, Gradient calculations for dynamic recurrent neural networks: A survey, IEEE Transactions on Neural Networks, vol.6, issue.5, pp.1212-1228, 1995.

T. Salimans et al., Evolution strategies as a scalable alternative to reinforcement learning, 2017.

A. Saxe et al., Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013.

J. Schulman et al., Trust region policy optimization, International Conference on Machine Learning, pp.1889-1897, 2015.

J. Schulman et al., Proximal policy optimization algorithms, 2017.

H. T. Siegelmann and E. D. Sontag, On the computational power of neural nets, Journal of Computer and System Sciences, vol.50, pp.132-150, 1995.

D. Silver et al., Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017.

P. Simard et al., Tangent prop - a formalism for specifying selected invariances in an adaptive network, 1991.

R. S. Sutton and A. G. Barto, Introduction to reinforcement learning, vol.135, 1998.

K. Takamura and S. Yamane, Improving minimal gated unit for sequential data, 2019.

C. Tallec et al., Making deep Q-learning methods robust to time discretization, 2019.

C. Tallec and Y. Ollivier, Unbiased online recurrent optimization, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01972587

C. Tallec and Y. Ollivier, Unbiasing truncated backpropagation through time, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01660627

C. Tallec and Y. Ollivier, Can recurrent neural networks warp time?, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01812064

G. Tesauro, Temporal difference learning and TD-Gammon, Communications of the ACM, vol.38, issue.3, pp.58-68, 1995.

T. Tieleman and G. Hinton, Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, vol.4, pp.26-31, 2012.

G. E. Uhlenbeck and L. S. Ornstein, On the theory of the Brownian motion, Physical review, vol.36, issue.5, p.823, 1930.

A. Vaswani et al., Attention is all you need, Advances in neural information processing systems, pp.5998-6008, 2017.

P. Werbos, Backpropagation through time: what it does and how to do it, Proceedings of the IEEE, vol.78, pp.1550-1560, 1990.

J. Weston et al., Memory networks, 2014.

R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine learning, vol.8, issue.3-4, pp.229-256, 1992.

R. J. Williams and D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput, vol.1, issue.2, pp.270-280, 1989.

S. Wisdom et al., Full-capacity unitary recurrent neural networks, Advances in Neural Information Processing Systems, vol.29, pp.4880-4888, 2016.

A. Zhang et al., Natural environment benchmarks for reinforcement learning, 2018.

J. G. Zilly et al., Recurrent highway networks, 2016.

On certain environments, network inputs are normalized by applying a mean-std normalization, with mean and standard deviation computed on each individual input feature (a short illustrative sketch of these settings follows the list below).

- D is a cyclic buffer of size 1000000.

- nb_steps is set to 10, and 256 environments are run in parallel to accelerate the training procedure, totalling 2560 environment interactions between learning steps.

- The physical discount factor γ is set to 0.8. It is always scaled as γ^δt (even for unscaled DQN and DDPG).

- N, the batch size, is set to 256.

- RMSprop is used as the optimizer, without momentum and with its smoothing coefficient set to 1 - δt (or 1 - δt₀ for unscaled DDPG and DQN).

- Exploration is always performed as described in the main text. The OU process uses parameters θ = 7.5 and σ = 1.

- Unless otherwise stated, continuous action environments are marked with a (C), discrete action environments with a (D).

- Ant: State normalization is used. Discretization range

- Cheetah: State normalization is used. Discretization range
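The discretization-dependent choices above (discount scaled as γ^δt, RMSprop smoothing of 1 - δt, OU exploration noise, per-feature input normalization) can be summarized in a short self-contained sketch. This is only an illustration of the settings listed here, not the thesis code: the names scaled_discount, rmsprop_smoothing, OUNoise and RunningNorm are illustrative, and the OU symbols θ and σ follow the usual DDPG convention.

import numpy as np

# Hyperparameters recovered from the list above (delta_t is the environment
# time discretization; all names are illustrative).
GAMMA = 0.8              # physical discount factor
BUFFER_SIZE = 1_000_000  # size of the cyclic replay buffer D
BATCH_SIZE = 256         # N
NB_STEPS = 10
NB_PARALLEL_ENVS = 256

def scaled_discount(delta_t):
    # Per-step discount: the physical gamma is always scaled as gamma**delta_t.
    return GAMMA ** delta_t

def rmsprop_smoothing(delta_t):
    # RMSprop smoothing coefficient, set to 1 - delta_t.
    return 1.0 - delta_t

class OUNoise:
    # Ornstein-Uhlenbeck exploration noise; theta = 7.5, sigma = 1 as above.
    def __init__(self, dim, delta_t, theta=7.5, sigma=1.0):
        self.dt, self.theta, self.sigma = delta_t, theta, sigma
        self.state = np.zeros(dim)

    def sample(self):
        # Euler-Maruyama step of dX = -theta * X * dt + sigma * dW.
        self.state += (-self.theta * self.state * self.dt
                       + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape))
        return self.state.copy()

class RunningNorm:
    # Per-feature mean-std input normalization (used on certain environments).
    def __init__(self, dim, eps=1e-8):
        self.mean, self.var, self.count = np.zeros(dim), np.ones(dim), eps

    def update(self, batch):
        # Running update of per-feature mean and variance from a batch of inputs.
        batch = np.atleast_2d(batch)
        b_mean, b_var, b_n = batch.mean(0), batch.var(0), batch.shape[0]
        delta, tot = b_mean - self.mean, self.count + b_n
        self.mean = self.mean + delta * b_n / tot
        self.var = (self.var * self.count + b_var * b_n
                    + delta ** 2 * self.count * b_n / tot) / tot
        self.count = tot

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

if __name__ == "__main__":
    dt = 0.01
    print("discount per step:", scaled_discount(dt))
    print("RMSprop smoothing:", rmsprop_smoothing(dt))
    noise = OUNoise(dim=6, delta_t=dt)
    print("exploration noise sample:", noise.sample())

For a given environment time step δt, these helpers give the per-step discount and optimizer smoothing listed above; the unscaled DQN and DDPG baselines would use the reference δt₀ in place of δt for the smoothing coefficient.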