4.6.2 Character-level Penn Treebank language model
Can Recurrent Neural Networks Warp Time?
From time warping invariance to gating
6.1.2 Return, state value function and goal formulation
6.2.1 Policy improvement: Q-learning and SARSA
Reproducing Recurrent World Models Facilitate Policy Evolution
7.1.1 Reproducibility of the original results
7.1.4 Mixture Density Recurrent Neural Network (MDN-RNN) training
7.1.5 Controller training with CMA-ES
There is No Q-Function in Continuous Time
Reinforcement Learning with a Continuous-Time
- On certain environments, network inputs are normalized by applying mean-std normalization, with the mean and standard deviation computed on each individual input feature.
- D, the replay memory, is a cyclic buffer of size 1000000.
- nb_steps is set to 10, and 256 environments are run in parallel to accelerate the training procedure, totalling 2560 environment interactions between learning steps.
- The physical discount γ is set to 0.8. It is always scaled as γ^δt (even for unscaled DQN and DDPG); see the sketch after this list.
- N, the batch size, is set to 256.
- RMSprop is used as the optimizer, without momentum and with α = 1 − δt (or 1 − δt_0 for unscaled DDPG and DQN).
- Exploration is always performed as described in the main text. The OU process uses parameters θ = 7.5 and σ = 1.
- Unless otherwise stated, α_1 := α_Q. Continuous action environments are marked with a (C), discrete action environments with a (D).
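Several of these settings scale explicitly with the time discretization δt. The sketch below (Python, with hypothetical names; illustrative values for anything not stated above) shows how the scaled discount γ^δt, the RMSprop smoothing constant 1 − δt, the cyclic replay buffer, and a discretized OU exploration process could be wired up. The Euler-Maruyama step for the OU noise is a standard choice, not necessarily the exact scheme used here.

```python
import numpy as np
import torch
from collections import deque

dt = 0.01             # time discretization delta t (illustrative value)
gamma_physical = 0.8  # physical discount factor, as stated above

# Discount scaled to the discretization: gamma^dt.
gamma = gamma_physical ** dt

# D: cyclic replay buffer of size 1000000 (oldest entries are evicted).
replay_buffer = deque(maxlen=1_000_000)

# RMSprop without momentum; its smoothing constant (PyTorch's `alpha`)
# is set to 1 - dt, as stated above.
q_network = torch.nn.Linear(4, 2)  # stand-in for the actual network
optimizer = torch.optim.RMSprop(q_network.parameters(),
                                alpha=1 - dt, momentum=0.0)

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise, discretized with step dt."""

    def __init__(self, size, theta=7.5, sigma=1.0, dt=dt):
        # theta and sigma follow the values given in the list above.
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.state = np.zeros(size)

    def sample(self):
        # Euler-Maruyama step: dx = -theta * x * dt + sigma * sqrt(dt) * N(0, 1)
        self.state += (-self.theta * self.state * self.dt
                       + self.sigma * np.sqrt(self.dt)
                       * np.random.randn(*self.state.shape))
        return self.state.copy()
```

At action-selection time, a noise sample would be added to the policy output (for DDPG-style agents) before the action is applied to the environment.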
State normalization is used, and a discretization range is specified per environment; the normalization is sketched below.
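The state normalization above computes a mean and standard deviation per input feature. Below is a minimal sketch, assuming running (Welford-style) per-feature statistics; the class name is hypothetical, and the text may instead compute the statistics once over a fixed batch of observations.

```python
import numpy as np

class RunningNormalizer:
    """Per-feature mean-std normalization with running statistics."""

    def __init__(self, num_features, eps=1e-8):
        self.mean = np.zeros(num_features)
        self.m2 = np.zeros(num_features)  # running sum of squared deviations
        self.count = 0
        self.eps = eps

    def update(self, x):
        # Welford's online update, one observation at a time.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = np.sqrt(self.m2 / max(self.count, 1))
        return (x - self.mean) / (std + self.eps)

# Usage: update on every observed state, normalize before feeding the network.
norm = RunningNormalizer(num_features=4)
for state in np.random.randn(100, 4):  # stand-in for environment states
    norm.update(state)
obs = norm.normalize(np.random.randn(4))
```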