S. Amari, Natural gradient works efficiently in learning, Neural Computation, vol.10, issue.2, pp.251-276, 1998.

M. Andrychowicz, M. Denil, S. G. Colmenarejo, M. W. Hoffman, D. Pfau et al., Learning to learn by gradient descent by gradient descent, Advances in Neural Information Processing Systems, pp.3981-3989, 2016.

Y. Bengio, P. Simard, and P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, issue.2, pp.157-166, 1994.

L. Bottou and O. Bousquet, The tradeoffs of large scale learning, Advances in Neural Information Processing Systems, pp.161-168, 2008.

R. Collobert, K. Kavukcuoglu, and C. Farabet, Torch7: A Matlab-like environment for machine learning, BigLearn, NIPS Workshop, 2011.

A. Defazio, F. R. Bach, and S. Lacoste-Julien, SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems, pp.1646-1654, 2014.

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, vol.12, pp.2121-2159, 2011.

D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent et al., Why does unsupervised pre-training help deep learning?, Journal of Machine Learning Research, vol.11, pp.625-660, 2010.

A. P. George and W. B. Powell, Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, Machine Learning, vol.65, issue.1, pp.167-198, 2006.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp.249-256, 2010.

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.

N. Hansen and A. Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation, vol.9, issue.2, pp.159-195, 2001.

M. Hardt, B. Recht, and Y. Singer, Train faster, generalize better: Stability of stochastic gradient descent, Proceedings of the 33rd International Conference on Machine Learning, 2016.

D. Hinkley, Inference about the change-point from cumulative sum tests, Biometrika, vol.58, issue.3, pp.509-523, 1971.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proceedings of the 32nd International Conference on Machine Learning, pp.448-456, 2015.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, CoRR, abs/1412.6980, 2014.

A. Krizhevsky, Learning multiple layers of features from tiny images, Technical report, University of Toronto, 2009.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol.86, issue.11, pp.2278-2324, 1998.

Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k^2), Soviet Mathematics Doklady, vol.27, pp.372-376, 1983.

E. Page, Continuous inspection schemes, Biometrika, vol.41, issue.1-2, pp.100-115, 1954.

R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, Proceedings of the 30th International Conference on Machine Learning, pp.1310-1318, 2013.

B. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, vol.4, issue.5, pp.1-17, 1964.

H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, vol.22, issue.3, pp.400-407, 1951.

T. Schaul, S. Zhang, and Y. LeCun, No more pesky learning rates, Proceedings of the 30th International Conference on Machine Learning, pp.343-351, 2013.

I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, Proceedings of the 30th International Conference on Machine Learning, pp.1139-1147, 2013.

T. Tieleman and G. Hinton, Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 2012.

D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters et al., Natural evolution strategies, 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), 2008.

M. D. Zeiler, ADADELTA: An adaptive learning rate method, CoRR, abs/1212.5701, 2012.

M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, Proceedings of the Twentieth International Conference on Machine Learning, ICML'03, pp.928-935, 2003.