S. Amari. Natural gradient works efficiently in learning. Neural Computation, vol. 10, no. 2, pp. 251-276, 1998.
DOI: 10.1162/089976698300017746
M. Andrychowicz, M. Denil, S. Gómez Colmenarejo, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems 29, pp. 3981-3989, 2016.
Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
DOI: 10.1109/72.279181
URL: http://www.research.microsoft.com/~patrice/PDF/long_term.pdf
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems 20, pp. 161-168, 2008.
R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. BigLearn, NIPS Workshop, 2011.
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems 27, pp. 1646-1654, 2014.
URL: https://hal.archives-ouvertes.fr/hal-01016843
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.
D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, vol. 11, pp. 625-660, 2010.
A. P. George and W. B. Powell. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, vol. 65, no. 1, pp. 167-198, 2006.
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249-256, 2010.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, vol. 9, no. 2, pp. 159-195, 2001.
DOI: 10.1162/106365601750190398
M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
D. V. Hinkley. Inference about the change-point from cumulative sum tests. Biometrika, vol. 58, no. 3, pp. 509-523, 1971.
DOI: 10.1093/biomet/58.3.509
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 448-456, 2015.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, vol. 27, pp. 372-376, 1983.
E. S. Page. Continuous inspection schemes. Biometrika, vol. 41, no. 1-2, pp. 100-115, 1954.
DOI: 10.1093/biomet/41.1-2.100
R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 1310-1318, 2013.
B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1-17, 1964.
DOI: 10.1016/0041-5553(64)90137-5
H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400-407, 1951.
DOI: 10.1214/aoms/1177729586
T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 343-351, 2013.
I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 1139-1147, 2013.
T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Natural evolution strategies. 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pp. 3381-3387, 2008.
DOI: 10.1109/CEC.2008.4631255
URL: http://arxiv.org/pdf/1106.4487
M. D. Zeiler. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701, 2012.
M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the Twentieth International Conference on Machine Learning (ICML), pp. 928-935, 2003.