Typically, for each configuration, we test several learning rates $\eta \in \{10^{-2}, 10^{-3}, 10^{-4}\}$, and we select the learning rate $\eta^*$ which led to the best accuracy.
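For concreteness, the selection loop can be written as the following minimal sketch; the helpers `build_model`, `train`, and `evaluate_accuracy` are hypothetical placeholders for the experiment-specific routines, not functions defined in this work.

```python
# Minimal sketch of the learning-rate grid search described above.
# `build_model`, `train`, and `evaluate_accuracy` are assumed helpers:
# fresh model construction, full training at a fixed learning rate,
# and validation accuracy, respectively.

def select_learning_rate(build_model, train, evaluate_accuracy,
                         candidate_lrs=(1e-2, 1e-3, 1e-4)):
    """Train one model per candidate learning rate and keep the best one."""
    best_lr, best_acc = None, float("-inf")
    for lr in candidate_lrs:
        model = build_model()             # re-initialize for each run
        train(model, learning_rate=lr)    # train with this learning rate
        acc = evaluate_accuracy(model)    # accuracy after training
        if acc > best_acc:
            best_lr, best_acc = lr, acc
    return best_lr, best_acc
```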

J. M. Alvarez and M. Salzmann, Learning the number of neurons in deep networks, Advances in Neural Information Processing Systems, pp.2270-2278, 2016.

S. Amari, Natural gradient works efficiently in learning, Neural Computation, vol.10, pp.251-276, 1998.

J. Ba and R. Caruana, Do deep nets really need to be deep?, Advances in neural information processing systems, pp.2654-2662, 2014.

B. Baker, O. Gupta, N. Naik, and R. Raskar, Designing neural network architectures using reinforcement learning, 2016.

A. G. Baydin, R. Cornish, D. M. Rubio, M. Schmidt, and F. Wood, Online learning rate adaptation with hypergradient descent, International Conference on Learning Representations, 2018.

Y. Bengio, Practical recommendations for gradient-based training of deep architectures, Neural Networks: Tricks of the Trade, pp.437-478, Springer, 2012.

Y. Bengio, N. L. Roux, P. Vincent, O. Delalleau, and P. Marcotte, Convex neural networks, Advances in Neural Information Processing Systems, vol.18, pp.123-130, 2006.

J. Bergstra, D. Yamins, and D. D. Cox, Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures, 2013.

A. Brock, T. Lim, J. M. Ritchie, and N. Weston, SMASH: One-shot model architecture search through hypernetworks, 6th International Conference on Learning Representations, 2018.

G. Brockman, V. Cheung, L. Pettersson, et al., OpenAI Gym, 2016.

W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, Compressing neural networks with the hashing trick, International Conference on Machine Learning, pp.2285-2294, 2015.

G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems, vol.2, issue.4, pp.303-314, 1989.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A Large-Scale Hierarchical Image Database, CVPR09, 2009.

M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, Predicting parameters in deep learning, Advances in Neural Information Processing Systems, vol.26, pp.2148-2156, 2013.

M. Denkowski and G. Neubig, Stronger baselines for trustable results in neural machine translation, 2017.

S. S. Du, X. Zhai, B. Poczos, and A. Singh, Gradient descent provably optimizes over-parameterized neural networks, 2019.

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, JMLR, vol.12, pp.2121-2159, 2011.

A. Erraqabi and N. L. Roux, Combining adaptive algorithms and hypergradient method: a performance and robustness study, 2018.

J. Frankle and M. Carbin, The Lottery Ticket Hypothesis: Finding Small, Trainable Neural Networks, 2018.

J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin, The lottery ticket hypothesis at scale, 2019.

J. Friedman, T. Hastie, and R. Tibshirani, 2010.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp.249-256, 2010.

Y. Gong, L. Liu, M. Yang, and L. Bourdev, Compressing deep convolutional networks using vector quantization, 2014.

A. Graves, Practical variational inference for neural networks, Advances in Neural Information Processing Systems, pp.2348-2356, 2011.

I. Guyon, I. Chaabane, H. J. Escalante, S. Escalera, D. Jajetic et al., A brief review of the ChaLearn AutoML challenge: any-time any-dataset learning without human intervention, Workshop on Automatic Machine Learning, pp.21-30, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01381145

S. Han, H. Mao, and W. J. Dally, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, 2015.

S. Han, J. Pool, J. Tran, and W. Dally, Learning both weights and connections for efficient neural network, Advances in neural information processing systems, pp.1135-1143, 2015.

B. Hassibi and D. G. Stork, Second order derivatives for network pruning: Optimal brain surgeon, Advances in neural information processing systems, pp.164-171, 1993.

K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, Proceedings of the IEEE international conference on computer vision, pp.1026-1034, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, pp.770-778, 2016.

M. Herbster and M. K. Warmuth, Tracking the best expert, Machine learning, vol.32, issue.2, pp.151-178, 1998.

G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, 2015.

G. E. Hinton and D. van Camp, Keeping neural networks simple, International Conference on Artificial Neural Networks, pp.11-18, 1993.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, vol.9, issue.8, pp.1735-1780, 1997.

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, et al., MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017.

G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, Densely connected convolutional networks, CVPR, 2017.

Z. Huang and N. Wang, Data-driven sparse structure selection for deep neural networks, Proceedings of the European Conference on Computer Vision (ECCV), pp.304-320, 2018.

L. Hörmander, The Analysis of Linear Partial Differential Operators I, 1998.

R. A. Jacobs, Increased rates of convergence through learning rate adaptation, Neural Networks, vol.1, issue.4, pp.295-307, 1988.

A. Jacot, F. Gabriel, and C. Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in neural information processing systems, pp.8571-8580, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01824549

S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer et al., Three factors influencing minima in SGD, 2017.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, An introduction to variational methods for graphical models, Machine Learning, vol.37, issue.2, pp.183-233, 1999.

R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, Exploring the limits of language modeling, 2016.

R. Jozefowicz, W. Zaremba, and I. Sutskever, An empirical exploration of recurrent network architectures, International Conference on Machine Learning, pp.2342-2350, 2015.

N. S. Keskar and R. Socher, Improving generalization performance by switching from Adam to SGD, 2017.

Kianglu, pytorch-cifar, 2018.

D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, International Conference on Learning Representations, 2015.

D. P. Kingma, T. Salimans, and M. Welling, Variational dropout and the local reparameterization trick, Advances in Neural Information Processing Systems, pp.2575-2583, 2015.

W. Koolen and S. de Rooij, 2008.

A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009.

A. Krizhevsky, One weird trick for parallelizing convolutional neural networks, 2014.

K. Kurita, Learning Rate Tuning in Deep Learning: A Practical Guide | Machine Learning Explained, 2018.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, et al., Backpropagation applied to handwritten zip code recognition, Neural Computation, vol.1, issue.4, pp.541-551, 1989.

Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, Efficient backprop, Neural Networks: Tricks of the Trade, pp.9-50, 1998.

Y. LeCun, J. S. Denker, and S. A. Solla, Optimal brain damage, Advances in Neural Information Processing Systems, pp.598-605, 1990.

J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, et al., Wide neural networks of any depth evolve as linear models under gradient descent, 2019.

C. Li, C. Chen, D. E. Carlson, and L. Carin, Preconditioned stochastic gradient Langevin dynamics for deep neural networks, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp.1788-1794, 2016.

H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, Pruning filters for efficient ConvNets, 2016.

L. Li, K. Jamieson, G. Desalvo, A. Rostamizadeh, and A. Talwalkar, Hyperband: A novel bandit-based approach to hyperparameter optimization, JMLR, vol.18, issue.1, pp.6765-6816, 2017.

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez et al., Continuous control with deep reinforcement learning, 2015.

C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua et al., Progressive neural architecture search, Proceedings of the European Conference on Computer Vision (ECCV), pp.19-34, 2018.

H. Liu, K. Simonyan, and Y. Yang, Darts: Differentiable architecture search, 2018.

Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, et al., Learning efficient convolutional networks through network slimming, IEEE International Conference on Computer Vision (ICCV), pp.2755-2763, 2017.

Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, Rethinking the value of network pruning, 2018.

C. Louizos, K. Ullrich, and M. Welling, Bayesian compression for deep learning, Advances in Neural Information Processing Systems, pp.3288-3298, 2017.

D. Mack, How to pick the best learning rate for your machine learning project, 2016.

D. J. C. MacKay, Bayesian model comparison and backprop nets, Advances in neural information processing systems, pp.839-846, 1992.

D. J. C. MacKay, A practical Bayesian framework for backpropagation networks, Neural Computation, vol.4, issue.3, pp.448-472, 1992.

D. J. C. MacKay, Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks, Network: Computation in Neural Systems, vol.6, pp.469-505, 1995.

D. J. C. MacKay, Information Theory, Inference and Learning Algorithms, 2003.

D. Maclaurin, D. Duvenaud, and R. Adams, Gradient-based hyperparameter optimization through reversible learning, International Conference on Machine Learning, pp.2113-2122, 2015.

A. R. Mahmood, R. S. Sutton, T. Degris, and P. M. Pilarski, Tuning-free step-size adaptation, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.2121-2124, 2012.

G. Marceau-Caron and Y. Ollivier, Natural Langevin dynamics for neural networks, International Conference on Geometric Science of Information, pp.451-459, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01655949

M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, Building a large annotated corpus of English: the Penn Treebank, Computational Linguistics, vol.19, issue.2, pp.313-330, 1993.

P.-Y. Massé and Y. Ollivier, Speed learning on the fly, 2015.

A. G. de G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani, Gaussian process behaviour in wide deep neural networks, 2018.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness et al., Human-level control through deep reinforcement learning, Nature, vol.518, issue.7540, p.529, 2015.

D. Molchanov, A. Ashukha, and D. Vetrov, Variational dropout sparsifies deep neural networks, Proceedings of the 34th International Conference on Machine Learning, vol.70, pp.2498-2507, 2017.

K. Murray and D. Chiang, Auto-sizing neural networks: With applications to n-gram language models, 2015.

R. M. Neal, Bayesian learning for neural networks, 1995.

B. A. Olshausen and D. J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vision Research, vol.37, pp.3311-3325, 1997.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang et al., Automatic differentiation in pytorch, 2017.

A. D. Polyanin and A. V. Manzhirov, Handbook of Integral Equations, 1998.

E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu et al., Large-scale evolution of image classifiers, Proceedings of the 34th International Conference on Machine Learning, vol.70, pp.2902-2911, 2017.

H. Robbins and S. Monro, A stochastic approximation method, Annals of Mathematical Statistics, vol.22, pp.400-407, 1951.

A. M. Saxe, P. W. Koh, Z. Chen, M. Bhand, B. Suresh, et al., On random weights and unsupervised feature learning, ICML, pp.1089-1096, 2011.

S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini, Group sparse regularization for deep neural networks, Neurocomputing, vol.241, pp.81-89, 2017.

T. Schaul, S. Zhang, and Y. LeCun, No more pesky learning rates, International Conference on Machine Learning, pp.343-351, 2013.

N. N. Schraudolph, Local gain adaptation in stochastic gradient descent, 1999.

A. See, M. Luong, and C. Manning, Compression of neural machine translation models via pruning, p.291, 2016.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, vol.15, pp.1929-1958, 2014.

K. O. Stanley and R. Miikkulainen, Evolving neural networks through augmenting topologies, Evolutionary Computation, vol.10, issue.2, pp.99-127, 2002.

E. M. Stein and G. Weiss, Introduction to Fourier Analysis on Euclidean Spaces, vol.32, 2016.

P. Surmenok, Estimating an Optimal Learning Rate For a Deep Neural Network, 2017.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going deeper with convolutions, CVPR, pp.1-9, 2015.

S. Theodoridis, Machine learning: a Bayesian and optimization perspective, 2015.

R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological), vol.58, issue.1, pp.267-288, 1996.

T. Tieleman and G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, vol.4, pp.26-31, 2012.

T. van Erven, P. Grünwald, and S. de Rooij, Catching up faster by switching sooner: A predictive approach to adaptive estimation with an application to the AIC-BIC dilemma, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.74, issue.3, pp.361-417, 2012.

T. van Erven, S. de Rooij, and P. Grünwald, Catching up faster in Bayesian model selection and model averaging, Advances in Neural Information Processing Systems, vol.20, pp.417-424, 2008.

P. A. J. Volf and F. M. J. Willems, Switching between two universal source coding algorithms, Proceedings of the Data Compression Conference (DCC '98), pp.491-500, 1998.

L. Wasserman, Bayesian Model Selection and Model Averaging, Journal of Mathematical Psychology, vol.44, 2000.

P. J. Werbos, Backpropagation through time: what it does and how to do it, Proceedings of the IEEE, vol.78, issue.10, pp.1550-1560, 1990.

A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, The marginal value of adaptive gradient methods in machine learning, NIPS, pp.4148-4158, 2017.

M. Yuan and Y. Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol.68, issue.1, pp.49-67, 2006.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, 2017.

C. Zhang, S. Bengio, and Y. Singer, Are all layers created equal?, 2019.

B. Zoph and Q. V. Le, Neural architecture search with reinforcement learning, 2016.