C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

G. Huang, Z. Liu, L. Van-der-maaten, and K. Q. Weinberger, Densely connected convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski et al., Accurate, large minibatch sgd: Training imagenet in 1 hour, 2017.

Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer, Imagenet training in minutes, 2017.

S. Bulò, L. Porzi, and P. Kontschieder, In-place activated batchnorm for memory-optimized training of dnns, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.5639-5647, 2018.

G. Pleiss, D. Chen, G. Huang, T. Li, L. Van-der-maaten et al., Memory-efficient implementation of densenets, 2017.

M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design, The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p.18, 2016.

A. Garg and P. Kulkarni, Dynamic memory management for gpu-based training of deep neural networks, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019.

M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, Parallelized stochastic gradient descent, Advances in neural information processing systems, pp.2595-2603, 2010.

T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang, Gpu asynchronous stochastic gradient descent to speed up neural network training, 2013.

J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin et al., Large scale distributed deep networks, Advances in neural information processing systems, pp.1223-1231, 2012.

A. Griewank, Mathematical Programming: recent developments and applications, vol.6, pp.83-107, 1989.

A. Adcroft, J. Campin, S. Dutkiewicz, C. Evangelinos, D. Ferreira et al., Mitgcm user manual, 2008.

P. Brubaker, Engineering Design Optimization using Calculus Level Methods, 2016.

A. Griewank and A. Walther, Algorithm 799: Revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation, ACM Transactions on Mathematical Software (TOMS), vol.26, issue.1, pp.19-45, 2000.

A. Gruslys, R. Munos, I. Danihelka, M. Lanctot, and A. Graves, Memory-efficient backpropagation through time, Advances in Neural Information Processing Systems, pp.4125-4133, 2016.

T. Chen, B. Xu, C. Zhang, and C. Guestrin, Training deep nets with sublinear memory cost, 2016.

N. Kukreja, J. Hückelheim, and G. J. Gorman, Backpropagation for long sequences: beyond memory constraints with constant overheads, 2018.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang et al., Automatic differentiation in pytorch, 2017.

, Periodic checkpointing in pytorch, 2018.