C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., Going deeper with convolutions, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, Densely connected convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2017.

Y. You, Z. Zhang, J. Demmel, K. Keutzer, and C. Hsieh, ImageNet training in 24 minutes, 2017.

R. Hemenway, High bandwidth, low latency, burst-mode optical interconnect for high performance computing systems, Conference on Lasers and Electro-Optics, vol.1, p.4, 2004.

J. Liu, W. Yu, J. Wu, D. Buntinas, D. K. Panda et al., Microbenchmark performance comparison of high-speed cluster interconnects, IEEE Micro, vol.24, issue.1, pp.42-51, 2004.

S. Falkner, A. Klein, and F. Hutter, BOHB: Robust and efficient hyperparameter optimization at scale, 2018.

O. Bousquet, S. Gelly, K. Kurach, M. Schoenauer, M. Sebag et al., Toward optimal run racing: Application to deep learning calibration, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01634381

M. Zinkevich, M. Weimer, L. Li, and A. J. Smola, Parallelized stochastic gradient descent, Advances in neural information processing systems, pp.2595-2603, 2010.

T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang, GPU asynchronous stochastic gradient descent to speed up neural network training, 2013.

W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang et al., TernGrad: Ternary gradients to reduce communication in distributed deep learning, Advances in Neural Information Processing Systems, vol.30, pp.1509-1519, 2017.

D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, QSGD: Communication-efficient SGD via gradient quantization and encoding, Advances in Neural Information Processing Systems, vol.30, pp.1709-1720, 2017.

X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp.249-256, 2010.

N. Kukreja, A. Shilova, O. Beaumont, J. Hückelheim, N. Ferrier et al., Training on the edge: The why and the how, 1st Workshop on Parallel AI and Systems for the Edge, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02069728

N. Jouppi, C. Young, N. Patil, and D. Patterson, Motivation for and evaluation of the first tensor processing unit, IEEE Micro, vol.38, issue.3, pp.10-19, 2018.

J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin et al., Large scale distributed deep networks, Advances in neural information processing systems, pp.1223-1231, 2012.

D. Das, S. Avancha, D. Mudigere, K. Vaidyanathan, S. Sridharan et al., Distributed deep learning using synchronous stochastic gradient descent, 2016.

R. Sethi, Complete register allocation problems, SIAM journal on Computing, vol.4, issue.3, pp.226-248, 1975.

J. W. H. Liu, An application of generalized tree pebbling to sparse matrix factorization, SIAM Journal on Algebraic Discrete Methods, vol.8, issue.3, pp.375-395, 1987.

E. Kayaaslan, T. Lambert, L. Marchal, and B. Uçar, Scheduling series-parallel task graphs to minimize peak memory, Theoretical Computer Science, vol.707, pp.1-23, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01397299

A. Griewank, On automatic differentiation, Mathematical Programming: Recent Developments and Applications, vol.6, pp.83-107, 1989.

G. Aupy, J. Herrmann, P. Hovland, and Y. Robert, Optimal multistage algorithm for adjoint computation, SIAM Journal on Scientific Computing, vol.38, issue.3, pp.232-255, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01354902

A. Griewank and A. Walther, Algorithm 799: Revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation, ACM Transactions on Mathematical Software (TOMS), vol.26, issue.1, pp.19-45, 2000.

A. Gruslys, R. Munos, I. Danihelka, M. Lanctot, and A. Graves, Memory-efficient backpropagation through time, Advances in Neural Information Processing Systems, pp.4125-4133, 2016.

T. Chen, B. Xu, C. Zhang, and C. Guestrin, Training deep nets with sublinear memory cost, 2016.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang et al., Automatic differentiation in PyTorch, 2017.

Periodic checkpointing in PyTorch, 2018.

A. Adcroft, J.-M. Campin, C. Dutkiewicz, C. Evangelinos, G. Ferreira et al., MITgcm user manual, 2008.

P. Brubaker, Engineering Design Optimization using Calculus Level Methods, 2016.

N. Kukreja, J. Hückelheim, and G. Gorman, Backpropagation for long sequences: beyond memory constraints with constant overheads, 2018.

G. C. Pringle, S. Jones, S. Goswami, S. H. K. Narayanan, and D. Goldberg, Providing the ARCHER community with adjoint modelling tools for high-performance oceanographic and cryospheric computation, 2016.

M. Louboutin, M. Lange, F. Luporini, N. Kukreja, P. A. Witte et al., Devito: an embedded domain-specific language for finite differences and geophysical exploration, 2018.

A. Griewank, Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation, Optimization Methods and Software, vol.1, issue.1, pp.35-54, 1992.

J. Grimm, L. Pottier, and N. Rostaing-Schmidt, Optimal time and minimum space-time product for reversing a certain class of programs, Computational Differentiation: Techniques, Applications, and Tools, pp.95-106, 1996.
URL : https://hal.archives-ouvertes.fr/inria-00073896

P. Stumm and A. Walther, Multistage approaches for optimal offline checkpointing, SIAM Journal on Scientific Computing, vol.31, issue.3, pp.1946-1967, 2009.

G. Aupy and J. Herrmann, Periodicity in optimal hierarchical checkpointing schemes for adjoint computations, Optimization Methods and Software, vol.32, issue.3, pp.594-624, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01654632

M. Schanen, O. Marin, H. Zhang, and M. Anitescu, Asynchronous two-level checkpointing scheme for large-scale adjoints in the spectral-element solver Nek5000, Procedia Computer Science, vol.80, pp.1147-1158, 2016.

G. Aupy and J. Herrmann, H-Revolve: A Framework for Adjoint Computation on Synchronous Hierarchical Platforms, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02080706

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR, abs/1409.1556, 2014.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.

J. L. Elman, Finding structure in time, Cognitive Science, vol.14, issue.2, pp.179-211, 1990.

J. Marin, A. Biswas, F. Ofli, N. Hynes, A. Salvador et al., Recipe1M: A dataset for learning cross-modal embeddings for cooking recipes and food images, 2018.

M. Mueller, A. Arzt, S. Balke, M. Dorfer, and G. Widmer, Cross-modal music retrieval and applications: An overview of key methodologies, IEEE Signal Processing Magazine, vol.36, issue.1, pp.52-62, 2019.

J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, Signature verification using a "Siamese" time delay neural network, Advances in Neural Information Processing Systems, pp.737-744, 1994.

W. Du, M. Fang, and M. Shen, Siamese convolutional neural networks for authorship verification, Proceedings, 2017.

J. Masci, D. Migliore, M. M. Bronstein, and J. Schmidhuber, Descriptor learning for omnidirectional image matching, Registration and Recognition in Images and Videos, pp.49-62, 2014.

E. Hoffer and N. Ailon, Deep metric learning using triplet network, International Workshop on Similarity-Based Pattern Recognition, pp.84-92, 2015.