
A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, SC '10: The 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010.

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High performance fault tolerance interface for hybrid systems, SC '11: The 2011 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, p.32, 2011.
URL: https://hal.archives-ouvertes.fr/hal-01298430

I. S. Reed and G. Solomon, Polynomial codes over certain finite fields, Journal of the Society for Industrial and Applied Mathematics, vol.8, issue.2, pp.300-304, 1960.

B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror, and F. Cappello, VeloC: Towards high performance adaptive asynchronous checkpointing at large scale, IPDPS'19: The 2019 IEEE International Parallel and Distributed Processing Symposium, pp.911-920, 2019.
URL: https://hal.archives-ouvertes.fr/hal-02184203

S. Tseng, B. Nicolae, G. Bosilca, E. Jeannot, and F. Cappello, Towards portable online prediction of network utilization using MPI-level monitoring, Euro-Par '19: 25th International European Conference on Parallel and Distributed Computing, pp.1-14, 2019.
URL: https://hal.archives-ouvertes.fr/hal-02184204

M. Dorier, G. Antoniu, F. Cappello, M. Snir, and L. Orf, Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O, CLUSTER '12: Proceedings of the 2012 IEEE International Conference on Cluster Computing, pp.155-163, 2012.
URL: https://hal.archives-ouvertes.fr/hal-00715252

T. Ilsche, J. Schuchart, J. Cope, D. Kimpe, T. Jones et al., Optimizing I/O forwarding techniques for extreme-scale event tracing, Cluster Computing, vol.17, issue.1, pp.1-18, 2014.

B. Nicolae and F. Cappello, AI-Ckpt: Leveraging memory access patterns for adaptive asynchronous incremental checkpointing, HPDC '13: 22nd International ACM Symposium on High-Performance Parallel and Distributed Computing, pp.155-166, 2013.
URL: https://hal.archives-ouvertes.fr/hal-00809847

Y. Zhu, W. Yu, B. Jiao, K. Mohror, A. Moody et al., Efficient user-level storage disaggregation for deep learning, 2019 IEEE International Conference on Cluster Computing (CLUSTER), pp.1-12, 2019.

S. Pumma, M. Si, W. Feng, and P. Balaji, Parallel I/O optimizations for scalable deep learning, 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), pp.720-729, 2017.

Z. Zhang, L. Huang, U. Manor, L. Fang, G. Merlo et al., FanStore: Enabling efficient and scalable I/O for distributed deep learning, 2018.

Y. Zhu, F. Chowdhury, H. Fu, A. Moody, K. Mohror et al., Entropy-aware I/O pipelining for large-scale deep learning on HPC systems, 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp.145-156, 2018.

J. Wozniak, R. Jain, and P. Balaprakash, CANDLE/Supervisor: A workflow framework for machine learning applied to cancer research, BMC Bioinformatics, vol.19, 2018.

M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue et al., Population based training of neural networks, 2017.

P. Balaprakash, M. Salim, T. Uram, V. Vishwanath, and S. Wild, DeepHyper: Asynchronous hyperparameter search for deep neural networks, 2018 IEEE 25th International Conference on High Performance Computing (HiPC), pp.42-51, 2018.

G. K. Kaul and D. Golovin, Hyperparameter tuning in Cloud Machine Learning Engine using Bayesian optimization, 2017.

J. M. Wozniak, P. E. Davis, T. Shu, J. Ozik, N. Collier et al., Scaling deep learning for cancer with advanced workflow storage integration, Proceedings of Machine Learning in High Performance Computing Environments (MLHPC), 2018.

C. Docan, M. Parashar, and S. Klasky, DataSpaces: An interaction and coordination framework for coupled simulation workflows, Cluster Computing, vol.15, issue.2, pp.163-181, 2012.

S. Jin, S. Di, X. Liang, J. Tian, D. Tao et al., DeepSZ: A novel framework to compress deep neural networks by using error-bounded lossy compression, 2019.

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, Quantized neural networks: Training neural networks with low precision weights and activations, The Journal of Machine Learning Research, vol.18, issue.1, pp.6869-6898, 2017.

Y. Gong, L. Liu, M. Yang, and L. Bourdev, Compressing deep convolutional networks using vector quantization, 2014.

B. Nicolae, Towards scalable checkpoint restart: A collective inline memory contents deduplication proposal, IPDPS '13: The 27th IEEE International Parallel and Distributed Processing Symposium, pp.1-10, 2013.
URL: https://hal.archives-ouvertes.fr/hal-00781532

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR'16: 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016.

M. Abadi, A. Agarwal, and P. Barham, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

Y. Jia, E. Shelhamer, and J. Donahue, Caffe: Convolutional architecture for fast feature embedding, MM '14: The 22nd ACM International Conference on Multimedia, pp.675-678, 2014.

Torch: A scientific computing framework for LuaJIT.

A. Sergeev and M. Del Balso, Meet Horovod: Uber's open source distributed deep learning framework for TensorFlow.

F. Chollet et al., Keras repository, 2015.

J. Deng, W. Dong, and R. Socher, ImageNet: A large-scale hierarchical image database, CVPR'09: Conference on Computer Vision and Pattern Recognition, pp.248-255, 2009.

J. Li, B. Nicolae, J. Wozniak, and G. Bosilca, Understanding scalability and fine-grain parallelism of synchronous data parallel training, 5th Workshop on Machine Learning in HPC Environments, 2019.