M. Abadi, A. Agarwal, and P. Barham, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

, The FAIR guiding principles for scientific data management and stewardship, Scientific Data, vol.3, 2016.

P. Balaprakash, R. Egele, M. Salim, S. M. Wild, V. Vishwanath et al., Scalable reinforcement-learning-based neural architecture search for cancer deep learning research, SC'19: The 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, vol.37, p.33, 2019.

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High performance fault tolerance interface for hybrid systems, SC '11: The 2011 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, vol.32, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00721216

J. Bernard, Mercurial-revision control approximated, Linux J, issue.212, 2011.

S. Bhuiyan, M. Zheludkov, and T. Isachenko, High Performance In-memory Computing with Apache Ignite, 2017.

L. Cao, B. W. Settlemyer, and J. Bent, To share or not to share: Comparing burst buffer architectures, HPC '17: The 25th High Performance Computing Symposium, vol.4, pp.1-4, 2017.

S. Chacon and B. Straub, , 2014.

R. Chard, L. Ward, Z. Li, Y. Babuji, A. Woodard et al., Publishing and serving machine learning models with dlhub, PEARC '19: Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning), 2019.

B. Collins-Sussman, The subversion project: Building a better cvs, Linux J, issue.94, 2002.

J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin et al., Large scale distributed deep networks, NIPS'12: The 25th International Conference on Neural Information Processing Systems, pp.1223-1231, 2012.

M. Lawson, C. Ulmer, S. Mukherjee, G. Templet, J. F. Lofstead et al., Empress: extensible metadata provider for extreme-scale scientific simulations, PDSW-DISCS@SC'17: The 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, pp.19-24, 2017.

J. Li, B. Nicolae, J. Wozniak, and G. Bosilca, Understanding scalability and fine-grain parallelism of synchronous data parallel training, MLHPC'19: 5th Workshop on Machine Learning in HPC Environments (in conjunction with SC19), pp.1-8, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02570148

G. Lockwood, D. Hazen, Q. Koziol, R. Canon, K. Antypas et al., Storage 2020: A vision for the future of hpc storage, 2017.

J. Lofstead, J. Baker, and A. Younge, Data pallets: Containerizing storage for reproducibility and traceability, ISC'19: 2019 International Conference on High Performance Computing, pp.36-45, 2019.

J. Lofstead, I. Jimenez, C. Maltzahn, Q. Koziol, J. Bent et al., Daos and friends: A proposal for an exascale storage system, SC '16: The 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, 2016.

D. Merkel, Docker: Lightweight linux containers for consistent development and deployment, Linux J, issue.239, 2014.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, SC '10: The 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1:1-1:11, 2010.

D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur et al., Pipedream: Generalized pipeline parallelism for dnn training, SOSP '19: The 27th ACM Symposium on Operating Systems Principles, pp.1-15, 2019.

B. Nicolae, Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal, IPDPS '13: The 27th IEEE International Parallel and Distributed Processing Symposium, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00781532

B. Nicolae, Leveraging naturally distributed data redundancy to reduce collective I/O replication overhead, IPDPS '15: 29th IEEE International Parallel and Distributed Processing Symposium, pp.1023-1032, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01115700

B. Nicolae, G. Antoniu, L. Bougé, D. Moise, and A. Carpen-Amarie, BlobSeer: Next-generation data management for large scale infrastructures, J. Parallel Distrib. Comput, vol.71, pp.169-184, 2011.

B. Nicolae, J. Li, J. Wozniak, G. Bosilca, M. Dorier et al., Deepfreeze: Towards scalable asynchronous checkpointing of deep learning models, CCGrid'20: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, pp.172-181, 2020.
URL : https://hal.archives-ouvertes.fr/hal-02543977

B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror, and F. Cappello, VeloC: Towards high performance adaptive asynchronous checkpointing at large scale, IPDPS'19: The 2019 IEEE International Parallel and Distributed Processing Symposium, pp.911-920, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02184203

B. Nicolae, J. M. Wozniak, M. Dorier, and F. Cappello, DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training, CLUSTER'20: The 2020 IEEE International Conference on Cluster Computing, 2020.
URL : https://hal.archives-ouvertes.fr/hal-02914545

E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu et al., Large-scale evolution of image classifiers, ICML'17: The 34th International Conference on Machine Learning, pp.2902-2911, 2017.

N. Saurabh, D. Kimovski, S. Ostermann, and R. Prodan, Vm image repository and distribution models for federated clouds: State of the art, possible directions and open issues, Euro-Par 2016: Parallel Processing Workshops, pp.260-271, 2016.

J. G. Shanahan and L. Dai, Large scale distributed data science using apache spark, KDD '15: The 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.2323-2324, 2015.

H. Shu and H. Zhu, Sensitivity analysis of deep neural networks, AAAI'19: The 33rd AAAI Conference on Artificial Intelligence, pp.4943-4950, 2019.

S. Teerapittayanon, B. McDanel, and H. T. Kung, Branchynet: Fast inference via early exiting from deep neural networks, ICPR'16: The 23rd International Conference on Pattern Recognition, pp.2464-2469, 2016.

S. M. Tseng, B. Nicolae, G. Bosilca, E. Jeannot, and F. Cappello, Towards portable online prediction of network utilization using MPI-level monitoring, EuroPar'19: 25th International European Conference on Parallel and Distributed Systems, pp.1-14, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02184204

J. Wozniak, R. Jain, and P. Balaprakash, Candle/supervisor: A workflow framework for machine learning applied to cancer research, BMC Bioinformatics, vol.19, issue.491, 2018.

M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma et al., Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, NSDI'12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pp.2-2, 2012.

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, Spark: Cluster computing with working sets, HotCloud'10: The 2nd USENIX Conference on Hot Topics in Cloud Computing, pp.10-10, 2010.

S. Zhang, W. Boehmer, and S. Whiteson, Deep residual reinforcement learning, AAMAS '20: The 19th International Conference on Autonomous Agents and MultiAgent Systems, pp.1611-1619, 2020.