TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
The FAIR guiding principles for scientific data management and stewardship, Scientific Data, vol.3, 2016.
Scalable reinforcement-learning-based neural architecture search for cancer deep learning research, SC'19: The 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, vol.37, p.33, 2019.
FTI: High performance fault tolerance interface for hybrid systems, SC '11: The 2011 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, vol.32, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00721216
Mercurial-revision control approximated, Linux J, issue.212, 2011.
High Performance In-memory Computing with Apache Ignite, 2017.
To share or not to share: Comparing burst buffer architectures, HPC '17: The 25th High Performance Computing Symposium, vol.4, pp.1-4, 2017.
Publishing and serving machine learning models with dlhub, PEARC '19: Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning), 2019.
The subversion project: Building a better cvs, Linux J, issue.94, 2002.
Large scale distributed deep networks, NIPS'12: The 25th International Conference on Neural Information Processing Systems, pp.1223-1231, 2012.
Empress: extensible metadata provider for extreme-scale scientific simulations, PDSW-DISCS@SC'17: The 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, pp.19-24, 2017.
Understanding scalability and fine-grain parallelism of synchronous data parallel training, MLHPC'19: 5th Workshop on Machine Learning in HPC Environments (in conjunction with SC19), pp.1-8, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02570148
Storage 2020: A vision for the future of hpc storage, 2017.
Data pallets: Containerizing storage for reproducibility and traceability, ISC'19: 2019 International Conference on High Performance Computing, pp.36-45, 2019.
Daos and friends: A proposal for an exascale storage system, SC '16: The 2016 International Conference for High Performance Computing, Networking, Storage and Analysis, 2016.
Docker: Lightweight linux containers for consistent development and deployment, Linux J, issue.239, 2014.
Design, modeling, and evaluation of a scalable multi-level checkpointing system, SC '10: The 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1:1-1:11, 2010.
Pipedream: Generalized pipeline parallelism for dnn training, SOSP '19: The 27th ACM Symposium on Operating Systems Principles, pp.1-15, 2019.
Towards Scalable Checkpoint Restart: A Collective Inline Memory Contents Deduplication Proposal, IPDPS '13: The 27th IEEE International Parallel and Distributed Processing Symposium, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00781532
Leveraging naturally distributed data redundancy to reduce collective I/O replication overhead, IPDPS '15: 29th IEEE International Parallel and Distributed Processing Symposium, pp.1023-1032, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01115700
BlobSeer: Next-generation data management for large scale infrastructures, J. Parallel Distrib. Comput, vol.71, pp.169-184, 2011.
Deepfreeze: Towards scalable asynchronous checkpointing of deep learning models, CCGrid'20: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, pp.172-181, 2020.
URL : https://hal.archives-ouvertes.fr/hal-02543977
VeloC: Towards high performance adaptive asynchronous checkpointing at large scale, IPDPS'19: The 2019 IEEE International Parallel and Distributed Processing Symposium, pp.911-920, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02184203
DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training, CLUSTER'20: The 2020 IEEE International Conference on Cluster Computing, 2020.
URL : https://hal.archives-ouvertes.fr/hal-02914545
Large-scale evolution of image classifiers, ICML'17: The 34th International Conference on Machine Learning, pp.2902-2911, 2017.
Vm image repository and distribution models for federated clouds: State of the art, possible directions and open issues, Euro-Par 2016: Parallel Processing Workshops, pp.260-271, 2016.
Large scale distributed data science using apache spark, KDD '15: The 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.2323-2324, 2015.
Sensitivity analysis of deep neural networks, AAAI'19: The 33rd AAAI Conference on Artificial Intelligence, pp.4943-4950, 2019.
Branchynet: Fast inference via early exiting from deep neural networks, ICPR'16: The 23rd International Conference on Pattern Recognition, pp.2464-2469, 2016.
Towards portable online prediction of network utilization using MPI-level monitoring, EuroPar'19: 25th International European Conference on Parallel and Distributed Systems, pp.1-14, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02184204
Candle/supervisor: A workflow framework for machine learning applied to cancer research, BMC Bioinformatics, vol.19, issue.491, 2018.
Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, NSDI'12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pp.2-2, 2012.
Spark: Cluster computing with working sets, HotCloud'10: The 2nd USENIX Conference on Hot Topics in Cloud Computing, pp.10-10, 2010.
Deep residual reinforcement learning, AAMAS '20: The 19th International Conference on Autonomous Agents and MultiAgent Systems, pp.1611-1619, 2020.