D. Andresen, W. Hsu, H. Yang, and A. Okanlawon, Machine learning for predictive analytics of compute cluster jobs, 2018.

J. Ansel, K. Arya, and G. Cooperman, DMTCP: Transparent checkpointing for cluster computations and the desktop, 2009 IEEE International Symposium on Parallel & Distributed Processing, pp.1-12, 2009.

G. Aupy, A. Gainaru, V. Honore, P. Raghavan, Y. Robert et al., Reservation Strategies for Stochastic Jobs, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.166-175, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01968419

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11, pp.1-12, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00721216

J. Breitbart, S. Pickartz, S. Lankes, J. Weidendorfer, and A. Monti, Dynamic Co-Scheduling Driven by Main Memory Bandwidth Utilization, 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp.400-409, 2017.

J. Bruno, P. Downey, and G. N. Frederickson, Sequencing Tasks with Exponential Service Times to Minimize the Expected Flow Time or Makespan, Journal of the ACM, vol.28, issue.1, pp.100-113, 1981.

Y. Chen, Checkpoint and Restore of Micro-service in Docker Containers, Proceedings of the 3rd International Conference on Mechatronics and Industrial Informatics, 2015.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.

J. Dean and S. Ghemawat, MapReduce, Communications of the ACM, vol.51, issue.1, pp.107-113, 2008.

A. Gainaru, G. Aupy, A. Benoit, F. Cappello, Y. Robert et al., Scheduling the I/O of HPC Applications Under Congestion, 2015 IEEE International Parallel and Distributed Processing Symposium, pp.1013-1022, 2015.
URL : https://hal.archives-ouvertes.fr/hal-00983789

A. Gainaru, B. Goglin, V. Honore, G. Pallez-Aupy, P. Raghavan et al., Reservation and Checkpointing Strategies for Stochastic Jobs, 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020.
URL : https://hal.archives-ouvertes.fr/hal-02448393

A. Gainaru, G. P. Aupy, H. Sun, and P. Raghavan, Speculative Scheduling for Stochastic HPC Applications, Proceedings of the 48th International Conference on Parallel Processing, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02158598

A. Gainaru, H. Sun, G. Aupy, Y. Huo, B. A. Landman et al., On-the-fly scheduling versus reservation-based scheduling for unpredictable workflows, The International Journal of High Performance Computing Applications, vol.33, issue.6, pp.1140-1158, 2019.

R. Garg, A. Mohan, M. Sullivan, and G. Cooperman, CRUM: Checkpoint-Restart Support for CUDA's Unified Memory, 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp.302-313, 2018.

E. Gaussier, J. Lelong, V. Reis, and D. Trystram, Online Tuning of EASY-Backfilling using Queue Reordering Policies, IEEE Transactions on Parallel and Distributed Systems, vol.29, issue.10, pp.2304-2316, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963216

A. Goel and P. Indyk, Stochastic load balancing and related problems, 40th Annual Symposium on Foundations of Computer Science, pp.579-586, 1999.

P. H. Hargrove and J. C. Duell, Berkeley lab checkpoint/restart (BLCR) for Linux clusters, Journal of Physics: Conference Series, vol.46, pp.494-499, 2006.

J. V. Haxby, J. S. Guntupalli, A. C. Connolly, Y. O. Halchenko, B. R. Conroy et al., A Common, High-Dimensional Model of the Representational Space in Human Ventral Temporal Cortex, Neuron, vol.72, issue.2, pp.404-416, 2011.

B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph et al., Mesos: A platform for fine-grained resource sharing in the data center, 8th USENIX Conf. Networked Systems Design and Implementation, pp.295-308, 2011.

Y. Huo, A. Carass, S. M. Resnick, D. L. Pham, J. L. Prince et al., Combining multi-atlas segmentation with brain surface estimation, Medical Imaging 2016: Image Processing, vol.9784, p.97840, 2016.

Y. Huo, A. J. Plassard, A. Carass, S. M. Resnick, D. L. Pham et al., Consistent cortical reconstruction and multi-atlas brain segmentation, NeuroImage, vol.138, pp.197-210, 2016.

Y. Huo, Z. Xu, K. Aboud, P. Parvathaneni, S. Bao et al., Spatially Localized Atlas Network Tiles Enables 3D Whole Brain Segmentation from Limited Data, Medical Image Computing and Computer Assisted Intervention - MICCAI 2018, pp.698-705, 2018.

Y. Huo, Z. Xu, Y. Xiong, K. Aboud, P. Parvathaneni et al., 3D whole brain segmentation using spatially localized atlas network tiles, NeuroImage, vol.194, pp.105-119, 2019.

T. Hérault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, 2015.

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad, Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 - EuroSys '07, 2007.

J. Kleinberg, Y. Rabani, and É. Tardos, Allocating bandwidth for bursty connections, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing - STOC '97, pp.664-673, 1997.

R. Kumar and S. Vadhiyar, Identifying Quick Starters: Towards an Integrated Framework for Efficient Predictions of Queue Waiting Times of Batch Parallel Jobs, Job Scheduling Strategies for Parallel Processing, pp.196-215, 2013.

P. J. Lamontagne, T. L. Benzinger, J. C. Morris, S. Keefe, R. Hornbeck et al., OASIS-3: Longitudinal Neuroimaging, Clinical, and Cognitive Dataset for Normal Aging and Alzheimer Disease, 2019.

B. Landman, Medical-image Analysis and Statistical Interpretation (MASI) Lab

S. Li, T. Ben-Nun, S. Di Girolamo, D. Alistarh, and T. Hoefler, Taming unbalanced training workloads in deep learning with partial collective operations, Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.45-61, 2020.

D. A. Lifka, The ANL/IBM SP scheduling system, Job Scheduling Strategies for Parallel Processing, pp.295-303, 1995.

A. Matsunaga and J. A. Fortes, On the Use of Machine Learning to Predict the Time and Resources Consumed by Applications, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.495-504, 2010.

A. Merzky, M. Turilli, M. Maldonado, M. Santcroos, and S. Jha, Using Pilot Systems to Execute Many Task Workloads on Supercomputers, Job Scheduling Strategies for Parallel Processing, pp.61-82, 2019.

A. Mirkin, A. Kuznetsov, and K. Kolyshkin, Containers checkpointing and live migration, Ottawa Linux Symposium, 2008.

R. H. Möhring, A. S. Schulz, and M. Uetz, Approximation in stochastic scheduling, Journal of the ACM, vol.46, issue.6, pp.924-942, 1999.

A. W. Mu'alem and D. G. Feitelson, Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling, IEEE Transactions on Parallel and Distributed Systems, vol.12, issue.6, pp.529-543, 2001.

J. Niño-Mora, Stochastic scheduling, Encyclopedia of Optimization, pp.3818-3824, 2009.

T. Patki, D. K. Lowenthal, B. Rountree, M. Schulz, and B. R. de Supinski, Exploring hardware overprovisioning in power-constrained, high performance computing, Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13, pp.173-182, 2013.

T. Patki, J. J. Thiagarajan, A. Ayala, and T. Z. Islam, Performance optimality or reproducibility, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019.

S. Pickartz, N. Eiling, S. Lankes, L. Razik, and A. Monti, Migrating LinuX Containers Using CRIU, Lecture Notes in Computer Science, pp.674-684, 2016.

B. Pourghassemi and A. Chandramowlishwaran, cudaCR: An In-Kernel Application-Level Checkpoint/Restart Scheme for CUDA-Enabled GPUs, 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp.725-732, 2017.

M. Rodríguez, J. Moríñigo, and R. Mayo-García, When you have a hammer, everything looks like a nail - Checkpoint/restart in Slurm, 2017.

J. Skovira, W. Chan, H. Zhou, and D. A. Lifka, The EASY - LoadLeveler API project, Job Scheduling Strategies for Parallel Processing, pp.41-47, 1996.

M. Tanash, B. Dunn, D. Andresen, W. Hsu, H. Yang et al., Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning, Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), 2019.

Environmental assessment for operations, upgrades, and modifications in SNL/NM Technical Area IV, SNL-NM, 1996.

V. K. Vavilapalli, S. Seth, B. Saha, C. Curino, O. O'Malley et al., Apache Hadoop YARN, Proceedings of the 4th annual Symposium on Cloud Computing - SOCC '13, vol.5, p.16, 2013.

K. Wolter, Checkpointing Systems, Stochastic Models for Fault Tolerance, pp.171-176, 2010.

L. T. Yang, X. Ma, and F. Mueller, Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution, ACM/IEEE SC 2005 Conference (SC'05), p.40, 2005.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.

S. Zrigui, R. de Camargo, D. Trystram, and A. Legrand, Improving the performance of batch schedulers using online job size classification, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02334116