Google protocol buffer format

D. Andresen, W. Hsu, H. Yang, and A. Okanlawon, Machine learning for predictive analytics of compute cluster jobs, 2018.

J. Ansel, K. Arya, and G. Cooperman, DMTCP: Transparent checkpointing for cluster computations and the desktop, 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS'09), pp.1-12, 2009.

G. Aupy, A. Gainaru, V. Honoré, P. Raghavan, Y. Robert et al., Reservation Strategies for Stochastic Jobs, IPDPS 2019 - 33rd IEEE International Parallel and Distributed Processing Symposium, pp.166-175, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01968419

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High performance fault tolerance interface for hybrid systems, SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-12, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00721216

J. Breitbart, S. Pickartz, S. Lankes, J. Weidendorfer, and A. Monti, Dynamic co-scheduling driven by main memory bandwidth utilization, 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp.400-409, 2017.

J. Bruno, P. Downey, and G. N. Frederickson, Sequencing tasks with exponential service times to minimize the expected flow time or makespan, Journal of the ACM, vol.28, issue.1, pp.100-113, 1981.

Y. Chen, Checkpoint and Restore of Micro-service in Docker Containers, 3rd International Conference on Mechatronics and Industrial Informatics, 2015.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst., vol.22, issue.3, pp.303-312, 2006.

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Commun. ACM, vol.51, issue.1, pp.107-113, 2008.

A. Gainaru, G. Aupy, A. Benoit, F. Cappello, Y. Robert et al., Scheduling the I/O of HPC applications under congestion, 2015 IEEE International Parallel and Distributed Processing Symposium, pp.1013-1022, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01251938

A. Gainaru, B. Goglin, V. Honoré, G. Pallez, P. Raghavan et al., Reservation and Checkpointing Strategies for Stochastic Jobs, IPDPS 2020 - 34th IEEE International Parallel and Distributed Processing Symposium, 2020.
URL : https://hal.archives-ouvertes.fr/hal-02448393

A. Gainaru, G. Pallez, H. Sun, and P. Raghavan, Speculative scheduling for stochastic HPC applications, ICPP, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02158598

A. Gainaru, H. Sun, G. Aupy, Y. Huo, B. A. Landman et al., On-the-fly scheduling versus reservation-based scheduling for unpredictable workflows, Int. J. High Perf. Computing Applications, 2019.

R. Garg, A. Mohan, M. Sullivan, and G. Cooperman, CRUM: Checkpoint-restart support for CUDA's unified memory, 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp.302-313, 2018.

E. Gaussier, J. Lelong, V. Reis, and D. Trystram, Online tuning of EASY-backfilling using queue reordering policies, IEEE Transactions on Parallel and Distributed Systems, vol.29, issue.10, pp.2304-2316, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963216

A. Goel and P. Indyk, Stochastic load balancing and related problems, FOCS, pp.579-586, 1999.

P. H. Hargrove and J. C. Duell, Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters, Journal of Physics: Conference Series, vol.46, 2006.

J. Haxby, J. S. Guntupalli, A. Connolly, Y. Halchenko, B. Conroy et al., A common, high-dimensional model of the representational space in human ventral temporal cortex, Neuron, vol.72, pp.404-420, 2011.

B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph et al., Mesos: A platform for fine-grained resource sharing in the data center, 8th USENIX Conf. Networked Systems Design and Implementation, pp.295-308, 2011.

Y. Huo, A. Carass, S. M. Resnick, D. L. Pham, J. L. Prince et al., Combining multi-atlas segmentation with brain surface estimation, Medical Imaging 2016: Image Processing, vol.9784, p.97840, 2016.

Y. Huo, A. J. Plassard, A. Carass, S. M. Resnick, D. L. Pham et al., Consistent cortical reconstruction and multi-atlas brain segmentation, NeuroImage, vol.138, pp.197-210, 2016.

Y. Huo, Z. Xu, K. Aboud, P. Parvathaneni, S. Bao et al., Spatially localized atlas network tiles enables 3D whole brain segmentation from limited data, Medical Image Computing and Computer Assisted Intervention, pp.698-705, 2018.

Y. Huo, Z. Xu, Y. Xiong, K. Aboud, P. Parvathaneni et al., 3D whole brain segmentation using spatially localized atlas network tiles, NeuroImage, vol.194, pp.105-119, 2019.

T. Hérault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, 2015.

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad: Distributed data-parallel programs from sequential building blocks, 2nd ACM SIGOPS/EuroSys European Conf. Computer Systems, 2007.

J. Kleinberg, Y. Rabani, and E. Tardos, Allocating bandwidth for bursty connections, STOC, pp.664-673, 1997.

R. Kumar and S. Vadhiyar, Identifying quick starters: Towards an integrated framework for efficient predictions of queue waiting times of batch parallel jobs, Job Scheduling Strategies for Parallel Processing, pp.196-215, 2013.

P. J. LaMontagne, T. L. Benzinger, J. C. Morris, S. Keefe, R. Hornbeck et al., OASIS-3: Longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and Alzheimer disease, medRxiv, 2019.

B. Landman, Medical-image Analysis and Statistical Interpretation (MASI) Lab

S. Li, T. Ben-Nun, S. Di Girolamo, D. Alistarh, and T. Hoefler, Taming unbalanced training workloads in deep learning with partial collective operations, Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.45-61, 2020.

D. A. Lifka, The ANL/IBM SP Scheduling System, JSSPP, pp.295-303, 1995.

A. Matsunaga and J. A. Fortes, On the use of machine learning to predict the time and resources consumed by applications, 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.495-504, 2010.

A. Merzky, M. Santcroos, M. Turilli, and S. Jha, Radical-pilot: Scalable execution of heterogeneous and dynamic workloads on supercomputers, 2015.

A. Mirkin, A. Kuznetsov, and K. Kolyshkin, Containers checkpointing and live migration, Ottawa Linux Symposium, 2008.

R. H. Möhring, A. S. Schulz, and M. Uetz, Approximation in stochastic scheduling: The power of LP-based priority policies, Journal of the ACM, vol.46, issue.6, pp.924-942, 1999.

A. W. Mu'alem and D. G. Feitelson, Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling, IEEE Trans. Parallel Distrib. Syst., vol.12, issue.6, pp.529-543, 2001.

J. Niño-Mora, Stochastic scheduling, Encyclopedia of Optimization, pp.3818-3824, 2009.

T. Patki, D. K. Lowenthal, B. Rountree, M. Schulz, and B. R. de Supinski, Exploring hardware overprovisioning in power-constrained, high performance computing, Proceedings of the 27th international ACM conference on International conference on supercomputing, pp.173-182, 2013.

T. Patki, J. J. Thiagarajan, A. Ayala, and T. Z. Islam, Performance optimality or reproducibility: That is the question, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19, 2019.

S. Pickartz, N. Eiling, S. Lankes, L. Razik, and A. Monti, Migrating Linux containers using CRIU, High Performance Computing, pp.674-684, 2016.

B. Pourghassemi and A. Chandramowlishwaran, cudaCR: An in-kernel application-level checkpoint/restart scheme for CUDA-enabled GPUs, 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp.725-732, 2017.

M. Rodríguez, J. Moríñigo, and R. Mayo-García, When you have a hammer, everything looks like a nail - Checkpoint/restart in Slurm, 2017.

J. Skovira, W. Chan, H. Zhou, and D. A. Lifka, The EASY - LoadLeveler API Project, JSSPP, pp.41-47, 1996.

M. Tanash, B. Dunn, D. Andresen, W. Hsu, H. Yang et al., Improving HPC system performance by predicting job resources via supervised machine learning, Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning), PEARC '19, 2019.

C. T. Vaughan and S. D. Hammond, Evaluating production load balancing functions for adaptive mesh schemes using mini-applications, Sandia National Laboratories (SNL-NM), 2017.

V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar et al., Apache Hadoop YARN: Yet another resource negotiator, 4th Annual Symposium on Cloud Computing, vol.5, p.16, 2013.

Stochastic Models for Fault Tolerance, Restart, Rejuvenation, and Checkpointing, 2010.

L. T. Yang, X. Ma, and F. Mueller, Cross-platform performance prediction of parallel applications using partial execution, SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, pp.40-40, 2005.

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. ACM, vol.17, issue.9, pp.530-531, 1974.

S. Zrigui, R. de Camargo, D. Trystram, and A. Legrand, Improving the performance of batch schedulers using online job size classification, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02334116
