Machine learning for predictive analytics of compute cluster jobs, 2018.
DMTCP: Transparent checkpointing for cluster computations and the desktop, 2009 IEEE International Symposium on Parallel & Distributed Processing, pp.1-12, 2009.
Reservation Strategies for Stochastic Jobs, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.166-175, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01968419
FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11, pp.1-12, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00721216
Dynamic Co-Scheduling Driven by Main Memory Bandwidth Utilization, 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp.400-409, 2017.
Sequencing Tasks with Exponential Service Times to Minimize the Expected Flow Time or Makespan, Journal of the ACM, vol.28, issue.1, pp.100-113, 1981.
Checkpoint and Restore of Micro-service in Docker Containers, Proceedings of the 3rd International Conference on Mechatronics and Industrial Informatics, 2015.
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
MapReduce, Communications of the ACM, vol.51, issue.1, pp.107-113, 2008.
Scheduling the I/O of HPC Applications Under Congestion, 2015 IEEE International Parallel and Distributed Processing Symposium, pp.1013-1022, 2015.
URL : https://hal.archives-ouvertes.fr/hal-00983789
Reservation and Checkpointing Strategies for Stochastic Jobs, 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020.
URL : https://hal.archives-ouvertes.fr/hal-02448393
Speculative Scheduling for Stochastic HPC Applications, Proceedings of the 48th International Conference on Parallel Processing, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02158598
On-the-fly scheduling versus reservation-based scheduling for unpredictable workflows, The International Journal of High Performance Computing Applications, vol.33, issue.6, pp.1140-1158, 2019.
CRUM: Checkpoint-Restart Support for CUDA's Unified Memory, 2018 IEEE International Conference on Cluster Computing (CLUSTER), pp.302-313, 2018.
Online Tuning of EASY-Backfilling using Queue Reordering Policies, IEEE Transactions on Parallel and Distributed Systems, vol.29, issue.10, pp.2304-2316, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963216
Stochastic load balancing and related problems, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039), pp.579-586.
Berkeley lab checkpoint/restart (BLCR) for Linux clusters, Journal of Physics: Conference Series, vol.46, pp.494-499, 2006.
A Common, High-Dimensional Model of the Representational Space in Human Ventral Temporal Cortex, Neuron, vol.72, issue.2, pp.404-416, 2011.
Mesos: A platform for fine-grained resource sharing in the data center, 8th USENIX Conf. Networked Systems Design and Implementation, pp.295-308, 2011.
Combining multi-atlas segmentation with brain surface estimation, Medical Imaging 2016: Image Processing, vol.9784, p.97840, 2016.
Consistent cortical reconstruction and multi-atlas brain segmentation, NeuroImage, vol.138, pp.197-210, 2016.
Spatially Localized Atlas Network Tiles Enables 3D Whole Brain Segmentation from Limited Data, Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, pp.698-705, 2018.
3D whole brain segmentation using spatially localized atlas network tiles, NeuroImage, vol.194, pp.105-119, 2019.
Fault-Tolerance Techniques for High-Performance Computing, 2015.
Dryad, Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 - EuroSys '07, 2007.
Allocating bandwidth for bursty connections, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing - STOC '97, pp.664-673, 1997.
Identifying Quick Starters: Towards an Integrated Framework for Efficient Predictions of Queue Waiting Times of Batch Parallel Jobs, Job Scheduling Strategies for Parallel Processing, pp.196-215, 2013.
OASIS-3: Longitudinal Neuroimaging, Clinical, and Cognitive Dataset for Normal Aging and Alzheimer Disease, 2019.
Medical-image Analysis and Statistical Interpretation (MASI) Lab.
Taming unbalanced training workloads in deep learning with partial collective operations, Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp.45-61, 2020.
The ANL/IBM SP scheduling system, Job Scheduling Strategies for Parallel Processing, pp.295-303, 1995.
On the Use of Machine Learning to Predict the Time and Resources Consumed by Applications, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.495-504, 2010.
Using Pilot Systems to Execute Many Task Workloads on Supercomputers, Job Scheduling Strategies for Parallel Processing, pp.61-82, 2019.
Containers checkpointing and live migration, Ottawa Linux Symposium, 2008.
Approximation in stochastic scheduling, Journal of the ACM, vol.46, issue.6, pp.924-942, 1999.
Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling, IEEE Transactions on Parallel and Distributed Systems, vol.12, issue.6, pp.529-543, 2001.
Stochastic scheduling, Encyclopedia of Optimization, pp.3818-3824, 2009.
Exploring hardware overprovisioning in power-constrained, high performance computing, Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13, pp.173-182, 2013.
Performance optimality or reproducibility, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019.
Migrating LinuX Containers Using CRIU, Lecture Notes in Computer Science, pp.674-684, 2016.
cudaCR: An In-Kernel Application-Level Checkpoint/Restart Scheme for CUDA-Enabled GPUs, 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp.725-732, 2017.
When you have a hammer, everything looks like a nail - Checkpoint/restart in Slurm, 2017.
The EASY - LoadLeveler API project, Job Scheduling Strategies for Parallel Processing, pp.41-47, 1996.
Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning, Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), 2019.
Environmental assessment for operations, upgrades, and modifications in SNL/NM Technical Area IV, SNL-NM, 1996.
Apache Hadoop YARN, Proceedings of the 4th annual Symposium on Cloud Computing - SOCC '13, vol.5, p.16, 2013.
Checkpointing Systems, Stochastic Models for Fault Tolerance, pp.171-176, 2010.
Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution, ACM/IEEE SC 2005 Conference (SC'05), pp.40-40.
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
Improving the performance of batch schedulers using online job size classification, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02334116