The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009. ,
DOI : 10.1177/1094342009347714
Exascale software study: Software challenges in extreme scale systems, 2009. ,
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004. ,
DOI : 10.1016/j.future.2004.11.016
Exascale Computing Technology Challenges, VECPAR'10, the 9th Int. Conf. High Performance Computing for Computational Science, ser, pp.1-25, 2011. ,
DOI : 10.1109/MM.2009.5
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.185.3897
Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012. ,
DOI : 10.1109/SBAC-PAD.2012.12
Distributed snapshots: determining global states of distributed systems, Transactions on Computer Systems, pp.63-75, 1985. ,
DOI : 10.1145/214451.214456
Unified model for assessing checkpointing protocols at extreme-scale Concurrency and Computation: Practice and Experience, 2013. ,
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063443
A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), p.2012 ,
DOI : 10.1109/DSNW.2012.6264677
PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS, Parallel Processing Letters, vol.21, issue.02, pp.111-132, 2011. ,
DOI : 10.1142/S0129626411000126
URL : https://hal.archives-ouvertes.fr/hal-00945068
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Proc. 2004 IEEE Int. Conf. Cluster Computing, 2004. ,
Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, 2012. ,
DOI : 10.1109/CLUSTER.2012.82
Revisiting the Double Checkpointing Algorithm, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, 2013. ,
DOI : 10.1109/IPDPSW.2013.11
URL : https://hal.archives-ouvertes.fr/hal-00925168
A 1 PB/s file system to checkpoint three million MPI tasks, Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC '13, pp.143-154, 2013. ,
DOI : 10.1145/2493123.2462908
Inovallée 655 avenue de l'Europe Montbonnot 38334 Saint Ismier Cedex Publisher Inria Domaine de Voluceau -Rocquencourt BP 105 -78153 Le Chesnay Cedex inria, pp.249-6399 ,