Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, p.12022, 2007. ,
DOI : 10.1088/1742-6596/78/1/012022
A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.2012-2013 ,
DOI : 10.1109/DSNW.2012.6264677
The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.398-407, 2010. ,
DOI : 10.1109/CCGRID.2010.71
URL : https://hal.archives-ouvertes.fr/inria-00433523
The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009. ,
DOI : 10.1177/1094342009347714
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.441-4412, 2011. ,
DOI : 10.1145/2063384.2063443
Unified model for assessing checkpointing protocols at extreme-scale Concurrency and Computation: Practice and Experience, p.3173, 2013. ,
Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984. ,
Characterizing the impact of soft errors on iterative methods in scientific computing, Proceedings of the international conference on Supercomputing, ICS '11, pp.152-161, 2011. ,
DOI : 10.1145/1995896.1995922
Algorithm-based fault tolerance for dense matrix factorizations, pp.225-234, 2012. ,
High performance linpack benchmark, Proceedings of the international conference on Supercomputing, ICS '11, pp.162-171, 2011. ,
DOI : 10.1145/1995896.1995923
Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012. ,
DOI : 10.1145/2304576.2304588
Towards resilient parallel linear Krylov solvers: recover-restart strategies, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00843992
Algorithm-based recovery for iterative methods without checkpointing, Proceedings of the 20th international symposium on High performance distributed computing, HPDC '11, pp.73-84, 2011. ,
DOI : 10.1145/1996130.1996142
Distributed snapshots: determining global states of distributed systems, Transactions on Computer Systems, pp.63-75, 1985. ,
DOI : 10.1145/214451.214456
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002. ,
DOI : 10.1145/568522.568525
Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998. ,
DOI : 10.1109/71.730527
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
DOI : 10.1016/j.jpdc.2008.12.002
Fault tolerant high performance computing by a coding approach, Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming , PPoPP '05, pp.213-223, 2005. ,
DOI : 10.1145/1065944.1065973
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004. ,
DOI : 10.1016/j.future.2004.11.016
Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006. ,
DOI : 10.1109/TDSC.2006.22
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.1-11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, pp.1-9, 2009. ,
DOI : 10.1109/CLUSTR.2009.5289177
Assessing the Impact of ABFT and Checkpoint Composite Strategies, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2013. ,
DOI : 10.1109/IPDPSW.2014.79
URL : https://hal.archives-ouvertes.fr/hal-01354689
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Cluster Computing IEEE International Conference on, pp.93-103, 2004. ,
Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, pp.364-372, 2012. ,
DOI : 10.1109/CLUSTER.2012.82
Revisiting the Double Checkpointing Algorithm, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, pp.706-715, 2013. ,
DOI : 10.1109/IPDPSW.2013.11
URL : https://hal.archives-ouvertes.fr/hal-00925168
A 1 PB/s file system to checkpoint three million MPI tasks, Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC '13, pp.143-154, 2013. ,
DOI : 10.1145/2493123.2462908