The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009. ,
DOI : 10.1177/1094342009347714
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proc. of the ACM, pp.1-11, 2010. ,
When is multi-version checkpointing needed, " in 3rd Workshop for Fault-tolerance at Extreme Scale (FTXS), 2013. ,
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004. ,
DOI : 10.1016/j.future.2004.11.016
Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009. ,
DOI : 10.1177/1094342009347767
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063443
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962. ,
DOI : 10.1147/rd.62.0200
A variational calculus approach to optimal checkpoint placement, IEEE Trans. on computers, pp.699-708, 2001. ,
Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006. ,
DOI : 10.1109/TDSC.2006.22
A Flexible Checkpoint/Restart Model in Distributed Systems, PPAM, ser. LNCS, pp.206-215, 2010. ,
DOI : 10.1007/978-3-642-14390-8_22
URL : https://hal.archives-ouvertes.fr/hal-00788926
On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984. ,
DOI : 10.1137/0213039
Complexity Analysis of Checkpoint Scheduling with Variable Costs, IEEE Transactions on Computers, vol.62, issue.6, 2012. ,
DOI : 10.1109/TC.2012.57
URL : https://hal.archives-ouvertes.fr/hal-00788101
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, vol.61, issue.11, p.1590, 2001. ,
DOI : 10.1006/jpdc.2001.1757
Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, pp.525-534, 2010. ,
DOI : 10.1109/ICPP.2010.80
Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005. ,
DOI : 10.1109/DSN.2005.67
Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007. ,
DOI : 10.1109/MSST.2007.4367962
Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009. ,
DOI : 10.1109/CLUSTR.2009.5289177
PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS, Parallel Processing Letters, vol.21, issue.02, pp.111-132, 2011. ,
DOI : 10.1142/S0129626411000126
URL : https://hal.archives-ouvertes.fr/hal-00945068
Cosmic rays don't strike twice, ACM SIGARCH Computer Architecture News, vol.40, issue.1, pp.111-122, 2012. ,
DOI : 10.1145/2189750.2150989
Soft error vulnerability of iterative linear algebra methods, Proceedings of the 22nd annual international conference on Supercomputing , ICS '08, pp.155-164, 2008. ,
DOI : 10.1145/1375527.1375552
Fault-tolerant iterative methods via selective reliability, Sandia National Laboratories, 2011. ,
Characterizing the impact of soft errors on iterative methods in scientific computing, Proceedings of the international conference on Supercomputing, ICS '11, pp.152-161, 2011. ,
DOI : 10.1145/1995896.1995922
Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984. ,
Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009. ,
DOI : 10.1016/j.jpdc.2008.12.002
Detection and correction of silent data corruption for largescale high-performance computing, Proc. of the ACM, 2012. ,
Inovallée 655 avenue de l'Europe Montbonnot 38334 Saint Ismier Cedex Publisher Inria Domaine de Voluceau -Rocquencourt BP 105 -78153 Le Chesnay Cedex inria, pp.249-6399 ,