The validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings, vol.30, pp.483-485, 1967. ,
Unprotected computing: A large-scale study of dram raw error rate on a supercomputer, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, vol.55, pp.1-55, 2016. ,
Identifying the right replication level to detect and correct silent errors at scale, FTXS'2017, the Workshop on Fault-Tolerance for HPC at Extreme Scale, in conjunction with HPDC'2017, 2017. ,
URL : https://hal.archives-ouvertes.fr/hal-02082907
Combining checkpointing and replication for reliable execution of linear workflows, 2018. ,
URL : https://hal.archives-ouvertes.fr/hal-01963655
Assessing general-purpose algorithms to cope with fail-stop and silent errors, ACM Trans. Parallel Comput, vol.3, issue.2, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01066664
Optimal resilience patterns to cope with fail-stop and silent errors, 2016. ,
URL : https://hal.archives-ouvertes.fr/hal-01215857
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00696154
Qualitative Performance Analysis for Large-Scale Scientific Workflows, 2008. ,
Toward Exascale Resilience, Int. J. High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009. ,
Toward Exascale Resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014. ,
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Gen. Comp. Syst, vol.51, pp.7-19, 2015. ,
URL : https://hal.archives-ouvertes.fr/hal-01199752
Distributed snapshots: Determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985. ,
Programming models and development software for a space-based many-core processor, 4th Int. Conf. on Space Mission Challenges for Information Technology, pp.95-102, 2011. ,
DOI : 10.1109/smc-it.2011.29
Scientific workflows and provenance: Introduction and research opportunities, Datenbank-Spektrum, vol.12, issue.3, pp.193-203, 2012. ,
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006. ,
Toward an optimal online checkpoint solution under a two-level HPC checkpoint model, IEEE Trans. Parallel & Distributed Systems, 2016. ,
DOI : 10.1109/tpds.2016.2546248
URL : https://hal.archives-ouvertes.fr/hal-01263879
Combining partial redundancy and checkpointing for HPC, ICDCS, 2012. ,
DOI : 10.1109/icdcs.2012.56
URL : http://moss.csc.ncsu.edu/%7Emueller/ftp/pub/mueller/papers/icdcs12.pdf
Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Trans. Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004. ,
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, pp.375-408, 2002. ,
The case for modular redundancy in large-scale high performance computing systems, 2009. ,
Redundant execution of HPC applications with MR-MPI, PDCN. IASTED, 2011. ,
Evaluating the viability of process replication reliability for exascale systems, SC'11, 2011. ,
ADFT: An adaptive framework for fault tolerance on large scale systems using application malleability, Procedia Computer Science, vol.9, pp.166-175, 2012. ,
DOI : 10.1016/j.procs.2012.04.018
Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015. ,
Fault tolerance and recovery of scientific workflows on computational grids, Proc. of CCGrid, pp.777-782, 2008. ,
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009. ,
Top ten exascale research challenges, DOE ASCAC subcommittee report, pp.1-86, 2014. ,
DOI : 10.2172/1222713
The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, issue.2, pp.200-209, 1962. ,
Meta-Algorithms for Scheduling a Chain of Coarse-Grained Tasks on an Array of Reconfigurable FPGAs, 2013. ,
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010. ,
ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, 2013. ,
The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, issue.4, pp.553-557, 1994. ,
Diskless checkpointing, IEEE Trans. Parallel Dist. Systems, vol.9, issue.10, pp.972-986, 1998. ,
Supporting highly-decoupled thread-level redundancy for parallel programs, Proc. HPCA'2008, pp.393-404, 2008. ,
Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007. ,
Using two-level stable storage for efficient checkpointing, IEE Proceedings-Software, vol.145, issue.6, pp.198-202, 1998. ,
Does partial replication pay off?, FTXS, 2012. ,
Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015. ,
Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications, Proc. CCGrid'2017, pp.452-457, 2017. ,
Workflow Systems for Science: Concepts and Tools, ISRN Software Engineering, 2013. ,
Top500, Top500 Supercomputer Sites. ,
On the optimum checkpoint selection problem, SIAM J. Comput, vol.13, issue.3, pp.630-649, 1984. ,
A case for two-level distributed recovery schemes, SIGMETRICS Perform. Eval. Rev, vol.23, issue.1, pp.64-73, 1995. ,
Using replication and checkpointing for reliable task management in computational grids, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00788867
A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011. ,
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE Int. Conf. on Cluster Computing, pp.93-103, 2004. ,
Reliability-aware scalability models for high performance computing, Cluster Computing, 2009. ,
Reliability-aware speedup models for parallel applications with coordinated checkpointing/restart, IEEE Trans. Computers, vol.64, issue.5, pp.1402-1415, 2015. ,
Accelerated testing for cosmic soft-error rate, IBM J. Res. Dev, vol.40, issue.1, pp.51-72, 1996. ,
Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998. ,
IBM experiments in soft fails in computer electronics, IBM J. Res. Dev, vol.40, issue.1, pp.3-18, 1996. ,
DOI : 10.1147/rd.401.0003