The validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings, vol.30, pp.483-485, 1967.
Identifying the right replication level to detect and correct silent errors at scale, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02082907
Combining checkpointing and replication for reliable execution of linear workflows, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963655
Optimal resilience patterns to cope with fail-stop and silent errors, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01215857
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00696154
Qualitative Performance Analysis for Large-Scale Scientific Workflows, 2008.
Toward Exascale Resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014.
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Gen. Comp. Syst, vol.51, pp.7-19, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01199752
Distributed snapshots: Determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
Programming models and development software for a space-based many-core processor, 4th Int. Conf. on Space Mission Challenges for Information Technology, pp.95-102, 2011.
Scientific workflows and provenance: Introduction and research opportunities, Datenbank-Spektrum, vol.12, issue.3, pp.193-203, 2012.
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006.
Toward an optimal online checkpoint solution under a two-level HPC checkpoint model, IEEE Trans. Parallel & Distributed Systems, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01263879
Combining partial redundancy and checkpointing for HPC, ICDCS, 2012.
Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Trans. Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, pp.375-408, 2002.
The case for modular redundancy in large-scale high performance computing systems, 2009.
Redundant execution of HPC applications with MR-MPI, PDCN. IASTED, 2011.
Evaluating the viability of process replication reliability for exascale systems, SC'11, 2011.
ADFT: An adaptive framework for fault tolerance on large scale systems using application malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.
Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.
Fault tolerance and recovery of scientific workflows on computational grids, Proc. of CCGrid, pp.777-782, 2008.
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.
Top ten exascale research challenges, DOE ASCAC subcommittee report, pp.1-86, 2014.
The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200
Meta-Algorithms for Scheduling a Chain of Coarse-Grained Tasks on an Array of Reconfigurable FPGAs, VLSI Design, 2013.
ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, 2013.
Diskless checkpointing, IEEE Trans. Parallel Dist. Systems, vol.9, issue.10, pp.972-986, 1998.
Supporting highly-decoupled thread-level redundancy for parallel programs, Proc. HPCA'2008, pp.393-404, 2008.
Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
Using two-level stable storage for efficient checkpointing, IEE Proceedings-Software, vol.145, issue.6, pp.198-202, 1998.
Does partial replication pay off? In FTXS, 2012.
Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015.
Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications, Proc. CCGrid'2017, pp.452-457, 2017.
Workflow Systems for Science: Concepts and Tools, ISRN Software Engineering, 2013.
On the optimum checkpoint selection problem, SIAM J. Comput, vol.13, issue.3, pp.630-649, 1984.
A case for two-level distributed recovery schemes, SIGMETRICS Perform. Eval. Rev, vol.23, issue.1, pp.64-73, 1995.
Using replication and checkpointing for reliable task management in computational grids, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788867
A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.
Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011.
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE Int. Conf. on Cluster Computing, pp.93-103, 2004.
Reliability-aware scalability models for high performance computing, Cluster Computing, 2009.
Reliability-aware speedup models for parallel applications with coordinated checkpointing/restart, IEEE Trans. Computers, vol.64, issue.5, pp.1402-1415, 2015.