G. Amdahl, The validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings, vol.30, pp.483-485, 1967.

A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert et al., Identifying the right replication level to detect and correct silent errors at scale, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02082907

A. Benoit, A. Cavelan, F. Ciorba, V. Le Fèvre, and Y. Robert, Combining checkpointing and replication for reliable execution of linear workflows, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963655

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal resilience patterns to cope with fail-stop and silent errors, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01215857

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00696154

E. S. Buneci, Qualitative Performance Analysis for Large-Scale Scientific Workflows, 2008.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014.

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Gen. Comp. Syst, vol.51, pp.7-19, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01199752

K. M. Chandy and L. Lamport, Distributed snapshots: Determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.

S. P. Crago, D. I. Kang, M. Kang, R. Kost, K. Singh et al., Programming models and development software for a space-based many-core processor, 4th Int. Conf. on Space Mission Challenges for Information Technology, pp.95-102, 2011.

V. Cuevas-Vicenttín, S. C. Dey, S. Köhler, S. Riddle, and B. Ludäscher, Scientific workflows and provenance: Introduction and research opportunities, Datenbank-Spektrum, vol.12, issue.3, pp.193-203, 2012.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006.

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an optimal online checkpoint solution under a two-level HPC checkpoint model, IEEE Trans. Parallel & Distributed Systems, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01263879

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining partial redundancy and checkpointing for HPC, ICDCS, 2012.

E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Trans. Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, pp.375-408, 2002.

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, 2009.

C. Engelmann and S. Böhm, Redundant execution of HPC applications with MR-MPI, PDCN, IASTED, 2011.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, SC'11, 2011.

C. George and S. S. Vadhiyar, ADFT: An adaptive framework for fault tolerance on large scale systems using application malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.

T. Hérault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.

G. Kandaswamy, A. Mandal, and D. A. Reed, Fault tolerance and recovery of scientific workflows on computational grids, Proc. of CCGrid, pp.777-782, 2008.

T. Leblanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.

R. Lucas, J. Ang, K. Bergman, S. Borkar, W. Carlson et al., Top ten exascale research challenges, DOE ASCAC subcommittee report, pp.1-86, 2014.

R. E. Lyons and W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

D. P. Mehta, C. Shetters, and D. W. Bouldin, Meta-Algorithms for Scheduling a Chain of Coarse-Grained Tasks on an Array of Reconfigurable FPGAs, VLSI Design, 2013.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, 2013.

J. Plank, K. Li, and M. Puening, Diskless checkpointing, IEEE Trans. Parallel Dist. Systems, vol.9, issue.10, pp.972-986, 1998.

M. W. Rashid and M. C. Huang, Supporting highly-decoupled thread-level redundancy for parallel programs, Proc. HPCA'2008, pp.393-404, 2008.

B. Schroeder and G. A. Gibson, Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.

L. Silva and J. Silva, Using two-level stable storage for efficient checkpointing, IEE Proceedings-Software, vol.145, issue.6, pp.198-202, 1998.

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off?, FTXS, 2012.

O. Subasi, J. Arias, O. Unsal, J. Labarta, and A. Cristal, Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015.

O. Subasi, G. Yalcin, F. Zyulkyarov, O. Unsal, and J. Labarta, Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications, Proc. CCGrid'2017, pp.452-457, 2017.

D. Talia, Workflow Systems for Science: Concepts and Tools, ISRN Software Engineering, 2013.

S. Toueg and Ö. Babaoglu, On the optimum checkpoint selection problem, SIAM J. Comput, vol.13, issue.3, pp.630-649, 1984.

N. H. Vaidya, A case for two-level distributed recovery schemes, SIGMETRICS Perform. Eval. Rev, vol.23, issue.1, pp.64-73, 1995.

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using replication and checkpointing for reliable task management in computational grids, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.

J. Yu, D. Jian, Z. Wu, and H. Liu, Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011.

G. Zheng, L. Shi, and L. V. Kalé, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE Int. Conf. on Cluster Computing, pp.93-103, 2004.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, Cluster Computing, 2009.

Z. Zheng, L. Yu, and Z. Lan, Reliability-aware speedup models for parallel applications with coordinated checkpointing/restart, IEEE Trans. Computers, vol.64, issue.5, pp.1402-1415, 2015.