G. Amdahl, The validity of the single processor approach to achieving large scale computing capabilities, AFIPS Conference Proceedings, vol.30, pp.483-485, 1967.

L. Bautista-Gomez, F. Zyulkyarov, O. Unsal, and S. McIntosh-Smith, Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, vol.55, pp.1-55, 2016.

A. Benoit, F. Cappello, A. Cavelan, P. Raghavan, Y. Robert et al., Identifying the right replication level to detect and correct silent errors at scale, FTXS'2017, the Workshop on Fault-Tolerance for HPC at Extreme Scale, in conjunction with HPDC'2017, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02082907

A. Benoit, A. Cavelan, F. Ciorba, V. Le Fèvre, and Y. Robert, Combining checkpointing and replication for reliable execution of linear workflows, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963655

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Assessing general-purpose algorithms to cope with fail-stop and silent errors, ACM Trans. Parallel Comput, vol.3, issue.2, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01066664

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal resilience patterns to cope with fail-stop and silent errors, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01215857

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00696154

E. S. Buneci, Qualitative Performance Analysis for Large-Scale Scientific Workflows, 2008.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, Int. J. High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014.

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Gen. Comp. Syst, vol.51, pp.7-19, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01199752

K. M. Chandy and L. Lamport, Distributed snapshots: Determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.

S. P. Crago, D. I. Kang, M. Kang, R. Kost, K. Singh et al., Programming models and development software for a space-based many-core processor, 4th Int. Conf. on Space Mission Challenges for Information Technology, pp.95-102, 2011.
DOI : 10.1109/smc-it.2011.29

V. Cuevas-Vicenttín, S. C. Dey, S. Köhler, S. Riddle, and B. Ludäscher, Scientific workflows and provenance: Introduction and research opportunities, Datenbank-Spektrum, vol.12, issue.3, pp.193-203, 2012.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006.

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an optimal online checkpoint solution under a two-level HPC checkpoint model, IEEE Trans. Parallel & Distributed Systems, 2016.
DOI : 10.1109/tpds.2016.2546248
URL : https://hal.archives-ouvertes.fr/hal-01263879

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining partial redundancy and checkpointing for HPC, ICDCS, 2012.
DOI : 10.1109/icdcs.2012.56
URL : http://moss.csc.ncsu.edu/%7Emueller/ftp/pub/mueller/papers/icdcs12.pdf

E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Trans. Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, pp.375-408, 2002.

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, 2009.

C. Engelmann and S. Böhm, Redundant execution of HPC applications with MR-MPI, PDCN, IASTED, 2011.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, SC'11, 2011.

C. George and S. S. Vadhiyar, ADFT: An adaptive framework for fault tolerance on large scale systems using application malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.
DOI : 10.1016/j.procs.2012.04.018
URL : https://doi.org/10.1016/j.procs.2012.04.018

T. Hérault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.

G. Kandaswamy, A. Mandal, and D. A. Reed, Fault tolerance and recovery of scientific workflows on computational grids, Proc. of CCGrid, pp.777-782, 2008.

T. Leblanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.

R. Lucas, J. Ang, K. Bergman, S. Borkar, W. Carlson et al., Top ten exascale research challenges. DOE ASCAC subcommittee report, pp.1-86, 2014.
DOI : 10.2172/1222713

R. E. Lyons and W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability, IBM J. Res. Dev, vol.6, issue.2, pp.200-209, 1962.

D. P. Mehta, C. Shetters, and D. W. Bouldin, Meta-Algorithms for Scheduling a Chain of Coarse-Grained Tasks on an Array of Reconfigurable FPGAs, 2013.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010.

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, 2013.

T. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, issue.4, pp.553-557, 1994.

J. Plank, K. Li, and M. Puening, Diskless checkpointing, IEEE Trans. Parallel Dist. Systems, vol.9, issue.10, pp.972-986, 1998.

M. W. Rashid and M. C. Huang, Supporting highly-decoupled thread-level redundancy for parallel programs, Proc. HPCA'2008, pp.393-404, 2008.

B. Schroeder and G. A. Gibson, Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.

L. Silva and J. Silva, Using two-level stable storage for efficient checkpointing, IEE Proceedings-Software, vol.145, issue.6, pp.198-202, 1998.

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off?, FTXS, 2012.

O. Subasi, J. Arias, O. Unsal, J. Labarta, and A. Cristal, Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015.

O. Subasi, G. Yalcin, F. Zyulkyarov, O. Unsal, and J. Labarta, Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications, Proc. CCGrid'2017, pp.452-457, 2017.

D. Talia, Workflow Systems for Science: Concepts and Tools, ISRN Software Engineering, 2013.

Top500, Top500 Supercomputer Sites.

S. Toueg and O. Babaoglu, On the optimum checkpoint selection problem, SIAM J. Comput, vol.13, issue.3, pp.630-649, 1984.

N. H. Vaidya, A case for two-level distributed recovery schemes, SIGMETRICS Perform. Eval. Rev, vol.23, issue.1, pp.64-73, 1995.

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using replication and checkpointing for reliable task management in computational grids, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.

J. Yu, D. Jian, Z. Wu, and H. Liu, Thread-level redundancy fault tolerant CMP based on relaxed input replication, 2011.

G. Zheng, L. Shi, and L. V. Kalé, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE Int. Conf. on Cluster Computing, pp.93-103, 2004.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, Cluster Computing, 2009.

Z. Zheng, L. Yu, and Z. Lan, Reliability-aware speedup models for parallel applications with coordinated checkpointing/restart, IEEE Trans. Computers, vol.64, issue.5, pp.1402-1415, 2015.

J. Ziegler, H. Muhlfeld, C. Montrose, H. Curtis, T. O'Gorman et al., Accelerated testing for cosmic soft-error rate, IBM J. Res. Dev, vol.40, issue.1, pp.51-72, 1996.

J. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.

J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, and B. Chin, IBM experiments in soft fails in computer electronics, IBM J. Res. Dev, vol.40, issue.1, pp.3-18, 1996.
DOI : 10.1147/rd.401.0003