A. Benoit, A. Cavelan, F. Cappello, P. Raghavan, Y. Robert et al., Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale, Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale, FTXS '17, 2017.
DOI : 10.1147/rd.401.0003

URL : https://hal.archives-ouvertes.fr/hal-01494678

A. Benoit, A. Cavelan, Y. Robert, and H. Sun, Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
DOI : 10.1109/IPDPS.2016.39

URL : https://hal.archives-ouvertes.fr/hal-01354886

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, 2013.
DOI : 10.1109/SNAPI.2010.10

URL : https://hal.archives-ouvertes.fr/hal-00908447

E. S. Buneci, Qualitative Performance Analysis for Large-Scale Scientific Workflows, 2008.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience, The International Journal of High Performance Computing Applications, vol.23, issue.4, p.2014
DOI : 10.1515/9781400882618-003

URL : http://institute.lanl.gov/resilience/docs/Toward%20Exascale%20Resilience.pdf

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015.
DOI : 10.1016/j.future.2015.04.003

URL : https://hal.archives-ouvertes.fr/hal-01199752

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

S. P. Crago, D. I. Kang, M. Kang, R. Kost, K. Singh et al., Programming Models and Development Software for a Space-Based Many-Core Processor, 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology, pp.95-102, 2011.
DOI : 10.1109/SMC-IT.2011.29

V. Cuevas-Vicenttín, S. C. Dey, S. Köhler, S. Riddle, and B. Ludäscher, Scientific Workflows and Provenance: Introduction and Research Opportunities, Datenbank-Spektrum, vol.54, issue.4, pp.193-203, 2012.
DOI : 10.1007/978-3-642-17819-1_23

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.1, 2016.
DOI : 10.1109/TPDS.2016.2546248

URL : https://hal.archives-ouvertes.fr/hal-01353871

J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira et al., Combining Partial Redundancy and Checkpointing for HPC, 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012.
DOI : 10.1109/ICDCS.2012.56

URL : http://moss.csc.ncsu.edu/%7Emueller/ftp/pub/mueller/papers/icdcs12.pdf

E. Elnozahy and J. Plank, Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.2, pp.97-108, 2004.
DOI : 10.1109/TDSC.2004.15

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, PDCN. IASTED, 2009.

C. Engelmann and S. Böhm, Redundant Execution of HPC Applications with MR-MPI, Parallel and Distributed Computing and Networks / 720: Software Engineering, 2011.
DOI : 10.2316/P.2011.719-031

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

C. George and S. S. Vadhiyar, ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability, Procedia Computer Science, vol.9, pp.166-175, 2012.
DOI : 10.1016/j.procs.2012.04.018

G. Kandaswamy, A. Mandal, and D. A. Reed, Fault Tolerance and Recovery of Scientific Workflows on Computational Grids, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pp.777-782, 2008.
DOI : 10.1109/CCGRID.2008.79

T. Leblanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133
DOI : 10.1007/978-3-540-30218-6_19

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

D. P. Mehta, C. Shetters, and D. W. Bouldin, Meta-Algorithms for Scheduling a Chain of Coarse-Grained Tasks on an Array of Reconfigurable FPGAs, VLSI Design, vol.1800, issue.3, 2013.
DOI : 10.1109/12.773794

X. Ni, E. Meneses, N. Jain, and L. V. Kalé, ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, 2013.
DOI : 10.1145/2503210.2503266

J. Plank, K. Li, and M. Puening, Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.
DOI : 10.1109/71.730527

URL : http://www.cs.utk.edu/~plank/papers/CS-97-380.ps.Z

M. W. Rashid and M. C. Huang, Supporting highly-decoupled thread-level redundancy for parallel programs, 14th Int. Conf. on High-Performance Computer Architecture (HPCA), pp.393-404, 2008.

B. Schroeder and G. A. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
DOI : 10.1088/1742-6596/78/1/012022

L. Silva and J. Silva, Using two-level stable storage for efficient checkpointing, IEE Proceedings - Software, vol.145, issue.6, pp.198-202, 1998.
DOI : 10.1049/ip-sen:19982440

J. Stearley, K. B. Ferreira, D. J. Robinson, J. Laros, K. T. Pedretti et al., Does partial replication pay off?, FTXS, 2012.

O. Subasi, J. Arias, O. Unsal, J. Labarta, and A. , Programmer-directed partial redundancy for resilient HPC, Computing Frontiers, 2015.

O. Subasi, G. Yalcin, F. Zyulkyarov, O. Unsal, and J. Labarta, Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp.452-457, 2017.
DOI : 10.1109/CCGRID.2017.40

D. Talia, Workflow Systems for Science: Concepts and Tools, ISRN Software Engineering, vol.37, issue.1, 2013.
DOI : 10.1155/2013/404525

URL : https://doi.org/10.1155/2013/404525

S. Toueg and Ö. Babaoglu, On the Optimum Checkpoint Selection Problem, SIAM Journal on Computing, vol.13, issue.3, pp.630-649, 1984.
DOI : 10.1137/0213039

N. H. Vaidya, A case for two-level distributed recovery schemes, ACM SIGMETRICS Performance Evaluation Review, vol.23, issue.1, pp.64-73, 1995.
DOI : 10.1145/223586.223596

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using replication and checkpointing for reliable task management in computational Grids, 2010 International Conference on High Performance Computing & Simulation, 2010.
DOI : 10.1109/HPCS.2010.5547140

URL : https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. Yu, D. Jian, Z. Wu, and H. Liu, Thread-level redundancy fault tolerant CMP based on relaxed input replication, ICCIT. IEEE, 2011.

G. Zheng, L. Shi, and L. V. Kalé, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE Int. Conf. on Cluster Computing, pp.93-103, 2004.