Supercomputer-sites, 2016.

A. Benoit, A. Cavelan, V. Le Fèvre, and Y. Robert, Optimal checkpointing period with replicated execution on heterogeneous platforms, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02082847

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward exascale resilience: 2014 update, Supercomputing frontiers and innovations, vol.1, issue.1, 2014.

H. Casanova, M. Bougeret, Y. Robert, F. Vivien, and D. Zaidouni, Using group replication for resilience on exascale systems, Int. Journal of High Performance Computing Applications, vol.28, issue.2, pp.210-224, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00668016

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comp. Syst, vol.22, issue.3, pp.303-312, 2006.

S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of multi-level checkpoint model for large scale HPC applications, 2014.

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an optimal online checkpoint solution under a two-level HPC checkpoint model, IEEE Trans. Parallel Distributed Systems, vol.28, issue.1, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01263879

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, 2009.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the Viability of Process Replication Reliability for Exascale Systems, SC'11, 2011.

T. Hérault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.

T. LeBlanc, R. Anand, E. Gabriel, and J. Subhlok, VolpexMPI: An MPI library for execution of parallel applications on volatile nodes, 16th European PVM/MPI Users' Group Meeting, pp.124-133, 2009.

B. Mills, T. Znati, and R. Melhem, Shadow computing: An energy-aware fault tolerant computing model, Int. Conf. on Computing, Networking and Communications (ICNC), pp.73-77, 2014.

M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, 2010.

F. Quaglia, A cost model for selecting checkpoint positions in time warp parallel simulation, IEEE Trans. Parallel Dist. Syst, vol.12, issue.4, pp.346-362, 2001.

B. Schroeder and G. A. Gibson, Understanding Failures in Petascale Computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.

M. Snir, Addressing failures in exascale computing, Int. J. High Perform. Comput. Appl, vol.28, issue.2, pp.129-173, 2014.

S. Yi, D. Kondo, B. Kim, G. Park, and Y. Cho, Using Replication and Checkpointing for Reliable Task Management in Computational Grids, SC'10, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788867

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, Cluster Computing, 2009.