B. Schroeder and G. A. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, p.12022, 2007.
DOI : 10.1088/1742-6596/78/1/012022

G. Zheng, X. Ni, and L. Kale, A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp.2012-2013
DOI : 10.1109/DSNW.2012.6264677

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.398-407, 2010.
DOI : 10.1109/CCGRID.2010.71

URL : https://hal.archives-ouvertes.fr/inria-00433523

J. Dongarra, The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.441-4412, 2011.
DOI : 10.1145/2063384.2063443

G. Bosilca, Unified model for assessing checkpointing protocols at extreme-scale Concurrency and Computation: Practice and Experience, p.3173, 2013.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984.

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Characterizing the impact of soft errors on iterative methods in scientific computing, Proceedings of the international conference on Supercomputing, ICS '11, pp.152-161, 2011.
DOI : 10.1145/1995896.1995922

P. Du and A. Bouteiller, Algorithm-based fault tolerance for dense matrix factorizations, pp.225-234, 2012.

T. Davies, C. Karlsson, H. Liu, C. Ding, and Z. Chen, High performance linpack benchmark, Proceedings of the international conference on Supercomputing, ICS '11, pp.162-171, 2011.
DOI : 10.1145/1995896.1995923

M. Shantharam, S. Srinivasmurthy, and P. Raghavan, Fault tolerant preconditioned conjugate gradient for sparse linear system solution, Proceedings of the 26th ACM international conference on Supercomputing, ICS '12, pp.69-78, 2012.
DOI : 10.1145/2304576.2304588

E. Agullo, L. Giraud, A. Guermouche, J. Roman, and M. Zounon, Towards resilient parallel linear Krylov solvers: recover-restart strategies, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00843992

Z. Chen, Algorithm-based recovery for iterative methods without checkpointing, Proceedings of the 20th international symposium on High performance distributed computing, HPDC '11, pp.73-84, 2011.
DOI : 10.1145/1996130.1996142

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, Transactions on Computer Systems, pp.63-75, 1985.
DOI : 10.1145/214451.214456

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

J. Plank, K. Li, and M. Puening, Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.
DOI : 10.1109/71.730527

G. Bosilca, R. Delmas, J. Dongarra, and J. Langou, Algorithm-based fault tolerance applied to high performance computing, Journal of Parallel and Distributed Computing, vol.69, issue.4, pp.410-416, 2009.
DOI : 10.1016/j.jpdc.2008.12.002

Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun et al., Fault tolerant high performance computing by a coding approach, Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming , PPoPP '05, pp.213-223, 2005.
DOI : 10.1145/1065944.1065973

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006.
DOI : 10.1109/TDSC.2006.22

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063428

URL : https://hal.archives-ouvertes.fr/hal-00738504

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, pp.1-9, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

G. Bosilca, A. Bouteiller, T. Herault, Y. Robert, and J. Dongarra, Assessing the Impact of ABFT and Checkpoint Composite Strategies, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2013.
DOI : 10.1109/IPDPSW.2014.79

URL : https://hal.archives-ouvertes.fr/hal-01354689

G. Zheng, L. Shi, and L. V. Kale, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Cluster Computing IEEE International Conference on, pp.93-103, 2004.

X. Ni, E. Meneses, and L. V. Kalé, Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, pp.364-372, 2012.
DOI : 10.1109/CLUSTER.2012.82

J. Dongarra, T. Herault, and Y. Robert, Revisiting the Double Checkpointing Algorithm, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, pp.706-715, 2013.
DOI : 10.1109/IPDPSW.2013.11

URL : https://hal.archives-ouvertes.fr/hal-00925168

R. Rajachandrasekar, A. Moody, K. Mohror, and D. K. Panda, A 1 PB/s file system to checkpoint three million MPI tasks, Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, HPDC '13, pp.143-154, 2013.
DOI : 10.1145/2493123.2462908