A. Feinberg, An 83,000-processor supercomputer can only match 1% of your brain, 2013, available at http://gizmodo.com/an-83-000-processor-supercomputer-only-matched-one-perc-1045026757

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing, Future Generation Computer Systems, vol.51, pp.7-19, 2015.
DOI : 10.1016/j.future.2015.04.003

URL : https://hal.archives-ouvertes.fr/hal-01199752

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, Proc. SC '12, 2012.

A. R. Benson, S. Schmit, and R. Schreiber, Silent error detection in numerical time-stepping schemes, International Journal of High Performance Computing Applications, vol.29, issue.4, pp.403-421, 2015.
DOI : 10.1177/1094342014532297

J. P. Walters and V. Chaudhary, A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications, Proc. HiPC'07, pp.257-268, 2007.

J. Walters and V. Chaudhary, Replication-Based Fault Tolerance for MPI Applications, IEEE Transactions on Parallel and Distributed Systems, vol.20, issue.7, pp.997-1010, 2009.
DOI : 10.1109/TPDS.2008.172

S. Di, M. Bouguerra, L. Bautista-Gomez, and F. Cappello, Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014.
DOI : 10.1109/IPDPS.2014.122

S. Di, L. Bautista-Gomez, and F. Cappello, Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 2014.
DOI : 10.1109/SC.2014.79

L. Gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka, Low-overhead diskless checkpoint for hybrid computing systems, 2010 International Conference on High Performance Computing, 2010.
DOI : 10.1109/HIPC.2010.5713163

L. A. Gomez, N. Maruyama, F. Cappello, and S. Matsuoka, Distributed Diskless Checkpoint for Large Scale Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
DOI : 10.1109/CCGRID.2010.40

H. Li, L. Pang, and Z. Wang, Two-Level Incremental Checkpoint Recovery Scheme for Reducing System Total Overheads, PLoS ONE, vol.9, issue.8, 2014.
DOI : 10.1371/journal.pone.0104591

A. Moody, G. Bronevetsky, K. Mohror, and B. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proc. SC'10, 2010.
DOI : 10.2172/984082

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI: High performance fault tolerance interface for hybrid systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011.
DOI : 10.1145/2063384.2063427

URL : https://hal.archives-ouvertes.fr/hal-00721216

A. Kulkarni, A. Manzanares, L. Ionkov, M. Lang, and A. Lumsdaine, The design and implementation of a multi-level content-addressable checkpoint file system, 2012 19th International Conference on High Performance Computing, 2012.
DOI : 10.1109/HiPC.2012.6507514

G. Zheng, X. Ni, and L. Kale, A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012.
DOI : 10.1109/DSNW.2012.6264677

S. Di and F. Cappello, Fast Error-Bounded Lossy HPC Data Compression with SZ, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016.
DOI : 10.1109/IPDPS.2016.11

J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker et al., Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines, Scientific Programming, pp.173-184, 1996.
DOI : 10.1155/1996/483083

S. Di, Y. Robert, F. Vivien, and F. Cappello, Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model, IEEE Transactions on Parallel and Distributed Systems, vol.28, issue.1, 2016.
DOI : 10.1109/TPDS.2016.2546248

URL : https://hal.archives-ouvertes.fr/hal-01263879

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011.
DOI : 10.1145/2063384.2063428

URL : https://hal.archives-ouvertes.fr/hal-00738504

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

H. Jin, Y. Chen, H. Zhu, and X. Sun, Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, 2010.
DOI : 10.1109/ICPP.2010.80

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

N. Chen and S. Ren, Adaptive optimal checkpoint interval and its impact on system's overall quality in soft real-time applications, Proceedings of the 2009 ACM symposium on Applied Computing (SAC '09), 2009.
DOI : 10.1145/1529282.1529506

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, Proc. IPDPS, IEEE, 2008.

N. H. Vaidya, A case for two-level distributed recovery schemes, Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '95/PERFORMANCE '95), pp.64-73, 1995.

D. Hakkarinen and Z. Chen, Multilevel Diskless Checkpointing, IEEE Transactions on Computers, vol.62, issue.4, pp.772-783, 2013.
DOI : 10.1109/TC.2012.17