G. Aupy, Y. Robert, and F. Vivien, Assuming failure independence: are we right to be wrong?, in FTS'2017, the Workshop on Fault-Tolerant Systems, in conjunction with Cluster'2017, 2017.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.398-407, 2010.
DOI : 10.1109/CCGRID.2010.71
URL : https://hal.archives-ouvertes.fr/hal-00788866

Tsubame, Failure history

L. Bautista-Gomez, F. Zyulkyarov, O. Unsal, and S. McIntosh-Smith, Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, 2016.
DOI : 10.1109/SC.2016.54

L. Bautista-Gomez, A. Gainaru, S. Perarnau, D. Tiwari, S. Gupta et al., Reducing Waste in Extreme Scale Systems through Introspective Analysis, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.212-221, 2016.
DOI : 10.1109/IPDPS.2016.100

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, pp.2772-2791, 2014.
DOI : 10.1109/SNAPI.2010.10
URL : https://hal.archives-ouvertes.fr/hal-00696154

G. Zheng, L. Shi, and L. V. Kalé, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, IEEE International Conference on Cluster Computing, pp.93-103, 2004.

X. Ni, E. Meneses, and L. V. Kalé, Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, pp.364-372, 2012.
DOI : 10.1109/CLUSTER.2012.82

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), pp.1-11, 2010.

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011.
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-00721216

A. Gainaru, F. Cappello, M. Snir, and W. Kramer, Fault prediction under the microscope: A closer look into HPC systems, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, p.77, 2012.
DOI : 10.1109/SC.2012.57

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00908446

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), pp.1-11, 2011.
DOI : 10.1145/2063384.2063444

S. Gupta, D. Tiwari, C. Jantzi, J. Rogers, and D. Maxwell, Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp.37-44, 2015.
DOI : 10.1109/DSN.2015.52

D. Tiwari, S. Gupta, and S. S. Vazhkudai, Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp.25-36, 2014.
DOI : 10.1109/DSN.2014.101

E. Gelenbe and M. Hernández, Optimum checkpoints with age dependent failures, Acta Informatica, vol.27, issue.6, pp.519-531, 1990.
DOI : 10.1007/BF00277388