A. Anderson and D. Semmelroth, Statistics for Big Data For Dummies. For Dummies, 2015.

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
DOI : 10.1016/j.jpdc.2013.10.010

URL : https://hal.archives-ouvertes.fr/hal-00788313

L. Bautista-Gomez, A. Gainaru, S. Perarnau, D. Tiwari, S. Gupta et al., Reducing Waste in Extreme Scale Systems through Introspective Analysis, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.212-221, 2016.
DOI : 10.1109/IPDPS.2016.100

L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama et al., FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063427

URL : https://hal.archives-ouvertes.fr/hal-01298430

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, vol.26, issue.17, pp.2772-2791, 2014.
DOI : 10.1109/SNAPI.2010.10

URL : https://hal.archives-ouvertes.fr/hal-00696154

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.119.7694

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

A. Gainaru, F. Cappello, M. Snir, and W. Kramer, Fault prediction under the microscope: A closer look into HPC systems, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, p.77, 2012.
DOI : 10.1109/SC.2012.57

E. Gelenbe and M. Hernández, Optimum checkpoints with age dependent failures, Acta Informatica, vol.27, issue.6, pp.519-531, 1990.
DOI : 10.1007/BF00277388

S. Gupta, D. Tiwari, C. Jantzi, J. Rogers, and D. Maxwell, Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp.37-44, 2015.
DOI : 10.1109/DSN.2015.52

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.1-11, 2011.
DOI : 10.1145/2063384.2063444

S. Y. Ko, I. Hoque, B. Cho, and I. Gupta, Making cloud intermediate data fault-tolerant, Proceedings of the 1st ACM symposium on Cloud computing, SoCC '10, 2010.
DOI : 10.1145/1807128.1807160

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.186.5558

D. Kondo, B. Javadi, A. Iosup, and D. Epema, The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems, Cluster Computing and the Grid, IEEE International Symposium on, pp.398-407, 2010.
URL : https://hal.archives-ouvertes.fr/hal-00788866

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS'08, 2008.

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), pp.1-11, 2010.

X. Ni, E. Meneses, and L. V. Kalé, Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, pp.364-372, 2012.
DOI : 10.1109/CLUSTER.2012.82

B. Schroeder and G. A. Gibson, A large-scale study of failures in high-performance computing systems, Proc. of DSN, pp.249-258, 2006.

K. Schroiff, P. Gemsjaeger, and C. Bolik, Cascading failover of a data management application for shared disk file systems in loosely coupled node clusters, US Patent 6,990,606, 2006.

Y. A. Shardt, Statistics for chemical and process engineers: a modern approach, 2015.
DOI : 10.1007/978-3-319-21509-9

D. Tiwari, S. Gupta, and S. S. Vazhkudai, Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp.25-36, 2014.
DOI : 10.1109/DSN.2014.101

Tsubame, Failure history, 2017.

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

G. Zheng, L. Shi, and L. V. Kalé, FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Cluster Computing, IEEE International Conference on, pp.93-103, 2004.