Assuming failure independence: are we right to be wrong? " in FTS'2017, the Workshop on Fault-Tolerant Systems, in conjunction with Cluster'2017, 2017. ,
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006. ,
DOI : 10.1016/j.future.2004.11.016
The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.398-407, 2010. ,
DOI : 10.1109/CCGRID.2010.71
URL : https://hal.archives-ouvertes.fr/hal-00788866
Failure history ,
Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, 2016. ,
DOI : 10.1109/SC.2016.54
Reducing Waste in Extreme Scale Systems through Introspective Analysis, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.212-221, 2016. ,
DOI : 10.1109/IPDPS.2016.100
Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985. ,
DOI : 10.1145/214451.214456
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, pp.2772-2791, 2014. ,
DOI : 10.1109/SNAPI.2010.10
URL : https://hal.archives-ouvertes.fr/hal-00696154
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI, Cluster Computing IEEE International Conference on, pp.93-103, 2004. ,
Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm, 2012 IEEE International Conference on Cluster Computing, pp.364-372, 2012. ,
DOI : 10.1109/CLUSTER.2012.82
Design, modeling, and evaluation of a scalable multi-level checkpointing system, Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC'10), pp.1-11, 2010. ,
FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-00721216
Fault prediction under the microscope: A closer look into HPC systems, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, p.77, 2012. ,
DOI : 10.1109/SC.2012.57
Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014. ,
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00908446
Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, pp.1-11, 2011. ,
DOI : 10.1145/2063384.2063444
Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp.37-44, 2015. ,
DOI : 10.1109/DSN.2015.52
Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp.25-36, 2014. ,
DOI : 10.1109/DSN.2014.101
Optimum checkpoints with age dependent failures, Acta Informatica, vol.27, issue.6, pp.519-531, 1990. ,
DOI : 10.1007/BF00277388