A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004. ,
DOI : 10.1016/j.future.2004.11.016
Fault prediction under the microscope: A closer look into HPC systems, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, p.12, 2012. ,
DOI : 10.1109/SC.2012.57
Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, 2013. ,
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00788313
Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W), 2011. ,
DOI : 10.1109/DSNW.2011.5958823
Failure Prediction in IBM BlueGene/L Event Logs, Seventh IEEE International Conference on Data Mining (ICDM 2007), pp.583-588, 2007. ,
DOI : 10.1109/ICDM.2007.46
Software rejuvenation: Analysis, module and applications, FTCS '95, p.381, 1995. ,
On the choice of checkpoint interval using memory usage profile and adaptive time series analysis, Proc. Pacific Rim Int. Symp. on Dependable Computing, 2001. ,
A practical failure prediction with location and lead time for Blue Gene/P, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.15-22, 2010. ,
DOI : 10.1109/DSNW.2010.5542627
Preventive migration vs. preventive checkpointing for extreme scale supercomputers. Par, Proc. Letters, pp.111-132, 2011. ,
URL : https://hal.archives-ouvertes.fr/hal-00945068
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, p.11, 2011. ,
DOI : 10.1145/2063384.2063443
A scalable double in-memory checkpoint and restart scheme towards exascale, IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), 2012. ,
DOI : 10.1109/DSNW.2012.6264677
Improving cluster availability using workstation validation. SIGMETRICS Perf, Eval. Rev, vol.30, issue.1, 2002. ,
A large-scale study of failures in highperformance computing systems, In: Proc. of DSN, pp.249-258, 2006. ,
Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, p.11, 2011. ,
DOI : 10.1145/2063384.2063444
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
A Flexible Checkpoint/Restart Model in Distributed Systems, In: PPAM. LNCS, vol.6067, pp.206-215, 2010. ,
DOI : 10.1007/978-3-642-14390-8_22
URL : https://hal.archives-ouvertes.fr/hal-00788926
Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012. ,
DOI : 10.1109/IPDPS.2012.107
Predicting computer system failures using support vector machines, Proceedings of the First USENIX conference on Analysis of system logs, USENIX Association, 2008. ,
Fault-aware runtime strategies for high-performance computing, IEEE TPDS, vol.20, issue.4, pp.460-473, 2009. ,