Checkpointing Strategies with Prediction Windows, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, pp.1-10, 2013. ,
DOI : 10.1109/PRDC.2013.9
URL : https://hal.archives-ouvertes.fr/hal-00789109
Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014. ,
DOI : 10.1016/j.jpdc.2013.10.010
URL : https://hal.archives-ouvertes.fr/hal-00788313
Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.1, pp.11-33, 2004. ,
DOI : 10.1109/TDSC.2004.2
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009. ,
DOI : 10.1177/1094342009347767
Combining process replication and checkpointing for resilience on exascale systems, 2012. ,
URL : https://hal.archives-ouvertes.fr/hal-00697180
Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985. ,
DOI : 10.1145/214451.214456
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004. ,
DOI : 10.1016/j.future.2004.11.016
The case for modular redundancy in large-scale highh performance computing systems, Proc. of the 8th IASTED Infernational Conference on Parallel and Distributed Computing and Networks (PDCN), pp.189-194, 2009. ,
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063443
On Ramanujan's Q-function, Journal of Computational and Applied Mathematics, vol.58, issue.1, pp.103-116, 1995. ,
DOI : 10.1016/0377-0427(93)E0258-N
Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012. ,
DOI : 10.1109/IPDPS.2012.107
Failure prediction for HPC systems and applications: Current situation and open issues, International Journal of High Performance Computing Applications, vol.27, issue.3, pp.273-282, 2013. ,
DOI : 10.1177/1094342013488258
Fundamentals of fault-tolerant distributed computing in asynchronous environments, ACM Computing Surveys, vol.31, issue.1, 1999. ,
DOI : 10.1145/311531.311532
Algorithm-based fault tolerance for matrix operations, IEEE Trans. Computers, vol.33, issue.6, pp.518-528, 1984. ,
Scheduling Task Parallel Applications for Rapid Turnaround on Enterprise Desktop Grids, Journal of Grid Computing, vol.290, issue.5???6, pp.379-405, 2007. ,
DOI : 10.1007/s10723-007-9063-y
The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962. ,
DOI : 10.1147/rd.62.0200
Introduction to Probability Models, Eleventh Edition, 2009. ,
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007. ,
DOI : 10.1088/1742-6596/78/1/012022
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974. ,
DOI : 10.1145/361147.361115
Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.259-264, 2011. ,
DOI : 10.1109/DSNW.2011.5958823
Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009. ,
DOI : 10.1109/CLUSTR.2009.5289177
A practical failure prediction with location and lead time for Blue Gene/P, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.15-22, 2010. ,
DOI : 10.1109/DSNW.2010.5542627