Checkpointing strategies with prediction windows, Dependable Computing (PRDC), 2013 IEEE 19th Pacific Rim International Symposium on, pp.1-10, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00847622
Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00908446
Checkpointing strategies for parallel jobs, Proceedings of SC'11, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00560582
Toward Exascale Resilience, Int. Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
Combining process replication and checkpointing for resilience on exascale systems, INRIA, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00697180
A higher order estimate of the optimum checkpoint interval for restart dumps, FGCS, vol.22, issue.3, pp.303-312, 2004.
The case for modular redundancy in large-scale high performance computing systems, Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp.189-194, 2009.
Evaluating the Viability of Process Replication Reliability for Exascale Systems, Proc. of the ACM/IEEE SC Conf, 2011.
On Ramanujan's Q-Function, J. Computational and Applied Mathematics, vol.58, pp.103-116, 1995.
Taming of the shrew: Modeling the normal and faulty behavior of large-scale HPC systems, Proc. IPDPS'12, 2012.
Failure prediction for HPC systems and applications: Current situation and open issues, Int. J. High Perform. Comput. Appl, vol.27, issue.3, pp.273-282, 2013.
Fundamentals of fault-tolerant distributed computing in asynchronous environments, ACM Computing Surveys, vol.31, issue.1, 1999.
Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.
Superposition of renewal processes and an application to multi-server queues, Statistics & Probability Letters, vol.76, issue.17, pp.1914-1924, 2006.
Scheduling Task Parallel Applications for Rapid Application Turnaround on Enterprise Desktop Grids, J. Grid Computing, vol.5, issue.4, pp.379-405, 2007.
Introduction to Probability Models, Eleventh Edition, 2009.
Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.
Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, Dependable Systems and Networks Workshops (DSN-W), pp.259-264, 2011.
Reliability-aware scalability models for high performance computing, Proc. of the IEEE Conference on Cluster Computing, 2009.
A practical failure prediction with location and lead time for Blue Gene/P, Dependable Systems and Networks Workshops (DSN-W), pp.15-22, 2010.