G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing strategies with prediction windows, Dependable Computing (PRDC), 2013 IEEE 19th Pacific Rim International Symposium on, pp.1-10, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00847622

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
URL : https://hal.archives-ouvertes.fr/hal-00908446

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of SC'11, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00560582

F. Cappello, A. Geist, B. Gropp, L. V. Kalé, B. Kramer et al., Toward Exascale Resilience, Int. Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, Combining process replication and checkpointing for resilience on exascale systems, INRIA, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00697180

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, FGCS, vol.22, issue.3, pp.303-312, 2004.

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp.189-194, 2009.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the Viability of Process Replication Reliability for Exascale Systems, Proc. of the ACM/IEEE SC Conf, 2011.

P. Flajolet, P. J. Grabner, P. Kirschenhofer, and H. Prodinger, On Ramanujan's Q-Function, J. Computational and Applied Mathematics, vol.58, pp.103-116, 1995.

A. Gainaru, F. Cappello, and W. Kramer, Taming of the shrew: Modeling the normal and faulty behavior of large-scale HPC systems, Proc. IPDPS'12, 2012.

A. Gainaru, F. Cappello, M. Snir, and W. Kramer, Failure prediction for HPC systems and applications: Current situation and open issues, Int. J. High Perform. Comput. Appl., vol.27, issue.3, pp.273-282, 2013.

F. Gärtner, Fundamentals of fault-tolerant distributed computing in asynchronous environments, ACM Computing Surveys, vol.31, issue.1, 1999.

T. Hérault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.

O. Kella and W. Stadje, Superposition of renewal processes and an application to multi-server queues, Statistics & probability letters, vol.76, issue.17, pp.1914-1924, 2006.

D. Kondo, A. Chien, and H. Casanova, Scheduling Task Parallel Applications for Rapid Application Turnaround on Enterprise Desktop Grids, J. Grid Computing, vol.5, issue.4, pp.379-405, 2007.

S. M. Ross, Introduction to Probability Models, Eleventh Edition, 2009.

B. Schroeder and G. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.

J. W. Young, A first order approximation to the optimum checkpoint interval, Comm. of the ACM, vol.17, issue.9, pp.530-531, 1974.

L. Yu, Z. Zheng, Z. Lan, and S. Coghlan, Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, Dependable Systems and Networks Workshops (DSN-W), pp.259-264, 2011.

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, Proc. of the IEEE Conference on Cluster Computing, 2009.

Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, A practical failure prediction with location and lead time for Blue Gene/P, Dependable Systems and Networks Workshops (DSN-W), pp.15-22, 2010.