G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing Strategies with Prediction Windows, 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing, pp.1-10, 2013.
DOI : 10.1109/PRDC.2013.9

URL : https://hal.archives-ouvertes.fr/hal-00789109

G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni, Checkpointing algorithms and fault prediction, Journal of Parallel and Distributed Computing, vol.74, issue.2, pp.2048-2064, 2014.
DOI : 10.1016/j.jpdc.2013.10.010

URL : https://hal.archives-ouvertes.fr/hal-00788313

A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, vol.1, issue.1, pp.11-33, 2004.
DOI : 10.1109/TDSC.2004.2

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428

URL : https://hal.archives-ouvertes.fr/hal-00738504

F. Cappello, A. Geist, B. Gropp, L. V. Kalé, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.374-388, 2009.
DOI : 10.1177/1094342009347767

H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni, Combining process replication and checkpointing for resilience on exascale systems, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00697180

K. M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2006.
DOI : 10.1016/j.future.2004.11.016

C. Engelmann, H. H. Ong, and S. L. Scott, The case for modular redundancy in large-scale high performance computing systems, Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pp.189-194, 2009.

K. Ferreira, J. Stearley, J. H. Laros III, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443

P. Flajolet, P. J. Grabner, P. Kirschenhofer, and H. Prodinger, On Ramanujan's Q-function, Journal of Computational and Applied Mathematics, vol.58, issue.1, pp.103-116, 1995.
DOI : 10.1016/0377-0427(93)E0258-N

A. Gainaru, F. Cappello, and W. Kramer, Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.107

A. Gainaru, F. Cappello, M. Snir, and W. Kramer, Failure prediction for HPC systems and applications: Current situation and open issues, International Journal of High Performance Computing Applications, vol.27, issue.3, pp.273-282, 2013.
DOI : 10.1177/1094342013488258

F. Gärtner, Fundamentals of fault-tolerant distributed computing in asynchronous environments, ACM Computing Surveys, vol.31, issue.1, 1999.
DOI : 10.1145/311531.311532

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Computers, vol.33, issue.6, pp.518-528, 1984.

D. Kondo, A. Chien, and H. Casanova, Scheduling Task Parallel Applications for Rapid Turnaround on Enterprise Desktop Grids, Journal of Grid Computing, vol.290, issue.5-6, pp.379-405, 2007.
DOI : 10.1007/s10723-007-9063-y

R. E. Lyons and W. Vanderkulk, The Use of Triple-Modular Redundancy to Improve Computer Reliability, IBM Journal of Research and Development, vol.6, issue.2, pp.200-209, 1962.
DOI : 10.1147/rd.62.0200

S. M. Ross, Introduction to Probability Models, Eleventh Edition, 2009.

B. Schroeder and G. Gibson, Understanding failures in petascale computers, Journal of Physics: Conference Series, vol.78, issue.1, 2007.
DOI : 10.1088/1742-6596/78/1/012022

J. W. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

L. Yu, Z. Zheng, Z. Lan, and S. Coghlan, Practical online failure prediction for Blue Gene/P: Period-based vs event-driven, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.259-264, 2011.
DOI : 10.1109/DSNW.2011.5958823

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman, A practical failure prediction with location and lead time for Blue Gene/P, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), pp.15-22, 2010.
DOI : 10.1109/DSNW.2010.5542627