J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

G. Gibson, Failure tolerance in petascale computers, Journal of Physics: Conference Series, p.12022, 2007.

K. Ferreira, J. Stearley, J. H. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011.
DOI : 10.1145/2063384.2063443

E. N. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

A. Bouteiller, T. Herault, G. Bosilca, and J. J. Dongarra, Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Proc. of Euro-Par'11 (II), pp.51-64, 2011.
DOI : 10.1007/978-3-642-23397-5_6

A. Guermouche, T. Ropars, M. Snir, and F. Cappello, HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp.1216-1227, 2012.
DOI : 10.1109/IPDPS.2012.111

URL : https://hal.archives-ouvertes.fr/hal-01121941

G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra et al., Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, vol.9, issue.16, 2012.
DOI : 10.1002/cpe.3173

URL : https://hal.archives-ouvertes.fr/hal-00696154

K. Huang and J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers, vol.100, issue.6, pp.518-528, 1984.

Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun et al., Fault tolerant high performance computing by a coding approach, Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming , PPoPP '05, pp.213-223, 2005.
DOI : 10.1145/1065944.1065973

A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, MPICH-V: a multiprotocol fault tolerant MPI, IJHPCA, vol.20, issue.3, pp.319-333, 2006.
DOI : 10.1177/1094342006067469

URL : https://hal.archives-ouvertes.fr/hal-00688637

A. Bouteiller, F. Cappello, J. Dongarra, A. Guermouche, T. Herault et al., Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization, 2013.
DOI : 10.1007/978-3-642-40047-6_43

URL : https://hal.archives-ouvertes.fr/hal-00926606

H. Miyazaki, Y. Kusano, H. Okano, T. Nakada, K. Seki et al., K computer: 8.162 PetaFLOPS massively parallel scalar supercomputer built with over 548k cores, 2012 IEEE International Solid-State Circuits Conference, pp.192-194, 2012.
DOI : 10.1109/ISSCC.2012.6176971

S. Chakravorty and L. Kale, A Fault Tolerance Protocol with Fast Fault Recovery, 2007 IEEE International Parallel and Distributed Processing Symposium, pp.1-10, 2007.
DOI : 10.1109/IPDPS.2007.370310

X. Yang, Y. Du, P. Wang, H. Fu, and J. Jia, Ftpa: Supporting fault-tolerant parallel computing through parallel recomputing. Parallel and Distributed Systems, IEEE Transactions on, vol.20, issue.10, pp.1471-1486, 2009.

J. L. Gustafson, Reevaluating Amdahl's law, Communications of the ACM, vol.31, issue.5, pp.532-533, 1988.
DOI : 10.1145/42411.42415

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.509.6892

R. Thekkath and S. J. Eggers, The effectiveness of multiple hardware contexts, Proc. 6th int. conf. on Architectural support for programming languages and operating systems. ASPLOS VI, pp.328-337, 1994.

C. Huang, G. Zheng, L. Kalé, and S. Kumar, Performance evaluation of adaptive MPI, Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming , PPoPP '06, pp.12-21, 2006.
DOI : 10.1145/1122971.1122976

A. Bouteiller, H. L. Bouziane, T. Herault, P. Lemarinier, and F. Cappello, Hybrid Preemptive Scheduling of Message Passing Interface Applications on Grids, International Journal of High Performance Computing Applications, vol.20, issue.1, pp.77-90, 2006.
DOI : 10.1177/1094342006062526

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016