A. H. Baker, R. D. Falgout, and U. M. Yang, An assumed partition algorithm for determining processor inter-communication, Parallel Computing, vol.32, issue.5-6, pp.394-414, 2006.
DOI : 10.1016/j.parco.2006.06.009

L. Bautista-Gomez, N. Maruyama, D. Komatitsch, S. Tsuboi, F. Cappello et al., FTI: High Performance Fault Tolerance Interface for Hybrid Systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-00721216

L. Bautista-Gomez, T. Ropars, N. Maruyama, F. Cappello, and S. Matsuoka, Hierarchical Clustering Strategies for Fault Tolerance in Large Scale HPC Systems, IEEE Cluster 2012, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01121947

A. Bouteiller, G. Bosilca, and J. Dongarra, Redesigning the Message Logging Model for High Performance, Concurrency and Computation: Practice and Experience, pp.2196-2211, 2010.

A. Bouteiller, B. Collin, T. Herault, P. Lemarinier, and F. Cappello, Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI, 19th IEEE International Parallel and Distributed Processing Symposium, p.97, 2005.
DOI : 10.1109/IPDPS.2005.249

A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Proceedings of the 17th international conference on Parallel processing, Euro-Par'11, pp.51-64, 2011.
DOI : 10.1007/978-3-642-23397-5_6

A. Bouteiller, T. Ropars, G. Bosilca, C. Morin, and J. Dongarra, Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure recovery, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289157
URL : https://hal.archives-ouvertes.fr/inria-00424017

F. Cappello, A. Guermouche, and M. Snir, On Communication Determinism in Parallel HPC Applications, 2010 Proceedings of 19th International Conference on Computer Communications and Networks, 2010.
DOI : 10.1109/ICCCN.2010.5560143

J. Chung, I. Lee, M. Sullivan, J. H. Ryoo, D. W. Kim et al., Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems, IEEE/ACM SuperComputing 2012, SC '12, pp.58:1-58:11, 2012.

J. Dongarra, P. Beckman, and T. Moore, The International Exascale Software Project roadmap, International Journal of High Performance Computing Applications, vol.25, issue.1, pp.3-60, 2011.
DOI : 10.1177/1094342010391989

G. Dózsa, S. Kumar, P. Balaji, D. Buntinas, D. Goodell et al., Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems, Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface, EuroMPI'10, pp.11-20, 2010.
DOI : 10.1007/978-3-642-15646-5_2

P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, Algorithm-based fault tolerance for dense matrix factorizations, Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pp.225-234, 2012.

E. N. Elnozahy, System Resilience at Extreme Scale, 2008.

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

K. Ferreira, J. Stearley, J. H. Laros III, R. Oldfield et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.44:1-44:12, 2011.
DOI : 10.1145/2063384.2063443

D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira et al., Detection and correction of silent data corruption for large-scale high-performance computing, IEEE/ACM SuperComputing 2012, pp.78:1-78:12, 2012.

A. Guermouche, T. Ropars, E. Brunet, M. Snir, and F. Cappello, Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications, 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS2011), 2011.

A. Guermouche, T. Ropars, M. Snir, and F. Cappello, HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.111
URL : https://hal.archives-ouvertes.fr/hal-01121941

V. E. Henson and U. M. Yang, BoomerAMG: A parallel algebraic multigrid solver and preconditioner, Applied Numerical Mathematics, vol.41, issue.1, pp.155-177, 2002.
DOI : 10.1016/S0168-9274(01)00115-5

D. B. Johnson and W. Zwaenepoel, Sender-Based Message Logging, Digest of Papers: The 17th Annual International Symposium on Fault-Tolerant Computing, pp.14-19, 1987.

R. Koo and S. Toueg, Checkpointing and Rollback-Recovery for Distributed Systems, Proceedings of 1986 ACM Fall joint computer conference, ACM '86, pp.1150-1158, 1986.
DOI : 10.1109/TSE.1987.232562
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.1616

L. Lamport, Time, clocks, and the ordering of events in a distributed system, Communications of the ACM, vol.21, issue.7, pp.558-565, 1978.
DOI : 10.1145/359545.359563

T. Mattson, B. Sanders, and B. Massingill, Patterns for Parallel Programming, 2004.

E. Meneses, C. L. Mendes, and L. V. Kalé, Team-Based Message Logging: Preliminary Results, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
DOI : 10.1109/CCGRID.2010.110

A. Moody, G. Bronevetsky, K. Mohror, and B. R. Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010.
DOI : 10.1109/SC.2010.18

R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, M. R. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007.
DOI : 10.1109/MSST.2007.4367962

R. Riesen, K. Ferreira, D. Da Silva, P. Lemarinier, D. Arnold et al., Alleviating scalability issues of checkpointing protocols, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, pp.18:1-18:11, 2012.
DOI : 10.1109/SC.2012.18

T. Ropars, A. Guermouche, B. Uçar, E. Meneses, L. V. Kalé et al., On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications, Proceedings of the 17th international conference on Parallel processing, Euro-Par'11, pp.567-578, 2011.
DOI : 10.1002/cpe.1364
URL : https://hal.archives-ouvertes.fr/hal-00786558

T. Ropars and C. Morin, Active optimistic and distributed message logging for message-passing applications, Concurrency and Computation: Practice and Experience, pp.2167-2178, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00727470