L. Alvisi and K. Marzullo, Message Logging : Pessimistic, Optimistic, Causal, and Optimal, IEEE Transactions on Software Engineering, vol.24, issue.2, pp.149-159, 1998.
DOI : 10.1109/32.666828

L. Alvisi, S. Rao, S. A. Husain, A. De Mel, E. Elnozahy et al., Analysis of Communication-Induced Checkpointing, FTCS '99 : Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, p.242, 1999.

K. Antypas, J. Shalf, and H. Wasserman, NERSC-6 Workload Analysis and Benchmark Selection Process, 2008.
DOI : 10.2172/938789
URL : https://digital.library.unt.edu/ark:/67531/metadc896561/m2/1/high_res_d/938789.pdf

K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands et al., The Landscape of Parallel Computing Research : A View from Berkeley, 2006.

The Sequoia Benchmarks, 2009.

D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter et al., The NAS parallel benchmarks-summary and preliminary results, Supercomputing '91 : Proceedings of the 1991 ACM/IEEE conference on Supercomputing, pp.158-165, 1991.

B. Bhargava and S. Lian, Independent checkpointing and concurrent rollback for recovery in distributed systems-an optimistic approach, Proceedings of the Seventh Symposium on Reliable Distributed Systems, pp.3-12, 1988.

K. Bhatia, K. Marzullo, and L. Alvisi, Scalable causal message logging for wide-area environments, Concurrency and Computation : Practice and Experience, vol.15, issue.10, pp.873-889, 2003.
DOI : 10.1007/3-540-44681-8_120

A. Borg, J. Baumbach, and S. Glazer, Message System Supporting Fault Tolerance, SIGOPS Operating Systems Review, vol.17, issue.5, pp.90-99, 1983.
DOI : 10.1145/800217.806617

A. Bouteiller, G. Bosilca, and J. Dongarra, Redesigning the Message Logging Model for High Performance, Concurrency and Computation : Practice and Experience, vol.22, pp.2196-2211, 2010.

A. Bouteiller, T. Hérault, G. Bosilca, and J. Dongarra, Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Euro-Par 2011 Parallel Processing, vol.6853, pp.51-64, 2011.

A. Bouteiller, T. Ropars, G. Bosilca, C. Morin, and J. Dongarra, Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery, IEEE International Conference on Cluster Computing, 2009.

D. Briatico, A. Ciuffoletti, and L. Simoncini, Distributed Domino Effect Free Recovery Algorithm, IEEE International Symposium on Reliability, Distributed Softwares, and Databases, pp.207-215, 1984.

F. Cappello, Fault Tolerance in Petascale/Exascale Systems : Current Knowledge, Challenges and Research Opportunities, International Journal of High Performance Computing Applications, vol.23, pp.212-226, 2009.

F. Cappello, A. Guermouche, and M. Snir, On Communication Determinism in Parallel HPC Applications, 19th International Conference on Computer Communications and Networks (ICCCN 2010), 2010.

K. Chandy and L. Lamport, Distributed Snapshots : Determining Global States of Distributed Systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.

J. Daly, A model for predicting the optimum checkpoint interval for restart dumps, Proceedings of the 2003 international conference on Computational science, ICCS'03, pp.3-12, 2003.

J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, pp.303-312, 2006.

O. P. Damani and V. K. Garg, How to Recover Efficiently and Asynchronously when Optimism Fails, International Conference on Distributed Computing Systems, pp.108-115, 1996.

E. N. Elnozahy, System Resilience at Extreme Scale, DARPA, 2008.

E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson, Survey of Rollback-Recovery Protocols in Message-Passing Systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.

E. N. Elnozahy and W. Zwaenepoel, Manetho : Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit, IEEE Transactions on Computers, vol.41, issue.5, pp.526-531, 1992.

Q. Gao, W. Huang, M. J. Koop, and D. K. Panda, Group-based Coordinated Checkpointing for MPI : A Case Study on InfiniBand, Proceedings of the 2007 International Conference on Parallel Processing, ICPP '07, p.47, 2007.

A. Guermouche, T. Ropars, E. Brunet, M. Snir, and F. Cappello, Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications, 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2011), Anchorage, 2011.

B. Gupta, R. Nikolaev, and R. Chirra, Recovery Scheme for Cluster Federations Using Sender-based Message Logging, Journal of Computing and Information Technology, vol.19, issue.2, pp.127-139, 2011.

B. Gupta, S. Rahimi, V. Allam, and V. Jupally, Domino-effect free crash recovery for concurrent failures in cluster federation, Proceedings of the 3rd international conference on Advances in grid and pervasive computing, GPC'08, pp.4-17, 2008.

J. C. Ho, C. Wang, and F. C. Lau, Scalable Group-Based Checkpoint/Restart for Large-Scale Message-Passing Systems, 22nd IEEE International Parallel and Distributed Processing Symposium, 2008.

D. B. Johnson and W. Zwaenepoel, Sender-Based Message Logging, The 17th Annual International Symposium on Fault-Tolerant Computing, pp.14-19, 1987.

P. Kogge, Exascale Computing Study : Technology Challenges in Achieving Exascale Systems, 2008.

D. Komatitsch, S. Tsuboi, C. Ji, and J. Tromp, A 14.6 billion degrees of freedom, 5 teraflops, 2.5 terabyte earthquake simulation on the Earth Simulator, pp.4-11, 2003.

R. Koo and S. Toueg, Checkpointing and Rollback-Recovery for Distributed Systems, ACM Fall joint computer conference, ACM, vol.86, pp.1150-1158, 1986.

L. Lamport, Time, Clocks, and the Ordering of Events in a Distributed System, Communications of the ACM, vol.21, pp.558-565, 1978.

B. W. Lampson and H. E. Sturgis, Crash Recovery in a Distributed Data Storage System, 1979.

E. Meneses, C. L. Mendes, and L. V. Kalé, Team-based Message Logging : Preliminary Results, 3rd Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, 2010.

Message Passing Interface Forum, MPI : A Message-Passing Interface Standard, 1995.

S. Monnet, C. Morin, and R. Badrinath, Hybrid Checkpointing for Parallel Applications in Cluster Federations, Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid (CCGRID'04), pp.773-782, 2004.
URL : https://hal.archives-ouvertes.fr/inria-00000990

A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC '10, pp.1-11, 2010.

N. Neves and W. K. Fuchs, Using Time to Improve the Performance of Coordinated Checkpointing, Proceedings of the International Computer Performance & Dependability Symposium, pp.282-291, 1996.

R. A. Oldfield, S. Arunagiri, P. J. Teller, S. Seelam, M. R. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, MSST '07 : Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies, pp.30-46, 2007.

B. Randell, System structure for software fault tolerance, Proceedings of the international conference on Reliable software, pp.437-449, 1975.

T. Ropars, A. Guermouche, B. Uçar, E. Meneses, L. V. Kalé et al., On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications, Euro-Par, issue.1, pp.567-578, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00786558

T. Ropars and C. Morin, Fault [...], Workshop on Resiliency in High Performance Computing, p.5, 2008.

T. Ropars and C. Morin, Optimistic Message Logging for Reliable Execution of MPI Applications, 15th International Euro-Par Conference, pp.615-626, 2009.
URL : https://hal.archives-ouvertes.fr/inria-00424002

T. Ropars and C. Morin, Optimistic and Distributed Message Logging for Message-Passing Applications, Concurrency and Computation : Practice and Experience, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00727470

A. P. Sistla and J. L. Welch, Efficient Distributed Recovery Using Message Logging, Proceedings of the eighth annual ACM Symposium on Principles of distributed computing (PODC '89), pp.223-238, 1989.

S. W. Smith and D. B. Johnson, Minimizing Timestamp Size for Completely Asynchronous Optimistic Recovery with Minimal Rollback, Proceedings of the 15th Symposium on Reliable Distributed Systems (SRDS '96), p.66, 1996.

R. E. Strom and S. Yemini, Optimistic Recovery in Distributed Systems, ACM Transactions on Computer Systems, vol.3, pp.204-226, 1985.

Y. Tamir and C. H. Séquin, Error Recovery in Multicomputers Using Global Checkpoints, International Conference on Parallel Processing, pp.32-41, 1984.

N. Vaidya, Distributed Recovery Units : An Approach for Hybrid and Adaptive Distributed Recovery, 1993.

www.top500.org, TOP500 List of Worlds Supercomputers, 2011.

J. Yang, K. F. Li, W. Li, and D. Zhang, Trading Off Logging Overhead and Coordinating Overhead to Achieve Efficient Rollback Recovery, Concurrency and Computation : Practice and Experience, vol.21, pp.819-853, 2009.