J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert et al., The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714

V. Sarkar, Exascale software study: Software challenges in extreme scale systems White paper available at: http://users.ece, 2009.

K. Ferreira, J. Stearley, J. Laros, R. Oldfield, K. Pedretti et al., Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011.
DOI : 10.1145/2063384.2063443

A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, and F. Cappello, MPICH-V: a multiprotocol fault tolerant MPI, IJHPCA, vol.20, issue.3, pp.319-33310, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00688637

S. Rao, L. Alvisi, H. Viny, and D. Sciences, Egida: an extensible toolkit for low-overhead fault-tolerance, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352), pp.48-55, 1999.
DOI : 10.1109/FTCS.1999.781033

E. Elnozahy, L. Alvisi, Y. Wang, and D. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525

K. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456

S. Rao, L. Alvisi, and H. Vin, The cost of recovery in message logging protocols, 17th Symposium on Reliable Distributed Systems (SRDS), pp.10-18, 1998.

A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Proc. of Euro-Par'11 (II), pp.51-64, 2011.
DOI : 10.1007/978-3-642-23397-5_6
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.472.2597

A. Guermouche, T. Ropars, M. Snir, and F. Cappello, HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.111
URL : https://hal.archives-ouvertes.fr/hal-01121941

E. Meneses, C. Kalé, and L. , Team-Based Message Logging: Preliminary Results, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
DOI : 10.1109/CCGRID.2010.110
URL : http://charm.cs.illinois.edu/newPapers/10-02/paper.pdf

J. Young, A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115

J. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016

Y. Ling, M. J. Lin, and X. , A variational calculus approach to optimal checkpoint placement, IEEE Trans. Computers, pp.699-708, 2001.

T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006.
DOI : 10.1109/TDSC.2006.22

M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien, Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504

J. Plank and M. Thomason, Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, vol.61, issue.11, p.1590, 2001.
DOI : 10.1006/jpdc.2001.1757

H. Jin, Y. Chen, H. Zhu, and X. Sun, Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, pp.525-534, 2010.
DOI : 10.1109/ICPP.2010.80

L. Wang, P. Karthik, Z. Kalbarczyk, R. Iyer, L. Votta et al., Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005.
DOI : 10.1109/DSN.2005.67

R. Oldfield, S. Arunagiri, P. Teller, S. Seelam, M. Varela et al., Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-4610, 2007.
DOI : 10.1109/MSST.2007.4367962

Z. Zheng and Z. Lan, Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177

M. Bouguerra, D. Trystram, and F. Wagner, Complexity Analysis of Checkpoint Scheduling with Variable Costs, IEEE Transactions on Computers, vol.62, issue.6, 2012.
DOI : 10.1109/TC.2012.57
URL : https://hal.archives-ouvertes.fr/hal-00788101

M. Wu, X. Sun, and J. H. , Performance under failures of high-end computing, Proceedings of the 2007 ACM/IEEE conference on Supercomputing , SC '07, 2007.
DOI : 10.1145/1362622.1362687

T. Heath, R. Martin, and T. Nguyen, Improving cluster availability using workstation validation, ACM SIGMETRICS Performance Evaluation Review, vol.30, issue.1, pp.217-227, 2002.
DOI : 10.1145/511399.511362

B. Schroeder and G. Gibson, A large-scale study of failures in high-performance computing systems, Proc. of DSN, 2006.

Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun et al., An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS'08, pp.1-9, 2008.

E. Heien, D. Kondo, A. Gainaru, D. Lapine, B. Kramer et al., Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011.
DOI : 10.1145/2063384.2063444

A. Bouteiller, G. Bosilca, and J. Dongarra, Redesigning the message logging model for high performance, Concurrency and Computation: Practice and Experience, vol.20, issue.5, pp.2196-2211, 2010.
DOI : 10.1002/cpe.1589

L. Cannon, A cellular computer to implement the Kalman filter algorithm, 1969.

S. Sumimoto, An Overview of Fujitsu's Lustre Based File System. Lustre Filesystem Users' Group Meeting, 2011.

S. Agarwal, R. Garg, M. Gupta, and J. Moreira, Adaptive incremental checkpointing for massively parallel systems, Proceedings of the 18th annual international conference on Supercomputing , ICS '04, pp.277-286, 2004.
DOI : 10.1145/1006209.1006248

R. Gioiosa, J. Sancho, S. Jiang, F. Petrini, and K. Davis, Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers, ACM/IEEE SC 2005 Conference (SC'05), 2010.
DOI : 10.1109/SC.2005.76

A. Moody, G. Bronevetsky, K. Mohror, and S. Brd, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proceedings of the ACM/IEEE SC Conference, pp.1-11, 2010.

J. Plank, K. Li, and M. Puening, Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.
DOI : 10.1109/71.730527
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.4662

L. Gomez, N. Maruyama, F. Cappello, and S. Matsuoka, Distributed Diskless Checkpoint for Large Scale Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.63-72, 2010.
DOI : 10.1109/CCGRID.2010.40

X. Ouyang, S. Marcarelli, and D. Panda, Enhancing checkpoint performance with staging io and ssd Proceedings of the 2010 International Workshop on Storage Network Architecture and Parallel I/Os, SNAPI '10, 2010.