The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009.
DOI : 10.1177/1094342009347714
Exascale software study: Software challenges in extreme scale systems. White paper available at: http://users.ece, 2009.
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063443
MPICH-V: a multiprotocol fault tolerant MPI, IJHPCA, vol.20, issue.3, pp.319-333, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00688637
Egida: an extensible toolkit for low-overhead fault-tolerance, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352), pp.48-55, 1999.
DOI : 10.1109/FTCS.1999.781033
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525
Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, vol.3, issue.1, pp.63-75, 1985.
DOI : 10.1145/214451.214456
The cost of recovery in message logging protocols, 17th Symposium on Reliable Distributed Systems (SRDS), pp.10-18, 1998.
Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Proc. of Euro-Par'11 (II), pp.51-64, 2011.
DOI : 10.1007/978-3-642-23397-5_6
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.472.2597
HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.111
URL : https://hal.archives-ouvertes.fr/hal-01121941
Team-Based Message Logging: Preliminary Results, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
DOI : 10.1109/CCGRID.2010.110
URL : http://charm.cs.illinois.edu/newPapers/10-02/paper.pdf
A first order approximation to the optimum checkpoint interval, Communications of the ACM, vol.17, issue.9, pp.530-531, 1974.
DOI : 10.1145/361147.361115
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004.
DOI : 10.1016/j.future.2004.11.016
A variational calculus approach to optimal checkpoint placement, IEEE Trans. Computers, pp.699-708, 2001.
Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle, IEEE Transactions on Dependable and Secure Computing, vol.3, issue.2, pp.130-140, 2006.
DOI : 10.1109/TDSC.2006.22
Checkpointing strategies for parallel jobs, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063428
URL : https://hal.archives-ouvertes.fr/hal-00738504
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems, Journal of Parallel and Distributed Computing, vol.61, issue.11, p.1590, 2001.
DOI : 10.1006/jpdc.2001.1757
Optimizing HPC Fault-Tolerant Environment: An Analytical Approach, 2010 39th International Conference on Parallel Processing, pp.525-534, 2010.
DOI : 10.1109/ICPP.2010.80
Modeling Coordinated Checkpointing for Large-Scale Supercomputers, 2005 International Conference on Dependable Systems and Networks (DSN'05), pp.812-821, 2005.
DOI : 10.1109/DSN.2005.67
Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007.
DOI : 10.1109/MSST.2007.4367962
Reliability-aware scalability models for high performance computing, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289177
Complexity Analysis of Checkpoint Scheduling with Variable Costs, IEEE Transactions on Computers, vol.62, issue.6, 2012.
DOI : 10.1109/TC.2012.57
URL : https://hal.archives-ouvertes.fr/hal-00788101
Performance under failures of high-end computing, Proceedings of the 2007 ACM/IEEE conference on Supercomputing, SC '07, 2007.
DOI : 10.1145/1362622.1362687
Improving cluster availability using workstation validation, ACM SIGMETRICS Performance Evaluation Review, vol.30, issue.1, pp.217-227, 2002.
DOI : 10.1145/511399.511362
A large-scale study of failures in high-performance computing systems, Proc. of DSN, 2006.
An optimal checkpoint/restart model for a large scale high performance computing system, IPDPS'08, pp.1-9, 2008.
Modeling and tolerating heterogeneous failures in large parallel systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063444
Redesigning the message logging model for high performance, Concurrency and Computation: Practice and Experience, vol.20, issue.5, pp.2196-2211, 2010.
DOI : 10.1002/cpe.1589
A cellular computer to implement the Kalman filter algorithm, 1969.
An Overview of Fujitsu's Lustre Based File System. Lustre Filesystem Users' Group Meeting, 2011.
Adaptive incremental checkpointing for massively parallel systems, Proceedings of the 18th annual international conference on Supercomputing, ICS '04, pp.277-286, 2004.
DOI : 10.1145/1006209.1006248
Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers, ACM/IEEE SC 2005 Conference (SC'05), 2005.
DOI : 10.1109/SC.2005.76
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, Proceedings of the ACM/IEEE SC Conference, pp.1-11, 2010.
Diskless checkpointing, IEEE Transactions on Parallel and Distributed Systems, vol.9, issue.10, pp.972-986, 1998.
DOI : 10.1109/71.730527
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.4662
Distributed Diskless Checkpoint for Large Scale Systems, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pp.63-72, 2010.
DOI : 10.1109/CCGRID.2010.40
Enhancing checkpoint performance with staging IO and SSD, Proceedings of the 2010 International Workshop on Storage Network Architecture and Parallel I/Os, SNAPI '10, 2010.