The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community, International Journal of High Performance Computing Applications, vol.23, issue.4, pp.309-322, 2009. ,
DOI : 10.1177/1094342009347714
Failure tolerance in petascale computers, Journal of Physics: Conference Series, p.12022, 2007. ,
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on, SC '11, 2011. ,
DOI : 10.1145/2063384.2063443
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002. ,
DOI : 10.1145/568522.568525
Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Proc. of Euro-Par'11 (II), pp.51-64, 2011. ,
DOI : 10.1007/978-3-642-23397-5_6
HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pp.1216-1227, 2012. ,
DOI : 10.1109/IPDPS.2012.111
URL : https://hal.archives-ouvertes.fr/hal-01121941
Unified model for assessing checkpointing protocols at extreme-scale, Concurrency and Computation: Practice and Experience, vol.9, issue.16, 2012. ,
DOI : 10.1002/cpe.3173
URL : https://hal.archives-ouvertes.fr/hal-00696154
Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers, vol.100, issue.6, pp.518-528, 1984. ,
Fault tolerant high performance computing by a coding approach, Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming , PPoPP '05, pp.213-223, 2005. ,
DOI : 10.1145/1065944.1065973
MPICH-V: a multiprotocol fault tolerant MPI, IJHPCA, vol.20, issue.3, pp.319-333, 2006. ,
DOI : 10.1177/1094342006067469
URL : https://hal.archives-ouvertes.fr/hal-00688637
Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization, 2013. ,
DOI : 10.1007/978-3-642-40047-6_43
URL : https://hal.archives-ouvertes.fr/hal-00926606
K computer: 8.162 PetaFLOPS massively parallel scalar supercomputer built with over 548k cores, 2012 IEEE International Solid-State Circuits Conference, pp.192-194, 2012. ,
DOI : 10.1109/ISSCC.2012.6176971
A Fault Tolerance Protocol with Fast Fault Recovery, 2007 IEEE International Parallel and Distributed Processing Symposium, pp.1-10, 2007. ,
DOI : 10.1109/IPDPS.2007.370310
Ftpa: Supporting fault-tolerant parallel computing through parallel recomputing. Parallel and Distributed Systems, IEEE Transactions on, vol.20, issue.10, pp.1471-1486, 2009. ,
Reevaluating Amdahl's law, Communications of the ACM, vol.31, issue.5, pp.532-533, 1988. ,
DOI : 10.1145/42411.42415
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.509.6892
The effectiveness of multiple hardware contexts, Proc. 6th int. conf. on Architectural support for programming languages and operating systems. ASPLOS VI, pp.328-337, 1994. ,
Performance evaluation of adaptive MPI, Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming , PPoPP '06, pp.12-21, 2006. ,
DOI : 10.1145/1122971.1122976
Hybrid Preemptive Scheduling of Message Passing Interface Applications on Grids, International Journal of High Performance Computing Applications, vol.20, issue.1, pp.77-90, 2006. ,
DOI : 10.1177/1094342006062526
A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Computer Systems, vol.22, issue.3, pp.303-312, 2004. ,
DOI : 10.1016/j.future.2004.11.016