An assumed partition algorithm for determining processor inter-communication, Parallel Computing, vol.32, issue.5-6, pp.394-414, 2006.
DOI : 10.1016/j.parco.2006.06.009
FTI, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
DOI : 10.1145/2063384.2063427
URL : https://hal.archives-ouvertes.fr/hal-00721216
Hierarchical Clustering Strategies for Fault Tolerance in Large Scale HPC Systems, IEEE Cluster 2012, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01121947
Redesigning the Message Logging Model for High Performance, Concurrency and Computation: Practice and Experience, pp.2196-2211, 2010.
Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI, 19th IEEE International Parallel and Distributed Processing Symposium, p.97, 2005.
DOI : 10.1109/IPDPS.2005.249
Correlated Set Coordination in Fault Tolerant Message Logging Protocols, Proceedings of the 17th international conference on Parallel processing, Euro-Par'11, pp.51-64, 2011.
DOI : 10.1007/978-3-642-23397-5_6
Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure recovery, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289157
URL : https://hal.archives-ouvertes.fr/inria-00424017
On Communication Determinism in Parallel HPC Applications, 2010 Proceedings of 19th International Conference on Computer Communications and Networks, 2010.
DOI : 10.1109/ICCCN.2010.5560143
Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems, IEEE/ACM SuperComputing 2012, SC '12, pp.58:1-58:11, 2012.
The International Exascale Software Project roadmap, International Journal of High Performance Computing Applications, vol.25, issue.1, pp.3-60, 2011.
DOI : 10.1177/1094342010391989
Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems, Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface, EuroMPI'10, pp.11-20, 2010.
DOI : 10.1007/978-3-642-15646-5_2
Algorithm-based fault tolerance for dense matrix factorizations, Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, pp.225-234, 2012.
System Resilience at Extreme Scale, 2008.
A survey of rollback-recovery protocols in message-passing systems, ACM Computing Surveys, vol.34, issue.3, pp.375-408, 2002.
DOI : 10.1145/568522.568525
Evaluating the viability of process replication reliability for exascale systems, Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pp.44:1-44:12, 2011.
DOI : 10.1145/2063384.2063443
Detection and correction of silent data corruption for large-scale high-performance computing, IEEE/ACM SuperComputing 2012, pp.78:1-78:12, 2012.
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic Message Passing Applications, 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS2011), 2011.
HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.
DOI : 10.1109/IPDPS.2012.111
URL : https://hal.archives-ouvertes.fr/hal-01121941
BoomerAMG: A parallel algebraic multigrid solver and preconditioner, Applied Numerical Mathematics, vol.41, issue.1, pp.155-177, 2002.
DOI : 10.1016/S0168-9274(01)00115-5
Sender-Based Message Logging, Digest of Papers: The 17th Annual International Symposium on Fault-Tolerant Computing, pp.14-19, 1987.
Checkpointing and Rollback-Recovery for Distributed Systems, Proceedings of 1986 ACM Fall joint computer conference, ACM '86, pp.1150-1158, 1986.
DOI : 10.1109/TSE.1987.232562
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.1616
Time, clocks, and the ordering of events in a distributed system, Communications of the ACM, vol.21, issue.7, pp.558-565, 1978.
DOI : 10.1145/359545.359563
Patterns for Parallel Programming, 2004.
Team-Based Message Logging: Preliminary Results, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010.
DOI : 10.1109/CCGRID.2010.110
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2010.
DOI : 10.1109/SC.2010.18
Modeling the Impact of Checkpoints on Next-Generation Systems, 24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007), pp.30-46, 2007.
DOI : 10.1109/MSST.2007.4367962
Alleviating scalability issues of checkpointing protocols, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, pp.18:1-18:11, 2012.
DOI : 10.1109/SC.2012.18
On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications, Proceedings of the 17th international conference on Parallel processing, Euro-Par'11, pp.567-578, 2011.
DOI : 10.1002/cpe.1364
URL : https://hal.archives-ouvertes.fr/hal-00786558
Active optimistic and distributed message logging for message-passing applications, Concurrency and Computation: Practice and Experience, pp.2167-2178, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00727470