Inovallée, 655 avenue de l'Europe, Montbonnot, 38334 Saint Ismier Cedex. Publisher: Inria, Domaine de Voluceau - Rocquencourt, BP 105, 78153 Le Chesnay Cedex. ISSN 0249-6399