Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology, Parallel and Distributed Processing and Applications ISPA, pp.471-482, 2007. ,
DOI : 10.1007/978-3-540-74742-0_43
Performance analysis of a hierarchical failure detector, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings, pp.635-644, 2003. ,
DOI : 10.1109/DSN.2003.1209973
Efficient collective communication in distributed heterogeneous systems, Journal of Parallel and Distributed Computing, vol.63, issue.3, pp.251-263, 2003. ,
DOI : 10.1016/S0743-7315(03)00008-X
Post-failure recovery of MPI communication capability: Design and rationale, International Journal of High Performance Computing Applications, vol.27, issue.3, pp.244-254, 2013. ,
DOI : 10.1177/1094342013488238
An evaluation of user-level failure mitigation support in MPI, Computing, vol.95, issue.12, pp.1171-1184, 2013. ,
Lessons Learned Implementing User-Level Failure Mitigation in MPICH, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015. ,
DOI : 10.1109/CCGrid.2015.51
Unreliable failure detectors for reliable distributed systems, Journal of the ACM, vol.43, issue.2, pp.225-267, 1996. ,
DOI : 10.1145/226643.226647
On the quality of service of failure detectors, IEEE Transactions on Computers, vol.51, issue.5, pp.561-580, 2002. ,
DOI : 10.1109/TC.2002.1004595
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, The Journal of Supercomputing, vol.65, issue.3, pp.1302-1326, 2013. ,
DOI : 10.1007/s11227-013-0884-0
Characterizing application sensitivity to OS interference using kernel-level noise injection, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, 2008. ,
DOI : 10.1109/SC.2008.5219920
Asymptotically optimal broadcasting and gossiping in faulty hypercube multicomputers, IEEE Transactions on Computers, vol.41, issue.11, pp.1410-1419, 1992. ,
DOI : 10.1109/12.177311
On scalable and efficient distributed failure detectors, Proceedings of the twentieth annual ACM symposium on Principles of distributed computing , PODC '01, pp.170-179, 2001. ,
DOI : 10.1145/383962.384010
Failure detectors for large-scale distributed systems, 21st Symposium on Reliable Distributed Systems, pp.13-16, 2002. ,
Practical scalable consensus for pseudosynchronous distributed systems, Proc. SC'15, 2015. ,
Cayley graphs and interconnection networks, Graph Symmetry: Algebraic Methods and Applications, pp.167-224, 1997. ,
DOI : 10.1007/978-94-015-8937-6_5
Characterizing the Influence of System Noise on Large-Scale Applications by Simulation, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010. ,
DOI : 10.1109/SC.2010.12
A scalable and efficient self-organizing failure detector for grid applications, Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing, GRID '05, pp.202-210, 2005. ,
Scalable and Fault Tolerant Failure Detection and Consensus, Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI '15, 2015. ,
DOI : 10.1145/2802658.2802660
Assessing HPC Failure Detectors for MPI Jobs, 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2012. ,
DOI : 10.1109/PDP.2012.11
Fault diameter of interconnection networks, Computers & Mathematics with Applications, vol.13, issue.5-6, pp.577-582, 1987. ,
DOI : 10.1016/0898-1221(87)90085-X
Optimal implementation of the weakest failure detector for solving consensus, Proceedings 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000, pp.52-59, 2000. ,
DOI : 10.1109/RELDI.2000.885392
Fault-tolerant routing in circulant networks and cycle prefix networks, Annals of Combinatorics, vol.2, issue.2, pp.165-172, 1998. ,
DOI : 10.1007/BF01608486
Fault-tolerant broadcasting and gossiping in communication networks, Networks, vol.28, issue.3, pp.143-156, 1996. ,
DOI : 10.1002/(SICI)1097-0037(199610)28:3<143::AID-NET3>3.0.CO;2-N
Reliable broadcast in hypercube multicomputers, IEEE Transactions on Computers, vol.37, issue.12, pp.1654-1657, 1988. ,
DOI : 10.1109/12.9743
Oak Ridge National Laboratory, 2016. ,
Design and Implementation of a Scalable Membership Service for Supercomputer Resiliency-Aware Runtime, Euro-Par 2013 Parallel Processing - 19th International Conference. Proceedings, pp.354-366, 2013. ,
DOI : 10.1007/978-3-642-40047-6_37
A Gossip-Style Failure Detection Service, Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, Middleware '98, pp.55-70, 1998. ,
DOI : 10.1007/978-1-4471-1283-9_4
Intelligent platform management interface (IPMI), 2009. ,
DOI : 10.2172/1104721