T. Angskun, G. Bosilca, and J. Dongarra, Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology, Parallel and Distributed Processing and Applications ISPA, pp.471-482, 2007.
DOI : 10.1007/978-3-540-74742-0_43

M. Bertier, O. Marin, and P. Sens, Performance analysis of a hierarchical failure detector, Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp.635-644, 2003.
DOI : 10.1109/DSN.2003.1209973

P. B. Bhat, C. S. Raghavendra, and V. K. Prasanna, Efficient collective communication in distributed heterogeneous systems, Journal of Parallel and Distributed Computing, vol.63, issue.3, pp.251-263, 2003.
DOI : 10.1016/S0743-7315(03)00008-X

W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, Post-failure recovery of MPI communication capability: Design and rationale, International Journal of High Performance Computing Applications, vol.27, issue.3, pp.244-254, 2013.
DOI : 10.1177/1094342013488238

W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca et al., An evaluation of user-level failure mitigation support in MPI, Computing, vol.95, issue.12, pp.1171-1184, 2013.

W. Bland, H. Lu, S. Seo, and P. Balaji, Lessons Learned Implementing User-Level Failure Mitigation in MPICH, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015.
DOI : 10.1109/CCGrid.2015.51

T. D. Chandra and S. Toueg, Unreliable failure detectors for reliable distributed systems, Journal of the ACM, vol.43, issue.2, pp.225-267, 1996.
DOI : 10.1145/226643.226647

W. Chen, S. Toueg, and M. K. Aguilera, On the quality of service of failure detectors, IEEE Transactions on Computers, vol.51, issue.5, pp.561-580, 2002.
DOI : 10.1109/TC.2002.1004595

I. P. Egwutuoha, D. Levy, B. Selic, and S. Chen, A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, The Journal of Supercomputing, vol.65, issue.3, pp.1302-1326, 2013.
DOI : 10.1007/s11227-013-0884-0

K. B. Ferreira, P. Bridges, and R. Brightwell, Characterizing application sensitivity to OS interference using kernel-level noise injection, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, 2008.
DOI : 10.1109/SC.2008.5219920

P. Fraigniaud, Asymptotically optimal broadcasting and gossiping in faulty hypercube multicomputers, IEEE Transactions on Computers, vol.41, issue.11, pp.1410-1419, 1992.
DOI : 10.1109/12.177311

I. Gupta, T. D. Chandra, and G. S. Goldszmidt, On scalable and efficient distributed failure detectors, Proceedings of the twentieth annual ACM symposium on Principles of distributed computing, PODC '01, pp.170-179, 2001.
DOI : 10.1145/383962.384010

N. Hayashibara, A. Cherif, and T. Katayama, Failure detectors for large-scale distributed systems, 21st Symposium on Reliable Distributed Systems, pp.13-16, 2002.

T. Herault, A. Bouteiller, G. Bosilca, M. Gamell, K. Teranishi et al., Practical scalable consensus for pseudosynchronous distributed systems, Proc. SC'15, 2015.

M. Heydemann, Cayley graphs and interconnection networks, Graph Symmetry: Algebraic Methods and Applications, pp.167-224, 1997.
DOI : 10.1007/978-94-015-8937-6_5

T. Hoefler, T. Schneider, and A. Lumsdaine, Characterizing the Influence of System Noise on Large-Scale Applications by Simulation, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.
DOI : 10.1109/SC.2010.12

Y. Horita, K. Taura, and T. Chikayama, A scalable and efficient self-organizing failure detector for grid applications, Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing, GRID '05, pp.202-210, 2005.

A. Katti, G. Di Fatta, T. Naughton, and C. Engelmann, Scalable and Fault Tolerant Failure Detection and Consensus, Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI '15, 2015.
DOI : 10.1145/2802658.2802660

K. Kharbas, D. Kim, T. Hoefler, and F. Mueller, Assessing HPC Failure Detectors for MPI Jobs, 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2012.
DOI : 10.1109/PDP.2012.11

M. Krishnamoorthy and B. Krishnamurthy, Fault diameter of interconnection networks, Computers & Mathematics with Applications, vol.13, issue.5-6, pp.577-582, 1987.
DOI : 10.1016/0898-1221(87)90085-X

M. Larrea, A. Fernández, and S. Arévalo, Optimal implementation of the weakest failure detector for solving consensus, Proceedings 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000, pp.52-59, 2000.
DOI : 10.1109/RELDI.2000.885392

S. Liaw, G. J. Chang, F. Cao, and D. F. Hsu, Fault-tolerant routing in circulant networks and cycle prefix networks, Annals of Combinatorics, vol.2, issue.2, pp.165-172, 1998.
DOI : 10.1007/BF01608486

A. Pelc, Fault-tolerant broadcasting and gossiping in communication networks, Networks, vol.28, issue.3, pp.143-156, 1996.
DOI : 10.1002/(SICI)1097-0037(199610)28:3<143::AID-NET3>3.0.CO;2-N

P. Ramanathan and K. G. Shin, Reliable broadcast in hypercube multicomputers, IEEE Transactions on Computers, vol.37, issue.12, pp.1654-1657, 1988.
DOI : 10.1109/12.9743

Titan, Oak Ridge National Laboratory, 2016.

Y. Tock, B. Mandler, J. E. Moreira, and T. Jones, Design and Implementation of a Scalable Membership Service for Supercomputer Resiliency-Aware Runtime, Euro-Par 2013 Parallel Processing - 19th International Conference. Proceedings, pp.354-366, 2013.
DOI : 10.1007/978-3-642-40047-6_37

R. van Renesse, Y. Minsky, and M. Hayden, A Gossip-Style Failure Detection Service, Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, Middleware '98, pp.55-70, 1998.
DOI : 10.1007/978-1-4471-1283-9_4

D. S. Wung, Intelligent platform management interface (IPMI), 2009.
DOI : 10.2172/1104721