T. D. Chandra and S. Toueg, Unreliable failure detectors for reliable distributed systems, Journal of the ACM, vol.43, issue.2, pp.225-267, 1996.
DOI : 10.1145/226643.226647

I. P. Egwutuoha, D. Levy, B. Selic, and S. Chen, A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, The Journal of Supercomputing, vol.65, issue.3, pp.1302-1326, 2013.
DOI : 10.1007/s11227-013-0884-0

T. Herault, A. Bouteiller, G. Bosilca, M. Gamell, K. Teranishi et al., Practical scalable consensus for pseudo-synchronous distributed systems, Proc. SC'15, 2015.

G. Bosilca, A. Bouteiller, A. Guermouche, T. Herault, Y. Robert et al., An evaluation of user-level failure mitigation support in MPI, pp.95-1171, 2013.

W. Bland, H. Lu, S. Seo, and P. Balaji, Lessons Learned Implementing User-Level Failure Mitigation in MPICH, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015.
DOI : 10.1109/CCGrid.2015.51

A. Katti, G. Di Fatta, T. Naughton, and C. Engelmann, Scalable and Fault Tolerant Failure Detection and Consensus, Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI '15, 2015.
DOI : 10.1145/2802658.2802660

W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra, Post-failure recovery of MPI communication capability: Design and rationale, International Journal of High Performance Computing Applications, vol.27, issue.3, pp.244-254, 2013.
DOI : 10.1177/1094342013488238

P. B. Bhat, C. S. Raghavendra, and V. K. Prasanna, Efficient collective communication in distributed heterogeneous systems, Journal of Parallel and Distributed Computing, vol.63, issue.3, pp.251-263, 2003.
DOI : 10.1016/S0743-7315(03)00008-X

URL : http://ceng.usc.edu/~prasanna/pubs/bhat-icdcs-99.ps

K. B. Ferreira, P. Bridges, and R. Brightwell, Characterizing application sensitivity to OS interference using kernel-level noise injection, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, 2008.
DOI : 10.1109/SC.2008.5219920

T. Hoefler, T. Schneider, and A. Lumsdaine, Characterizing the Influence of System Noise on Large-Scale Applications by Simulation, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.
DOI : 10.1109/SC.2010.12

K. Kharbas, D. Kim, T. Hoefler, and F. Mueller, Assessing HPC Failure Detectors for MPI Jobs, 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2012.
DOI : 10.1109/PDP.2012.11

W. Chen, S. Toueg, and M. K. Aguilera, On the quality of service of failure detectors, IEEE Transactions on Computers, vol.51, issue.5, pp.561-580, 2002.
DOI : 10.1109/TC.2002.1004595

P. Ramanathan and K. G. Shin, Reliable broadcast in hypercube multicomputers, IEEE Transactions on Computers, vol.37, issue.12, pp.1654-1657, 1988.
DOI : 10.1109/12.9743

I. Gupta, T. D. Chandra, and G. S. Goldszmidt, On scalable and efficient distributed failure detectors, Proceedings of the twentieth annual ACM symposium on Principles of distributed computing, PODC '01, pp.170-179, 2001.
DOI : 10.1145/383962.384010

A. Das, I. Gupta, and A. Motivala, SWIM: scalable weakly-consistent infection-style process group membership protocol, Proceedings International Conference on Dependable Systems and Networks, pp.303-312, 2002.
DOI : 10.1109/DSN.2002.1028914

S. Snyder, P. H. Carns, J. Jenkins, K. Harms, R. B. Ross et al., A Case for Epidemic Fault Detection and Group Membership in HPC Storage Systems, 5th Int. Workshop on Performance Modeling, Benchmarking, and Simulation (PMBS), pp.237-248, 2014.
DOI : 10.1007/978-3-319-17248-4_12

M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.
DOI : 10.1017/CBO9780511813603

D. S. Wung, Intelligent platform management interface (IPMI), 2009.
DOI : 10.2172/1104721

URL : http://www.osti.gov/scitech/servlets/purl/1104721

M. Bertier, O. Marin, and P. Sens, Performance analysis of a hierarchical failure detector, Proceedings of the 2003 International Conference on Dependable Systems and Networks, pp.635-644, 2003.
DOI : 10.1109/DSN.2003.1209973

M. Larrea, A. Fernández, and S. Arévalo, Optimal implementation of the weakest failure detector for solving consensus, Proceedings 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000, pp.52-59, 2000.
DOI : 10.1109/RELDI.2000.885392

R. van Renesse, Y. Minsky, and M. Hayden, A Gossip-Style Failure Detection Service, Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, Middleware '98, pp.55-70, 1998.
DOI : 10.1007/978-1-4471-1283-9_4

N. Hayashibara, A. Cherif, and T. Katayama, Failure detectors for large-scale distributed systems, Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems, pp.13-16, 2002.
DOI : 10.1109/RELDIS.2002.1180218

Y. Horita, K. Taura, and T. Chikayama, A scalable and efficient self-organizing failure detector for grid applications, The 6th IEEE/ACM International Workshop on Grid Computing, pp.202-210, 2005.
DOI : 10.1109/GRID.2005.1542743

Y. Tock, B. Mandler, J. E. Moreira, and T. Jones, Design and Implementation of a Scalable Membership Service for Supercomputer Resiliency-Aware Runtime, Euro-Par 2013 Parallel Processing - 19th International Conference, 2013.
DOI : 10.1007/978-3-642-40047-6_37

A. Pelc, Fault-tolerant broadcasting and gossiping in communication networks, Networks, vol.28, issue.3, pp.143-156, 1996.
DOI : 10.1002/(SICI)1097-0037(199610)28:3<143::AID-NET3>3.0.CO;2-N

M. Heydemann, Cayley graphs and interconnection networks, Graph Symmetry: Algebraic Methods and Applications, pp.167-224, 1997.
DOI : 10.1007/978-94-015-8937-6_5

M. Krishnamoorthy and B. Krishnamurthy, Fault diameter of interconnection networks, Computers & Mathematics with Applications, vol.13, issue.5-6, pp.577-582, 1987.
DOI : 10.1016/0898-1221(87)90085-X

P. Fraigniaud, Asymptotically optimal broadcasting and gossiping in faulty hypercube multicomputers, IEEE Transactions on Computers, vol.41, issue.11, pp.1410-1419, 1992.
DOI : 10.1109/12.177311

T. Angskun, G. Bosilca, and J. Dongarra, Binomial graph: A scalable and fault-tolerant logical network topology, Parallel and Distributed Processing and Applications (ISPA), pp.471-482, 2007.

S. Liaw, G. J. Chang, F. Cao, and D. F. Hsu, Fault-tolerant routing in circulant networks and cycle prefix networks, Annals of Combinatorics, vol.2, issue.2, pp.165-172, 1998.
DOI : 10.1007/BF01608486
