O. Aumage, E. Brunet, G. Mercier, and R. Namyst, High-Performance Multi-Rail Support with the NEWMADELEINE Communication Library, 2007 IEEE International Parallel and Distributed Processing Symposium, 2007.
DOI : 10.1109/IPDPS.2007.370332

URL : https://hal.archives-ouvertes.fr/inria-00126254

M. Bertier, O. Marin, and P. Sens, Implementation and performance evaluation of an adaptable failure detector, Proceedings International Conference on Dependable Systems and Networks, 2002.
DOI : 10.1109/DSN.2002.1028920

G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak et al., MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes, ACM/IEEE SC 2002 Conference (SC'02), 2002.
DOI : 10.1109/SC.2002.10048

URL : https://hal.archives-ouvertes.fr/in2p3-00457138

F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, International Journal of High Performance Computing Applications, vol.23, issue.4, 2009.
DOI : 10.1177/1094342009347767

G. Fagg and J. Dongarra, FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2000.

R. Graham, S. Choi, D. Daniel, N. Desai, R. Minnich et al., A network-failure-tolerant message-passing system for terascale clusters, Proceedings of the 16th international conference on Supercomputing, ICS '02, p.31, 2003.
DOI : 10.1145/514191.514205

M. Koop, R. Kumar, and D. Panda, Can software reliability outperform hardware reliability on high performance interconnects?, Proceedings of the 22nd annual international conference on Supercomputing, ICS '08, 2008.
DOI : 10.1145/1375527.1375551

M. Koop, P. Shamis, I. Rabinovitz, and D. Panda, Designing high-performance and resilient message passing on InfiniBand, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010.
DOI : 10.1109/IPDPSW.2010.5470856

T. C. Maxino and P. J. Koopman, The Effectiveness of Checksums for Embedded Control Networks, IEEE Transactions on Dependable and Secure Computing, vol.6, issue.1, 2009.
DOI : 10.1109/TDSC.2007.70216

M. Menth and R. Martin, Network resilience through multi-topology routing, Proceedings of the 5th International Workshop on Design of Reliable Communication Networks (DRCN 2005), 2005.
DOI : 10.1109/DRCN.2005.1563878

T. Okamoto, S. Miura, T. Boku, M. Sato, and D. Takahashi, RI2N/UDP: High bandwidth and fault-tolerant network for a PC-cluster based on multi-link Ethernet, 2007 IEEE International Parallel and Distributed Processing Symposium, 2007.
DOI : 10.1109/IPDPS.2007.370477

G. Shipman, R. Graham, and G. Bosilca, Network fault tolerance in Open MPI. Euro-Par, 2007.

F. Trahay and A. Denis, A scalable and generic task scheduling system for communication libraries, 2009 IEEE International Conference on Cluster Computing and Workshops, 2009.
DOI : 10.1109/CLUSTR.2009.5289169

URL : https://hal.archives-ouvertes.fr/inria-00408521

F. Trahay, A. Denis, O. Aumage, and R. Namyst, Improving reactivity and communication overlap in MPI using a generic I/O manager. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2007.
URL : https://hal.archives-ouvertes.fr/inria-00177167

W. Zwaenepoel and D. Johnson, Sender-based message logging, 17th International Symposium on Fault-Tolerant Computing