HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications

Abstract: High-performance computing will probably reach exascale in this decade. At this scale, the mean time between failures is expected to be a few hours. Existing fault-tolerant protocols for message-passing applications will no longer be efficient, since they either require a global restart after a failure (checkpointing protocols) or result in huge memory occupation (message logging). Hybrid fault-tolerant protocols overcome these limits by dividing application processes into clusters and applying a different protocol within and between clusters. Combining coordinated checkpointing inside the clusters with message logging for inter-cluster messages confines the consequences of a failure to a single cluster, while logging only a subset of the messages. However, existing hybrid protocols require event logging for all application messages to ensure correct execution after a failure, which can significantly impair failure-free performance. In this paper, we propose HydEE, a hybrid rollback-recovery protocol for send-deterministic message-passing applications that provides failure containment without logging any event, while logging only a subset of the application messages. We prove that HydEE can handle multiple concurrent failures by relying on the send-deterministic execution model. Experimental evaluations of our implementation of HydEE in the MPICH2 library show that it introduces almost no overhead on failure-free execution.
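
The clustering idea described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HydEE's implementation: the class name `HybridLoggingSketch`, its methods, and the four-process layout are hypothetical, checkpoints are not modelled, and the replay list is not restricted to messages sent after the last checkpoint. It shows only two things: which messages a sender would keep in memory (those crossing a cluster boundary), and which processes would roll back when one process fails (the members of its cluster).

```python
from collections import defaultdict


class HybridLoggingSketch:
    """Toy model of a hybrid protocol: only inter-cluster payloads are logged."""

    def __init__(self, cluster_of):
        # cluster_of maps a process rank to a cluster id (hypothetical layout).
        self.cluster_of = dict(cluster_of)
        # Sender-side payload log: sender rank -> list of (destination, payload).
        self.sender_log = defaultdict(list)
        self.sent = 0

    def send(self, src, dst, payload):
        """Record a send; keep the payload only if it crosses clusters."""
        self.sent += 1
        if self.cluster_of[src] != self.cluster_of[dst]:
            self.sender_log[src].append((dst, payload))

    def logged_count(self):
        """Number of payloads currently held in sender-side logs."""
        return sum(len(entries) for entries in self.sender_log.values())

    def rollback_set(self, failed):
        """Ranks that restart when `failed` crashes: its whole cluster, no more."""
        cluster = self.cluster_of[failed]
        return sorted(r for r, c in self.cluster_of.items() if c == cluster)

    def messages_to_replay(self, failed):
        """Logged messages from surviving clusters into the restarted cluster.

        Simplification: a real protocol would replay only messages that the
        restarted processes have not already covered by their checkpoint.
        """
        members = set(self.rollback_set(failed))
        return [
            (src, dst, payload)
            for src, entries in sorted(self.sender_log.items())
            if src not in members
            for dst, payload in entries
            if dst in members
        ]


if __name__ == "__main__":
    model = HybridLoggingSketch({0: "A", 1: "A", 2: "B", 3: "B"})
    model.send(0, 1, "x")  # intra-cluster: not logged
    model.send(1, 2, "y")  # inter-cluster: logged by sender 1
    model.send(3, 0, "z")  # inter-cluster: logged by sender 3
    model.send(2, 3, "w")  # intra-cluster: not logged
    print(model.logged_count(), "of", model.sent, "messages logged")
    print("rollback set if rank 1 fails:", model.rollback_set(1))
    print("replayed to the restarted cluster:", model.messages_to_replay(1))
```

In this toy layout, a failure of rank 1 restarts only cluster A, while cluster B keeps running and resends the one message it had logged for a process in A. The sketch does not model how recovery stays correct without event logging; per the abstract, that relies on the send-deterministic execution model.
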
Document type:
Conference paper
, Shanghai, China. 2012, 〈10.1109/IPDPS.2012.111〉

Contributor: Thomas Ropars
Submitted on: Monday, March 2, 2015 - 21:26:20
Last modified on: Friday, February 23, 2018 - 13:42:39
Document(s) archived on: Tuesday, June 2, 2015 - 09:55:51


Files produced by the author(s)




Amina Guermouche, Thomas Ropars, Marc Snir, Franck Cappello. HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications. , Shanghai, China. 2012, 〈10.1109/IPDPS.2012.111〉. 〈hal-01121941〉


