On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications

Thomas Ropars 1 Amina Guermouche 2, 3 Bora Uçar 2, 4 Esteban Meneses 5 Laxmikant Kale 5 Franck Cappello 6, 7, 8
2 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
4 GRAAL - Algorithms and Scheduling for Distributed Heterogeneous Platforms
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
6 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract: Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message-passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchical rollback-recovery protocols based on the combination of coordinated checkpointing and message logging are an alternative. These partial message logging protocols are based on process clustering: only messages between clusters are logged, which limits the consequences of a failure to one cluster. These protocols work efficiently only if one can find clusters of processes in the application such that the ratio of logged messages is very low. We study the communication patterns of message-passing HPC applications to show that partial message logging is suitable in most cases. We propose a partitioning algorithm to find suitable clusters of processes given the communication pattern of an application. Finally, we evaluate the efficiency of partial message logging using two state-of-the-art protocols on a set of representative applications.
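The quantity the abstract turns on, the ratio of logged messages for a given clustering, can be illustrated with a minimal sketch. This is not the partitioning algorithm proposed in the paper; it only evaluates a clustering that is already given. The function name `logged_ratio`, the matrix layout (`comm[i][j]` as the volume sent from process `i` to process `j`), and the sample values are all assumptions made for illustration.

```python
def logged_ratio(comm, cluster_of):
    """Return the fraction of communicated volume that crosses cluster boundaries.

    comm[i][j]    -- volume sent from process i to process j (hypothetical layout)
    cluster_of[i] -- identifier of the cluster that process i belongs to
    """
    total = 0
    logged = 0
    for src, row in enumerate(comm):
        for dst, volume in enumerate(row):
            total += volume
            # Only messages between different clusters would be logged.
            if cluster_of[src] != cluster_of[dst]:
                logged += volume
    return logged / total if total else 0.0


# Made-up 4-process pattern: heavy traffic inside pairs (0,1) and (2,3),
# light traffic between the pairs.
comm = [
    [0, 10, 0, 1],
    [10, 0, 1, 0],
    [0, 1, 0, 10],
    [1, 0, 10, 0],
]

good_ratio = logged_ratio(comm, [0, 0, 1, 1])  # clusters follow the heavy pairs
bad_ratio = logged_ratio(comm, [0, 1, 0, 1])   # clusters split the heavy pairs
```

With the made-up matrix above, grouping the heavily communicating pairs together keeps most of the volume inside clusters, while splitting them sends every message across a cluster boundary. This is the sense in which the communication pattern determines whether partial message logging pays off.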
Document type:
Book chapter
Jeannot, Emmanuel and Namyst, Raymond and Roman, Jean. Euro-Par 2011 Parallel Processing, 6852, Springer Berlin / Heidelberg, pp.567-578, 2011

https://hal.inria.fr/hal-00786558
Contributor: Equipe Roma
Submitted on: Thursday, December 13, 2018 - 09:46:09
Last modified on: Monday, March 25, 2019 - 10:26:01
Document(s) archived on: Thursday, March 14, 2019 - 12:54:10

File

ClusterBased.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-00786558, version 1

Citation

Thomas Ropars, Amina Guermouche, Bora Uçar, Esteban Meneses, Laxmikant Kale, et al. On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications. Jeannot, Emmanuel and Namyst, Raymond and Roman, Jean. Euro-Par 2011 Parallel Processing, 6852, Springer Berlin / Heidelberg, pp.567-578, 2011. 〈hal-00786558〉
