Unified Model for Assessing Checkpointing Protocols at Extreme-Scale - Inria - Institut national de recherche en sciences et technologies du numérique
Report (Research Report) Year: 2012

Unified Model for Assessing Checkpointing Protocols at Extreme-Scale

George Bosilca
  • Role: Author
  • PersonId : 863939
Elisabeth Brunet
Thomas Hérault
  • Role: Author
  • PersonId : 833735

Abstract

In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space: the coordinated checkpoint at one end, and a variety of uncoordinated checkpoint strategies (with message logging) at the other. We identify a set of parameters that are crucial to instantiate and compare the expected efficiency of the fault-tolerant protocols for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. This comparison outlines the relative behaviors of checkpoint strategies at scale, thereby providing insight that is hardly accessible to direct experimentation.
Main file
RR-7950.pdf (3.12 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-00696154, version 1 (11-05-2012)
hal-00696154, version 2 (08-10-2012)

Identifiers

  • HAL Id: hal-00696154, version 1

Cite

George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack Dongarra, et al. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. [Research Report] RR-7950, 2012. ⟨hal-00696154v1⟩

Collections

INRIA-RRRT
369 Views
393 Downloads
