Revisiting the double checkpointing algorithm

Jack Dongarra; Thomas Herault; Yves Robert

Report (Research Report) Year: 2012

Jack Dongarra

Role: Author
PersonId: 863940

Innovative Computing Laboratory [Knoxville]

Thomas Herault

Role: Author
PersonId: 833735

Innovative Computing Laboratory [Knoxville]

Yves Robert

Role: Author
PersonId: 739318
IdHAL: yves-robert
ORCID: 0000-0003-2361-055X
IdRef: 029813611

Laboratoire de l'Informatique du Parallélisme

Optimisation des ressources : modèles, algorithmes et ordonnancement

Abstract

Fast checkpointing algorithms require distributed access to stable storage. This paper revisits the approach based upon double checkpointing, and compares the blocking algorithm of Zheng, Shi and Kalé with the non-blocking algorithm of Ni, Meneses and Kalé in terms of both performance and risk. We also extend the model that they have proposed to assess the impact of the overhead associated with non-blocking communications. We then provide a new peer-to-peer checkpointing algorithm, called the triple checkpointing algorithm, that can work at constant memory, and achieves both higher efficiency and better risk handling than the double checkpointing algorithm. We provide performance and risk models for all the evaluated protocols, and compare them through comprehensive simulations.

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

RR-8196.pdf (1.27 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-00768491

Submitted on: Friday, December 21, 2012, 16:25:22

Last modified on: Thursday, May 11, 2023, 11:56:10

Long-term archiving on: Sunday, December 18, 2016, 08:10:54

Dates and versions

hal-00768491, version 1 (21-12-2012)

Identifiers

HAL Id: hal-00768491, version 1

Cite

Jack Dongarra, Thomas Herault, Yves Robert. Revisiting the double checkpointing algorithm. [Research Report] RR-8196, 2012. ⟨hal-00768491⟩

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA-RRRT INRIA2 LARA UDL

132 Views

168 Downloads
