From tasks graphs to asynchronous distributed checkpointing with local restart

Romain Lion; Samuel Thibault

doi:10.1109/FTXS51974.2020.00009

Conference paper, Year: 2020

Romain Lion

Role: Author

STatic Optimizations, Runtime Methods

Samuel Thibault

Role: Author
PersonId : 8135
IdHAL : samuel-thibault
ORCID : 0000-0001-6411-809X
IdRef : 12476486X

Université de Bordeaux

STatic Optimizations, Runtime Methods

Abstract

The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. However, with the increasing amount of data handled by current applications, these strategies suffer from unreasonable data transfer demands or from the global synchronizations they entail. Meanwhile, the current trend towards task-based programming is an opportunity to revisit the principles of checkpoint/restart strategies. We propose here a checkpointing scheme that is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distributed checkpointing, as well as localized node restart, thus opening the way to very large scalability. We also show how a synergy between the application data transfers and the checkpointing transfers can lead to a reasonable additional network load, measured to be lower than +10% on a dense linear algebra example.

Keywords

Fault tolerance; Task-based programming; Checkpoint-restart; Buddy in-memory

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

2020001221.pdf (245.95 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-02970529

Submitted on: Monday, October 19, 2020, 12:02:28

Last modified on: Friday, March 24, 2023, 14:53:19

Dates and versions

hal-02970529 , version 1 (18-10-2020)

hal-02970529 , version 2 (19-10-2020)

Identifiers

HAL Id : hal-02970529 , version 2
DOI : 10.1109/FTXS51974.2020.00009

Cite

Romain Lion, Samuel Thibault. From tasks graphs to asynchronous distributed checkpointing with local restart. FTXS 2020 - IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale, Nov 2020, Atlanta / Virtual, United States. ⟨10.1109/FTXS51974.2020.00009⟩. ⟨hal-02970529v2⟩

Collections

CNRS INRIA INRIA2 PLAFRIM

268 views

351 downloads