Optimal Checkpointing Strategies for Iterative Applications

This work provides an optimal checkpointing strategy to protect iterative applications from fail-stop errors. We consider a general framework in which the application repeats the same execution pattern by executing consecutive iterations, each composed of several tasks. These tasks have different execution lengths and different checkpoint costs. Assume that there are $n$ tasks and that task $a_i$, where $0 \le i < n$, has execution time $t_i$ and checkpoint cost $c_i$. A naive strategy would checkpoint after each task. Another naive strategy would checkpoint at the end of each iteration. A strategy inspired by the Young/Daly formula would work for $\sqrt{2 \mu c_{ave}}$ seconds, where $\mu$ is the application's mean time between failures (MTBF) and $c_{ave}$ is the average checkpoint time, then checkpoint at the end of the current task, and repeat. Another strategy, also inspired by the Young/Daly formula, would select the task $a_{min}$ with the smallest checkpoint cost $c_{min}$ and checkpoint after every $p$-th instance of that task, leading to a checkpointing period $pT$, where $T = \sum_{i=0}^{n-1} t_i$ is the time per iteration. One would choose the period so that $pT \approx \sqrt{2 \mu c_{min}}$ to obey the Young/Daly formula. All these naive and Young/Daly strategies are suboptimal. Our main contribution is to show that the optimal checkpointing strategy is globally periodic, and to design a dynamic programming algorithm that computes the optimal checkpointing pattern. This pattern may checkpoint many different tasks, across many different iterations. We show through simulations, on both synthetic and real-life application scenarios, that the optimal strategy outperforms the naive and Young/Daly strategies.
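The two Young/Daly-inspired baselines described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's dynamic programming algorithm, which the abstract does not detail. The function names, the numeric values, and the choice of rounding $p$ to the nearest integer (at least 1) are assumptions made for illustration only.

```python
import math


def young_daly_period(mtbf, checkpoint_cost):
    """First-order Young/Daly period: sqrt(2 * mu * c)."""
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def average_cost_period(mtbf, checkpoint_costs):
    """Work duration before checkpointing at the end of the current task,
    using the average checkpoint cost c_ave."""
    c_ave = sum(checkpoint_costs) / len(checkpoint_costs)
    return young_daly_period(mtbf, c_ave)


def min_cost_strategy(mtbf, exec_times, checkpoint_costs):
    """Return (i_min, p): the index of the task with the smallest checkpoint
    cost, and the number of iterations p such that p * T is close to
    sqrt(2 * mu * c_min). Rounding to the nearest integer (at least 1) is an
    assumption of this sketch, not something stated in the abstract."""
    i_min = min(range(len(checkpoint_costs)), key=checkpoint_costs.__getitem__)
    iteration_time = sum(exec_times)  # T = sum of the t_i
    target = young_daly_period(mtbf, checkpoint_costs[i_min])
    p = max(1, round(target / iteration_time))
    return i_min, p


# Hypothetical values, in seconds, chosen only to make the arithmetic easy.
exec_times = [10.0, 20.0, 30.0]       # t_0, t_1, t_2
checkpoint_costs = [4.0, 1.0, 7.0]    # c_0, c_1, c_2
mtbf = 7200.0                         # mu

work_before_checkpoint = average_cost_period(mtbf, checkpoint_costs)
i_min, p = min_cost_strategy(mtbf, exec_times, checkpoint_costs)
print(work_before_checkpoint, i_min, p)
```

With these hypothetical values, $c_{ave} = 4$ gives a work duration of $\sqrt{2 \cdot 7200 \cdot 4} = 240$ seconds, and $c_{min} = 1$ with $T = 60$ gives a target of 120 seconds, hence $p = 2$.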

Keywords

Iterative application, checkpoint strategy, fail-stop error, resilience

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

tpds-complete.pdf (2.53 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-03338278

Submitted on: Monday, September 27, 2021, 10:55:05

Last modified on: Monday, July 24, 2023, 09:28:41

Dates and versions

hal-03338278, version 1 (08-09-2021)

hal-03338278, version 2 (27-09-2021)

License

Attribution

Identifiers

HAL Id: hal-03338278, version 2
DOI: 10.1109/TPDS.2021.3099440

Cite

Yishu Du, Loris Marchal, Guillaume Pallez, Yves Robert. Optimal Checkpointing Strategies for Iterative Applications. IEEE Transactions on Parallel and Distributed Systems, 2022, 33 (3), pp.507-522. ⟨10.1109/TPDS.2021.3099440⟩. ⟨hal-03338278v2⟩

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 JLESC INRIA2 UDL
