Optimal Checkpointing Strategies for Iterative Applications

This work provides an optimal checkpointing strategy to protect iterative applications from fail-stop errors. We consider a very general framework, where the application repeats the same execution pattern by executing consecutive iterations, and where each iteration is composed of several tasks. These tasks have different execution lengths and different checkpoint costs. Assume that there are $n$ tasks and that task $a_i$, where $0 ≤ i < n$, has execution time $t_i$ and checkpoint cost $C_i$. A naive strategy would checkpoint after each task. A strategy inspired by the Young/Daly formula would select the task $a_{min}$ with smallest checkpoint cost $C_{min}$ and would checkpoint after every $p^{th}$ instance of that task, leading to a checkpointing period $P_{YD} = pT$ where $\textit{T}{\sum_{i=0}^{n-1}a_i}$ is the time per iteration. One would choose the period so that $P_{YD} \approx \sqrt{ 2{\mu}C_\textrm{min}}$ to obey the Young/Daly formula, where $\mu$ is the application MTBF. Both the naive and Young/Daly strategies are suboptimal. Our main contribution is to show that the optimal checkpoint strategy is globally periodic, and to design a dynamic programming algorithm that computes the optimal checkpointing pattern. This pattern may well checkpoint many different tasks, and this across many different iterations. We show through simulations, both from synthetic and real-life application scenarios, that the optimal strategy significantly outperforms the naive and Young/Daly strategies.

Ce rapport propose une stratégie de checkpoint optimale pour protéger les applications itératives des erreurs fatales. On s’intéresse à un cadre général, où l’application répète le même schéma et exécute des itérations consécutives, et où chaque itération est composée de plusieurs tâches. Ces tâches ont des longueurs différentes, et des coûts de checkpoint différents également. Supposons avoir $n$ tâches, et que la tâche $a_i$, pour 0 ≤ $i < n$, a un temps d’exécution $t_i$ et un coût de checkpoint $C_i$. Une stratégie naïve prendrait un checkpoint après chaque tâche. Une stratégie inspirée par la formule de Young/Daly choisirait la tâche $a_{min}$ dont le coût de checkpoint $C_{min}$ est minimal, et prendrait un checkpoint après chaque $p^{ème}$ instance de cette tâche, ce qui conduit à une période de checkpointing $P_{YD} = pT$ où $\textit{T}{\sum_{i=0}^{n-1}a_i}$ est le temps d’une itération. On choisirait alors p pour que $P_{YD} \approx \sqrt{ 2{\mu}C_\textrm{min}}$ obéisse à la formule de Young/Daly, où µ est le MTBF (temps moyen entre deux pannes) de l’application. La stratégie naïve et celle de Young/Daly sont toutes deux sous-optimales. Notre contribution principale est de montrer que la stratégie optimale est globalement périodique, et de proposer un algorithme de programmation dynamique qui calcule la période de ce checkpoint optimale. Cette dernière peut prendre le checkpoint de plusieurs tâches, et ce sur plusieurs itérations. Des simulations basées sur des scénarios applicatifs à la fois synthétiques et issus d’applications réelles, montrent bien le gain qu’apporte la stratégie optimale.

Mots clés

Fault-tolerance Checkpoint strategy Fail-stop errors Iterative application

Tolérance aux pannes Checkpoint Application itérative

Domaines

Informatique [cs]

Fichier principal

rr9371.pdf (972.35 Ko)

Origine : Fichiers produits par l'(les) auteur(s)

Equipe Roma : Connectez-vous pour contacter le contributeur

https://inria.hal.science/hal-02980455

Soumis le : mardi 27 octobre 2020-14:53:21

Dernière modification le : jeudi 11 mai 2023-11:56:10

Archivage à long terme le : jeudi 28 janvier 2021-19:06:13

Dates et versions

hal-02980455 , version 1 (27-10-2020)

Identifiants

HAL Id : hal-02980455 , version 1

Citer

Yishu Du, Loris Marchal, Guillaume Pallez, Yves Robert. Optimal Checkpointing Strategies for Iterative Applications. [Research Report] RR-9371, Inria - Research Centre Grenoble – Rhône-Alpes. 2020. ⟨hal-02980455⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA-RRRT INRIA2 LARA UDL

87 Consultations

190 Téléchargements