Towards Optimal Multi-Level Checkpointing

We provide a framework to analyze multi-level checkpointing protocols, by formally defining a $k$-level checkpointing pattern. We provide a first-order approximation to the optimal checkpointing pattern, and show that the corresponding overhead is of the order of $\sum_{\ell=1}^{k}\sqrt{2\lambda_\ell C_\ell}$, where $\lambda_\ell$ is the error rate at level~$\ell$, and $C_\ell$ the checkpointing cost at level~$\ell$. This nicely extends the classical Young/Daly formula. Furthermore, we are able to fully characterize the shape of the optimal pattern (number and positions of checkpoints), and we provide a dynamic programming algorithm to determine which levels should be used. Finally, we perform simulations to check the accuracy of the theoretical study and to confirm the optimality of the subset of levels returned by the dynamic programming algorithm. The results nicely corroborate the theoretical study, and demonstrate the usefulness of multi-level checkpointing with the optimal subset of levels.

Ce travail analyse les techniques de checkpoint multi-niveaux. On étudie les schémas de calcul périodiques, où les différents niveaux de checkpoint sont imbriqués, et on caractérise le schéma optimal, i.e., celui dont le surcoût par unité de calcul est minimal. On montre que ce surcoût minimal est de l'ordre de $\sum_{\ell=1}^{k}\sqrt{2\lambda_\ell C_\ell}$, où $\lambda_\ell$ est le taux d'erreur au niveau~$\ell$, et $C_\ell$ le coût de checkpoint au niveau~$\ell$. Cette formule étend la célèbre formule de Young/Daly pour un seul niveau. On propose également un algorithme de programmation dynamique pour déterminer le meilleur sous-ensemble de niveuax à utiliser pour minimiser le surcoût global. Enfin, nous conduisons des simulations pour vérifier l'étude théorique, et confirmer l'optimalité du sous-ensemble déterminé par l'algorithme de programmation dynamique. Les résultats corroborent bien l'étude théorique, et montrent toute l'utilité d'une approche multi-niveaux basée sur le sous-ensemble de niveaux optimal.

Mots clés

resilience fail-stop errors multi-level checkpointing optimal pattern

Domaines

Informatique [cs]

Fichier principal

RR-8930.pdf (928.79 Ko)

Origine : Fichiers produits par l'(les) auteur(s)

Equipe Roma : Connectez-vous pour contacter le contributeur

https://inria.hal.science/hal-01339788

Soumis le : lundi 19 décembre 2016-12:40:36

Dernière modification le : mardi 6 février 2024-11:09:05

Dates et versions

hal-01339788 , version 1 (30-06-2016)

hal-01339788 , version 2 (14-12-2016)

hal-01339788 , version 3 (19-12-2016)

Identifiants

HAL Id : hal-01339788 , version 3

Citer

Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert, Hongyang Sun. Towards Optimal Multi-Level Checkpointing. [Research Report] RR-8930, INRIA Grenoble - Rhone-Alpes. 2016. ⟨hal-01339788v3⟩

Exporter

BibTeX XML-TEI Dublin Core DC Terms EndNote DataCite

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA-RRRT INRIA2 LARA UDL

139 Consultations

210 Téléchargements