Two-Level Checkpointing and Verifications for Linear Task Graphs

Abstract: Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience techniques must accommodate both error sources. To cope with this double challenge, a two-level checkpointing and rollback recovery approach can be used, with additional verifications for silent error detection. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on stable storage (e.g., an external disk). In contrast, silent errors can be handled with in-memory checkpoints, which incur much lower checkpointing and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms that are less costly than guaranteed ones but do not detect all silent errors. In this paper, we show how to combine all of these techniques for HPC applications whose dependency graph forms a linear chain. We present a sophisticated dynamic programming algorithm that returns the optimal solution in polynomial time. Simulation results demonstrate that the combined use of multi-level checkpointing and verifications leads to improved performance compared to the standard single-level checkpointing algorithm.
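To give a feel for the kind of dynamic program the abstract refers to, here is a minimal Python sketch of a much simpler problem: placing single-level (disk) checkpoints on a linear chain of tasks subject to fail-stop errors only. It is not the paper's algorithm, which also handles silent errors, in-memory checkpoints, and partial and guaranteed verifications. The function names, the parameters, and the cost model are assumptions made for illustration: errors are assumed to arrive at an exponential rate `rate`, and a segment of work `W` followed by a checkpoint `C`, with recovery `R` and downtime `D`, is assumed to take expected time `exp(rate*R) * (1/rate + D) * (exp(rate*(W+C)) - 1)`.

```python
import math


def expected_segment_time(work, ckpt, recovery, downtime, rate):
    """Expected time to run `work` plus one checkpoint of cost `ckpt`.

    Assumed model (not taken from the abstract): fail-stop errors arrive
    at exponential rate `rate`; each error costs `downtime` plus a
    `recovery`, after which the segment restarts from its beginning.
    """
    return (math.exp(rate * recovery) * (1.0 / rate + downtime)
            * math.expm1(rate * (work + ckpt)))


def place_checkpoints(weights, ckpt, recovery, downtime, rate):
    """Return (expected makespan, checkpoint positions) for a task chain.

    best[j] is the minimal expected time to complete tasks 1..j with a
    checkpoint taken right after task j. The last task is always
    checkpointed. The double loop runs in O(n^2) time.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):  # i = index of the previous checkpoint (0 = start)
            segment = expected_segment_time(
                prefix[j] - prefix[i], ckpt, recovery, downtime, rate)
            if best[i] + segment < best[j]:
                best[j] = best[i] + segment
                prev[j] = i

    # Walk back through the recorded choices to list checkpointed tasks.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    positions.reverse()
    return best[n], positions
```

For example, `place_checkpoints([100, 100, 100], ckpt=0, recovery=0, downtime=0, rate=0.01)` checkpoints after every task, because free checkpoints can only reduce re-execution, whereas a very small error rate combined with a costly checkpoint yields a single checkpoint at the end of the chain.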
Document type:
Conference paper
The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), May 2016, Chicago, United States. IEEE, pp.10, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 〈10.1109/IPDPSW.2016.106〉

https://hal.inria.fr/hal-01354625
Contributor: Equipe Roma
Submitted on: Friday, August 19, 2016 - 09:57:29
Last modified on: Friday, April 20, 2018 - 15:44:27
Document(s) archived on: Sunday, November 20, 2016 - 10:11:35

File

pdsec2016.pdf
Files produced by the author(s)

Citation

Anne Benoit, Aurélien Cavelan, Yves Robert, Hongyang Sun. Two-Level Checkpointing and Verifications for Linear Task Graphs. The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), May 2016, Chicago, United States. IEEE, pp.10, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 〈10.1109/IPDPSW.2016.106〉. 〈hal-01354625〉

Metrics

Record views

441

File downloads

50