Two-level checkpointing and partial verifications for linear task graphs

Abstract: Fail-stop and silent errors are unavoidable on large-scale platforms. Efficient resilience techniques must accommodate both error sources. A traditional checkpointing and rollback recovery approach can be used, with added verifications to detect silent errors. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on stable storage (e.g., an external disk). In contrast, it is possible to use in-memory checkpoints for silent errors, which incur a much smaller checkpoint and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms, which are less costly than guaranteed verifications but do not detect all silent errors. In this paper, we show how to combine all these techniques for HPC applications whose dependence graph is a chain of tasks, and provide a sophisticated dynamic programming algorithm returning the optimal solution in polynomial time. Simulations demonstrate that the combined use of multi-level checkpointing and partial verifications further improves performance.
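
To give a feel for the kind of dynamic program the abstract refers to, here is a minimal sketch of a much simpler problem: placing single-level checkpoints on a chain of tasks subject to fail-stop errors only. It is not the algorithm of the paper, which also handles silent errors, in-memory checkpoints, and partial verifications. The function names, the parameters (`ckpt`, `rec`, `lam`, `downtime`), and the error model are all assumptions made for illustration: errors arrive with exponential rate `lam`, every segment ends with a checkpoint of cost `ckpt`, and every failure costs `downtime` plus a recovery of cost `rec` before the segment is retried. Under those assumptions, the expected time of a segment of work W is taken to be e^(lam*rec) * (1/lam + downtime) * (e^(lam*(W + ckpt)) - 1).

```python
import math


def expected_segment_time(work, ckpt, rec, lam, downtime=0.0):
    # Assumed model: exponential fail-stop errors of rate lam; after each
    # failure we pay downtime + rec, then retry the whole segment.
    return math.exp(lam * rec) * (1.0 / lam + downtime) * (
        math.exp(lam * (work + ckpt)) - 1.0
    )


def place_checkpoints(weights, ckpt, rec, lam):
    """Return (expected makespan, 1-based indices of tasks followed by a checkpoint).

    Hypothetical simplification: one checkpoint level, fail-stop errors only,
    and a mandatory checkpoint after the last task.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    best = [0.0] + [math.inf] * n  # best[j]: optimal expected time for tasks 1..j
    prev = [0] * (n + 1)           # prev[j]: position of the previous checkpoint
    for j in range(1, n + 1):
        for i in range(j):
            # Last segment covers tasks i+1..j and ends with a checkpoint.
            cost = best[i] + expected_segment_time(
                prefix[j] - prefix[i], ckpt, rec, lam
            )
            if cost < best[j]:
                best[j] = cost
                prev[j] = i

    # Walk back through prev[] to recover the checkpoint positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    positions.reverse()
    return best[n], positions
```

The double loop examines every (previous checkpoint, current checkpoint) pair, so this sketch runs in O(n^2) time for a chain of n tasks. When errors are rare relative to the checkpoint cost, it places a single checkpoint at the end of the chain; when errors are frequent, it checkpoints more often.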
Document type:
Conference paper
6th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS15), Nov 2015, Austin, TX, United States

Cited literature [14 references]

https://hal.inria.fr/hal-01252400
Contributor: Equipe Roma
Submitted on: Thursday, January 7, 2016 - 15:13:45
Last modified on: Saturday, April 21, 2018 - 01:27:27
Document(s) archived on: Friday, April 8, 2016 - 13:28:39

File

pmbs.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01252400, version 1

Citation

Anne Benoit, Aurélien Cavelan, Yves Robert, Hongyang Sun. Two-level checkpointing and partial verifications for linear task graphs. 6th International Workshop in Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS15), Nov 2015, Austin, TX, United States. 〈hal-01252400〉

Metrics

Record views: 188

File downloads: 45