Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Abstract: In this paper, we combine traditional checkpointing and rollback recovery strategies with verification mechanisms to address both fail-stop and silent errors. The objective is to minimize either makespan or energy consumption. While DVFS is a popular approach for reducing energy consumption, using lower speeds/voltages can increase the number of errors, thereby complicating the problem. We consider an application workflow whose dependence graph is a chain of tasks, and we study three execution scenarios: (i) a single speed is used during the whole execution; (ii) a second, possibly higher speed is used for any potential re-execution; (iii) different pairs of speeds can be used throughout the execution. For each scenario, we determine the optimal checkpointing and verification locations (and the optimal speeds for the third scenario) to minimize either objective. The different execution scenarios are then assessed and compared through an extensive set of experiments.
Document type:
Journal article
ACM Transactions on Parallel Computing, Association for Computing Machinery, 2016, 3 (2), pp.1-36. 〈10.1145/2897189〉

Cited literature [40 references]

https://hal.inria.fr/hal-01358146
Contributor: Equipe Roma
Submitted on: Wednesday, 31 August 2016 - 10:52:56
Last modified on: Friday, 20 April 2018 - 15:44:27
Document(s) archived on: Friday, 2 December 2016 - 01:45:27

File

wocopyright-TOPC.pdf
Files produced by the author(s)

Citation

Anne Benoit, Aurélien Cavelan, Yves Robert, Hongyang Sun. Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors. ACM Transactions on Parallel Computing, Association for Computing Machinery, 2016, 3 (2), pp.1-36. 〈10.1145/2897189〉. 〈hal-01358146〉
