When Amdahl Meets Young/Daly

Abstract: This paper investigates the optimal number of processors to execute a parallel job, whose speedup profile obeys Amdahl's law, on a large-scale platform subject to fail-stop and silent errors. We combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both error sources. We provide an exact formula to express the execution overhead incurred by a periodic checkpointing pattern of length $T$ and with $P$ processors, and we give first-order approximations for the optimal values $T^*$ and $P^*$ as a function of the individual processor failure rate $\lambda_{\mathrm{ind}}$. A striking result is that $P^*$ is of the order $\lambda_{\mathrm{ind}}^{-1/4}$ if the checkpointing cost grows linearly with the number of processors, and of the order $\lambda_{\mathrm{ind}}^{-1/3}$ if the checkpointing cost stays bounded for any $P$. We conduct an extensive set of simulations to support the theoretical study. The results confirm the accuracy of the first-order approximation under a wide range of parameter settings.
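For readers unfamiliar with the two ingredients named in the title, the sketch below shows the textbook forms of Amdahl's law and of the classical Young/Daly checkpointing period. It is an illustration only and not the formula derived in the paper: the classical Young/Daly period covers fail-stop errors alone, whereas the paper also handles silent errors and verification costs. The function names and the parameters `alpha` (sequential fraction) and `checkpoint_cost` are assumptions introduced here, as is the modeling choice that the platform failure rate equals the number of processors times the individual rate.

```python
import math


def amdahl_speedup(p, alpha):
    """Amdahl's law: speedup on p processors when a fraction alpha
    of the work is inherently sequential (alpha is an assumed name)."""
    return 1.0 / (alpha + (1.0 - alpha) / p)


def young_daly_period(checkpoint_cost, p, lambda_ind):
    """Classical Young/Daly first-order period sqrt(2 * C / lambda).

    Assumption: the platform failure rate is lambda = p * lambda_ind,
    i.e. failures of the p processors are independent and identical.
    This is the fail-stop-only textbook formula, not the paper's result.
    """
    platform_rate = p * lambda_ind
    return math.sqrt(2.0 * checkpoint_cost / platform_rate)


if __name__ == "__main__":
    # Hypothetical values: 10% sequential work, 50 s checkpoints,
    # one failure per 10^6 s per processor.
    for p in (10, 100, 1000):
        s = amdahl_speedup(p, alpha=0.1)
        t = young_daly_period(checkpoint_cost=50.0, p=p, lambda_ind=1e-6)
        print(f"P={p:5d}  speedup={s:6.2f}  period={t:8.1f} s")
```

With these hypothetical values, adding processors increases the speedup only up to the bound 1/alpha while shortening the checkpointing period, which is the tension that makes an optimal processor count exist.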
Document type:
Conference paper
Cluster'2016, Sep 2016, Taipei, Taiwan. IEEE Computer Society, Cluster'2016

https://hal.inria.fr/hal-01355963
Contributor: Equipe Roma
Submitted on: Wednesday, August 24, 2016 - 15:23:41
Last modified on: Friday, April 20, 2018 - 15:44:27
Document(s) archived on: Friday, November 25, 2016 - 13:44:17

File

cluster2016.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01355963, version 1

Citation

Aurélien Cavelan, Jiafan Li, Yves Robert, Hongyang Sun. When Amdahl Meets Young/Daly. Cluster'2016, Sep 2016, Taipei, Taiwan. IEEE Computer Society, Cluster'2016. 〈hal-01355963〉
