On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing

Abstract: Processor failures in post-petascale parallel computing platforms are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback-recovery, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback-recovery, has recently been advocated. We first derive novel theoretical results for Exponential failure distributions, namely exact values for the Mean Number of Failures To Interruption and the Mean Time To Interruption. We then extend these results to arbitrary failure distributions, obtaining closed-form solutions for Weibull distributions. Finally, we evaluate process replication in simulation using both synthetic and real-world failure traces so as to quantify average application makespan. One interesting result from these experiments is that, when process replication is used, application performance is not sensitive to the checkpointing period, provided that the period is within a large neighborhood of the optimal period. More generally, our empirical results make it possible to identify regimes in which process replication is beneficial.
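To make the Mean Number of Failures To Interruption concrete, here is a minimal Monte Carlo sketch. It is not the paper's exact derivation. It rests on assumptions not stated in the abstract: each process has exactly two replicas, every failure strikes a uniformly random processor among those still running, and no processor is repaired before the interruption. The names `failures_to_interruption` and `estimate_mnfti` are hypothetical and chosen for illustration.

```python
import random


def failures_to_interruption(n_pairs, rng):
    """Count failures until both replicas of some process have failed."""
    # Processors 2*i and 2*i + 1 are assumed to host the two replicas of process i.
    alive = list(range(2 * n_pairs))
    hit_pairs = set()  # processes that have already lost one replica
    failures = 0
    while True:
        # Assumption: the next failure hits a uniformly random running processor.
        victim = alive.pop(rng.randrange(len(alive)))
        failures += 1
        pair = victim // 2
        if pair in hit_pairs:
            # Second replica of the same process is lost: application interrupted.
            return failures
        hit_pairs.add(pair)


def estimate_mnfti(n_pairs, trials=10000, seed=0):
    """Average the failure count over independent simulated runs."""
    rng = random.Random(seed)
    total = sum(failures_to_interruption(n_pairs, rng) for _ in range(trials))
    return total / trials
```

Under these assumptions a single replicated process is always interrupted by its second failure, and with `n_pairs` processes the count lies between 2 and `n_pairs + 1`, which gives a quick sanity check on the estimator.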

https://hal.inria.fr/hal-01199752
Contributor: Equipe Roma
Submitted on: Wednesday, September 16, 2015 - 10:09:48
Last modified on: Friday, April 20, 2018 - 15:44:27
Document(s) archived on: Tuesday, December 29, 2015 - 07:26:27

File

FGCS.pdf
Files produced by the author(s)

Citation

Henri Casanova, Yves Robert, Frédéric Vivien, Dounia Zaidouni. On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing. Future Generation Computer Systems, Elsevier, 2015, 51, pp.13. 〈10.1016/j.future.2015.04.003〉. 〈hal-01199752〉
