Combining Process Replication and Checkpointing for Resilience on Exascale Systems

Processor failures in post-petascale settings are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback, has recently been advocated by Ferreira et al. We first identify an incorrect analogy made in their work between process replication and the birthday problem, and derive correct values for the Mean Number of Failures To Interruption and the Mean Time To Interruption for exponentially distributed failures. We then extend these results to arbitrary failure distributions, including closed-form solutions for Weibull distributions. Finally, we evaluate process replication using both synthetic and real-world failure traces. Our main findings are: (i) replication is less beneficial than claimed by Ferreira et al.; (ii) although the choice of the checkpointing period can have a high impact on application execution in the no-replication case, with process replication this choice is no longer critical.
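To make the distinction between a birthday-style count and a replica-pair count concrete, the following is a minimal Monte Carlo sketch. It is not taken from the report: the failure model (each failure strikes one uniformly chosen processor among those still alive, and the application is interrupted once both replicas of some pair are dead), the two-replicas-per-process layout, and all function names are assumptions made here for illustration. The report's own definitions and closed-form values may differ.

```python
import random


def failures_until_pair_lost(n_pairs, rng):
    """Count failures until both replicas of some pair are dead.

    Assumption (illustrative, not from the report): each failure strikes
    one processor chosen uniformly among the processors still alive.
    Processor p belongs to pair p // 2.
    """
    alive = list(range(2 * n_pairs))
    dead_in_pair = [0] * n_pairs
    failures = 0
    while True:
        # Remove a uniformly random alive processor (swap with last, pop).
        i = rng.randrange(len(alive))
        p = alive[i]
        alive[i] = alive[-1]
        alive.pop()
        failures += 1
        dead_in_pair[p // 2] += 1
        if dead_in_pair[p // 2] == 2:
            return failures


def draws_until_birthday_collision(n_bins, rng):
    """Classic birthday count: uniform draws until some bin is hit twice."""
    seen = set()
    draws = 0
    while True:
        b = rng.randrange(n_bins)
        draws += 1
        if b in seen:
            return draws
        seen.add(b)


def mean(fn, n, trials, seed):
    """Average fn(n, rng) over independent trials with a fixed seed."""
    rng = random.Random(seed)
    return sum(fn(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for n in (16, 256, 4096):
        replica = mean(failures_until_pair_lost, n, 2000, seed=1)
        birthday = mean(draws_until_birthday_collision, n, 2000, seed=1)
        print(f"n={n}: replica-pair model {replica:.1f}, birthday count {birthday:.1f}")
```

Under these assumptions the two counts differ because, after k failures on distinct pairs, the next failure interrupts the application with probability k/(2n-k) in the replica-pair model but k/n in the birthday count. The sketch only illustrates that the two processes are not the same; it does not reproduce the report's derivations.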

Keywords

Fault-tolerance, checkpointing, replication, exascale

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

RR-7951.pdf (958.58 KB)

Origin: Files produced by the author(s)

Contributor: Frédéric Vivien

https://inria.hal.science/hal-00697180

Submitted on: Tuesday, March 12, 2013, 18:16:41

Last modified on: Thursday, February 15, 2024, 03:31:09

Long-term archiving on: Sunday, April 2, 2017, 11:44:13

Dates and versions

hal-00697180, version 1 (14-05-2012)

hal-00697180, version 2 (12-03-2013)

Identifiers

HAL Id: hal-00697180, version 2

Cite

Henri Casanova, Yves Robert, Frédéric Vivien, Dounia Zaidouni. Combining Process Replication and Checkpointing for Resilience on Exascale Systems. [Research Report] RR-7951, INRIA. 2012. ⟨hal-00697180v2⟩


Collections

ENS-LYON UNIV-RENNES1 CNRS INRIA UNIV-LYON1 IRISA INRIA-RRRT INRIA2 LARA UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES UDL UR1-MATH-NUM
