Replication Is More Efficient Than You Think

This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which never restarts failed processors until the application crashes. We introduce the restart strategy, where failed processors are restarted after each checkpoint, which may introduce additional overhead during checkpoints but prevents the application configuration from degrading throughout successive checkpointing periods. We show how to compute the optimal checkpointing period for this strategy, which is much larger than the one with no-restart, thereby decreasing I/O pressure. We show through simulations that using the restart strategy significantly decreases the overhead induced by replication, in terms of both total execution time and energy consumption.
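To illustrate why surviving more errors permits longer checkpointing periods, here is a minimal sketch based on the classical first-order Young/Daly approximation, period = sqrt(2 x checkpoint cost x mean time between interruptions). This is not the formula derived in the report for the restart strategy; the function name, parameter names, and numeric values below are assumptions chosen only for illustration.

```python
import math


def young_daly_period(checkpoint_cost: float, mean_time_between_interruptions: float) -> float:
    """Classical first-order checkpointing period (Young/Daly approximation).

    Both arguments are in seconds. This is a generic baseline, NOT the
    restart-strategy period computed in the report.
    """
    if checkpoint_cost <= 0 or mean_time_between_interruptions <= 0:
        raise ValueError("both arguments must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mean_time_between_interruptions)


# Hypothetical values: a 50 s checkpoint, and a platform whose mean time
# between application interruptions is 10,000 s.
baseline = young_daly_period(50.0, 10_000.0)

# If the application is interrupted four times less often (assumed here,
# for example because replicas absorb most failures), the period doubles.
with_fewer_interruptions = young_daly_period(50.0, 40_000.0)

print(baseline, with_fewer_interruptions)
```

Under this approximation the period grows with the square root of the mean time between interruptions, so any mechanism that makes interruptions rarer also reduces how often checkpoints must be written.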

Keywords

checkpoint, optimal checkpointing period, replication, restart strategy

Domains

Computer Science [cs]

Main file

rr9278.pdf (1.22 MB)

Origin: Files produced by the author(s)

Contributor: Equipe Roma

https://inria.hal.science/hal-02265925

Submitted on: Tuesday, August 13, 2019, 15:53:37

Last modified on: Thursday, May 11, 2023, 11:56:10

Long-term archiving on: Thursday, January 9, 2020, 20:26:01

Dates and versions

hal-02265925, version 1 (13-08-2019)

Identifiers

HAL Id: hal-02265925, version 1

Cite

Anne Benoit, Thomas Herault, Valentin Le Fèvre, Yves Robert. Replication Is More Efficient Than You Think. [Research Report] RR-9278, Inria - Research Centre Grenoble – Rhône-Alpes. 2019. ⟨hal-02265925⟩

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA-RRRT INRIA2 LARA UDL

101 views

205 downloads