Replication Is More Efficient Than You Think - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper, Year: 2019

Replication Is More Efficient Than You Think

Abstract

This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which works as follows: (i) compute the application Mean Time To Interruption (MTTI) $M$ as a function of the number of processor pairs and the individual processor Mean Time Between Failures (MTBF); (ii) use the checkpointing period $T = \sqrt{2MC}$ à la Young/Daly, where $C$ is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy, where failed processors are restarted after each checkpoint. We compute the optimal checkpointing period $T_{\text{opt}}$ for this strategy, which is much larger than $T$, thereby decreasing I/O pressure. We show through simulations that using $T_{\text{opt}}$ and the restart strategy, instead of $T$ and the usual no-restart strategy, significantly decreases the overhead induced by replication.
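Step (ii) of the no-restart strategy can be illustrated with a minimal sketch of the Young/Daly period $T = \sqrt{2MC}$. The function name `young_daly_period` and the sample values are hypothetical and do not come from the paper. The MTTI $M$ is taken as an input here, not derived from the number of processor pairs and the MTBF, and the optimal period $T_{\text{opt}}$ of the restart strategy is not covered.

```python
import math


def young_daly_period(mtti: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly checkpointing period T = sqrt(2 * M * C).

    mtti            -- application Mean Time To Interruption M
    checkpoint_cost -- checkpoint duration C, in the same time unit as M
    """
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtti and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


if __name__ == "__main__":
    # Illustrative values only (assumptions, in seconds).
    M = 50_000.0  # assumed application MTTI
    C = 60.0      # assumed checkpoint duration
    print(f"Young/Daly period: {young_daly_period(M, C):.1f} s")
```

Because the period grows only with the square root of $M$, quadrupling the MTTI doubles the checkpointing period.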
Main file
sc-hal.pdf (672.3 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02273142, version 1 (03-12-2019)

License

Attribution

Cite

Anne Benoit, Thomas Hérault, Valentin Le Fèvre, Yves Robert. Replication Is More Efficient Than You Think. SC 2019 - International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), Nov 2019, Denver, United States. pp.1-14, ⟨10.1145/3295500.3356171⟩. ⟨hal-02273142⟩