Replication Is More Efficient Than You Think

Abstract : This paper revisits replication coupled with checkpointing for fail-stop errors. Replication enables the application to survive many fail-stop errors, thereby allowing for longer checkpointing periods. Previously published works use replication with the no-restart strategy, which never restarts failed processors until the application crashes. We introduce the restart strategy, where failed processors are restarted after each checkpoint, which may introduce additional overhead during checkpoints but prevents the application configuration from degrading throughout successive checkpointing periods. We show how to compute the optimal checkpointing period for this strategy, which is much larger than the one with no-restart, thereby decreasing I/O pressure. We show through simulations that using the restart strategy significantly decreases the overhead induced by replication, in terms of both total execution time and energy consumption.
Document type :
Reports

Cited literature [45 references]

https://hal.inria.fr/hal-02265925
Contributor : Equipe Roma
Submitted on : Tuesday, August 13, 2019 - 3:53:37 PM
Last modification on : Wednesday, August 14, 2019 - 9:04:34 AM

File

rr9278.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-02265925, version 1

Citation

Anne Benoit, Thomas Hérault, Valentin Le Fèvre, Yves Robert. Replication Is More Efficient Than You Think. [Research Report] RR-9278, Inria - Research Centre Grenoble – Rhône-Alpes. 2019. ⟨hal-02265925⟩
