
Scheduling for fault-tolerance: an introduction

Abstract: This report provides an introduction to the design of scheduling algorithms that cope with faults on large-scale parallel platforms. We study checkpointing and show how to derive the optimal checkpointing period. We then explain how to combine checkpointing with fault prediction, and discuss how the optimal period changes when this combination is used. Finally, we follow the same approach for the combination of checkpointing with replication.
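To illustrate the kind of derivation the abstract refers to, here is a minimal sketch of the classical first-order Young/Daly approximation for the optimal checkpointing period. The report's own model and notation may differ. It assumes that failures strike the platform with mean time between failures mu, that a checkpoint costs C seconds, that downtime and recovery costs are ignored, and that on average half a period of work is lost per failure. Under these assumptions the waste fraction is C/T + T/(2 mu), which is minimized at T = sqrt(2 mu C). The function names and numeric values below are hypothetical.

```python
import math


def waste(period: float, ckpt_cost: float, mtbf: float) -> float:
    """First-order waste fraction of periodic checkpointing (assumed model).

    ckpt_cost / period  : fraction of time spent writing checkpoints.
    period / (2 * mtbf) : expected lost work per failure (half a period on
                          average) multiplied by the failure rate 1 / mtbf.
    """
    return ckpt_cost / period + period / (2.0 * mtbf)


def young_daly_period(ckpt_cost: float, mtbf: float) -> float:
    """Period minimizing the first-order waste: T = sqrt(2 * mu * C)."""
    return math.sqrt(2.0 * mtbf * ckpt_cost)


if __name__ == "__main__":
    C = 60.0        # hypothetical checkpoint cost, in seconds
    mu = 86400.0    # hypothetical platform MTBF, in seconds (one day)
    t_opt = young_daly_period(C, mu)
    # Sanity check: a brute-force scan over integer periods should land next to t_opt.
    best = min(range(1, 20000), key=lambda t: waste(t, C, mu))
    print(f"T_opt = {t_opt:.1f} s, waste = {waste(t_opt, C, mu):.4f}")
    print(f"grid best = {best} s")
```

Substituting T = sqrt(2 mu C) back into the waste gives sqrt(2C/mu), so the waste grows with the checkpoint cost and shrinks as the platform becomes more reliable. This first-order reasoning is valid only when C is small compared to mu.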
Document type: Reports

Cited literature: 23 references

https://hal.inria.fr/hal-01393192
Contributor: Equipe Roma
Submitted on: Tuesday, December 13, 2016 - 11:56:55 AM
Last modification on: Wednesday, November 20, 2019 - 3:17:53 AM
Long-term archiving on: Tuesday, March 14, 2017 - 12:42:30 PM

File

rr8971.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01393192, version 2

Citation

Guillaume Aupy, Yves Robert. Scheduling for fault-tolerance: an introduction. [Research Report] RR-8971, INRIA. 2016. ⟨hal-01393192v2⟩
