
Scheduling for Fault-Tolerance: An Introduction

Guillaume Aupy, Yves Robert
Abstract: Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this is not as easy as it seems, even for perfectly parallel applications. In particular, processors are subject to faults: the more processors are available, the more likely faults are to strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach, and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with prediction and with replication.
Document type: Book sections

Cited literature: 21 references
Contributor: Equipe Roma
Submitted on: Wednesday, January 23, 2019 - 10:01:55 PM
Last modification on: Friday, January 21, 2022 - 3:11:11 AM




HAL Id: hal-01968454, version 1



Guillaume Aupy, Yves Robert. Scheduling for Fault-Tolerance: An Introduction. Topics in Parallel and Distributed Computing: Enhancing the Undergraduate Curriculum: Performance, Concurrency, and Programming on Modern Platforms, Springer International Publishing, pp. 143-170, 2018, ISBN 978-3-319-93109-8. ⟨hal-01968454⟩


