
Scheduling for Fault-Tolerance: An Introduction

Guillaume Aupy, Yves Robert
Abstract: Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this is not as easy as it seems, even for perfectly parallel applications. In particular, processors are subject to faults: the more processors are available, the more likely it is that faults will strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with prediction and with replication.
Document type: Book sections

Cited literature: 21 references
Contributor: Equipe Roma
Submitted on: Wednesday, January 23, 2019 - 10:01:55 PM
Last modification on: Monday, November 16, 2020 - 9:56:04 AM




  • HAL Id: hal-01968454, version 1



Guillaume Aupy, Yves Robert. Scheduling for Fault-Tolerance: An Introduction. Topics in Parallel and Distributed Computing: Enhancing the Undergraduate Curriculum: Performance, Concurrency, and Programming on Modern Platforms, Springer International Publishing, pp. 143-170, 2018, 978-3-319-93109-8. ⟨hal-01968454⟩


