Book sections

Scheduling for Fault-Tolerance: An Introduction

Guillaume Aupy, Yves Robert
Abstract : Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this decrease is not as easy to achieve as it seems, even for perfectly parallel applications. In particular, processors are subject to faults: the more processors are available, the more likely it is that a fault will strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with fault prediction and with replication.

Cited literature: 21 references

https://hal.inria.fr/hal-01968454
Contributor : Equipe Roma
Submitted on : Wednesday, January 23, 2019 - 10:01:55 PM
Last modification on : Wednesday, November 20, 2019 - 3:23:42 AM

Identifiers

  • HAL Id : hal-01968454, version 1

Citation

Guillaume Aupy, Yves Robert. Scheduling for Fault-Tolerance: An Introduction. Topics in Parallel and Distributed Computing: Enhancing the Undergraduate Curriculum: Performance, Concurrency, and Programming on Modern Platforms, Springer International Publishing, pp.143-170, 2018, 978-3-319-93109-8. ⟨hal-01968454⟩
