Book section. Year: 2018

Scheduling for Fault-Tolerance: An Introduction


Abstract

Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this is not as easy as it seems, even for perfectly parallel applications. In particular, processors are subject to faults. The more processors are available, the more likely faults will strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach, and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with prediction and with replication.
Main file: new-cder-springer.pdf (256.08 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01968454, version 1 (23-01-2019)

Identifiers

  • HAL Id: hal-01968454, version 1

Cite

Guillaume Aupy, Yves Robert. Scheduling for Fault-Tolerance: An Introduction. Topics in Parallel and Distributed Computing: Enhancing the Undergraduate Curriculum: Performance, Concurrency, and Programming on Modern Platforms, Springer International Publishing, pp. 143-170, 2018, 978-3-319-93109-8. ⟨hal-01968454⟩