Scheduling for Fault-Tolerance: An Introduction
Book chapter. Year: 2018

Abstract

Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this is not as easy as it seems, even for perfectly parallel applications. In particular, processors are subject to faults. The more processors are available, the more likely faults will strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach, and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with prediction and with replication.
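
As a rough illustration of the two ideas in the abstract (faults become more frequent as processors are added, and a checkpointing period can be optimized), here is a minimal sketch. It is not taken from the chapter itself. It assumes independent, exponentially distributed failures, so that the platform mean time between failures (MTBF) is the individual MTBF divided by the number of processors, and it uses the classical first-order approximation usually attributed to Young and Daly, period = sqrt(2 * C * MTBF), which is only reasonable when the checkpoint cost C is small compared with the MTBF. The function and parameter names are hypothetical.

```python
import math


def platform_mtbf(individual_mtbf: float, processors: int) -> float:
    """MTBF of the whole platform, assuming independent processor failures."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return individual_mtbf / processors


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order approximation of the optimal checkpointing period.

    Assumes exponentially distributed failures and checkpoint_cost << mtbf.
    All quantities must use the same time unit (for example, seconds).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical numbers: each processor fails on average every 1,000,000 s,
    # the platform has 100 processors, and one checkpoint takes 50 s.
    mtbf = platform_mtbf(1_000_000.0, 100)
    period = young_daly_period(50.0, mtbf)
    print(f"platform MTBF: {mtbf:.0f} s, checkpoint period: {period:.0f} s")
```

With these made-up values the platform MTBF is 10,000 s and the suggested period is 1,000 s. Doubling the number of processors halves the platform MTBF and shrinks the period by a factor of sqrt(2), which is why adding processors does not reduce execution time as easily as one might expect.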
Main file: new-cder-springer.pdf (256.08 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01968454, version 1 (23-01-2019)

Identifiers

  • HAL Id: hal-01968454, version 1

Cite

Guillaume Aupy, Yves Robert. Scheduling for Fault-Tolerance: An Introduction. Topics in Parallel and Distributed Computing: Enhancing the Undergraduate Curriculum: Performance, Concurrency, and Programming on Modern Platforms, Springer International Publishing, pp. 143-170, 2018, ISBN 978-3-319-93109-8. ⟨hal-01968454⟩