Scheduling for Fault-Tolerance: An Introduction

Guillaume Aupy; Yves Robert

Book chapter. Year: 2018

Guillaume Aupy (1), Yves Robert (2, 3, 4)

Guillaume Aupy

Role: Author
PersonId: 526
IdHAL: guillaume-aupy
ORCID: 0000-0001-8862-3277
IdRef: 181087006

Topology-Aware System-Scale Data Management for High-Performance Computing

Yves Robert

Role: Author
PersonId: 739318
IdHAL: yves-robert
ORCID: 0000-0003-2361-055X
IdRef: 029813611

Optimisation des ressources : modèles, algorithmes et ordonnancement

Laboratoire de l'Informatique du Parallélisme

Innovative Computing Laboratory [Knoxville]

Abstract

Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this is not as easy as it seems, even for perfectly parallel applications. In particular, processors are subject to faults. The more processors are available, the more likely faults will strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach, and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with prediction and with replication.
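To give a feel for what "determining the optimal checkpointing period" means, here is a minimal sketch based on the classical Young/Daly first-order approximation, which sets the period to sqrt(2 × MTBF × C), where C is the time needed to take one checkpoint. The formula choice, the function names, and all numeric values below are assumptions made for illustration; they are not taken from this record, and the chapter's own derivation may differ in its details. The sketch also assumes independent, identically distributed faults, so that the platform MTBF is the individual MTBF divided by the number of processors.

```python
import math


def platform_mtbf(individual_mtbf_s: float, processors: int) -> float:
    """MTBF of the whole platform, assuming independent, identical processors."""
    return individual_mtbf_s / processors


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order approximation of the optimal checkpointing period."""
    return math.sqrt(2.0 * mtbf_s * checkpoint_cost_s)


if __name__ == "__main__":
    # Hypothetical values: 10-year MTBF per processor, 10-minute checkpoints.
    individual_mtbf_s = 10 * 365 * 24 * 3600
    checkpoint_cost_s = 600.0
    for processors in (1_000, 10_000, 100_000):
        mtbf_s = platform_mtbf(individual_mtbf_s, processors)
        period_s = young_daly_period(checkpoint_cost_s, mtbf_s)
        print(f"{processors} processors: MTBF {mtbf_s:.0f} s, period {period_s:.0f} s")
```

Running the loop shows the trend the abstract describes: as the processor count grows, the platform MTBF shrinks, and checkpoints have to be taken more often.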

Keywords

Relevant core courses: Data Structures and Algorithms; Probabilities
Relevant PDC topics: Scalability in algorithms and architectures; Fault tolerance; Time

Domains

Computer Science [cs]

Main file

new-cder-springer.pdf (256.08 KB)

Origin: Files produced by the author(s)

Contributor: Equipe Roma

https://inria.hal.science/hal-01968454

Submitted on: Wednesday, 23 January 2019, 22:01:55

Last modified on: Thursday, 11 May 2023, 11:56:10

Dates and versions

hal-01968454, version 1 (23-01-2019)

Identifiers

HAL Id: hal-01968454, version 1

Cite

Guillaume Aupy, Yves Robert. Scheduling for Fault-Tolerance: An Introduction. Topics in Parallel and Distributed Computing: Enhancing the Undergraduate Curriculum: Performance, Concurrency, and Programming on Modern Platforms, Springer International Publishing, pp.143-170, 2018, 978-3-319-93109-8. ⟨hal-01968454⟩

Collections

ENS-LYON CNRS INRIA UNIV-LYON1 INRIA2 UDL

49 views

143 downloads
