An optimal algorithm for scheduling checkpoints with variable costs

Mohamed Slim Bouguerra (1), Denis Trystram (1), Frédéric Wagner (1)
(1) MOAIS - PrograMming and scheduling design fOr Applications in Interactive Simulation, Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract: Over the last decade, computing systems have turned to large-scale parallel platforms composed of thousands of processors. Many current applications run on such systems for long durations, up to several days or weeks. Recent statistical studies of failures on high-performance computing platforms emphasize that the mean time between failures may not exceed a few hours. It is therefore necessary to develop efficient strategies that provide safe and reliable completion of applications. This may be achieved through redundancy or by storing intermediate computation states on reliable external devices. Saved states are then used to restart computations from the last checkpoint. This latter approach, called checkpointing, is one of the most popular fault-tolerance techniques in parallel systems.
Document type:
Report
[Technical Report] 2010
Cited literature [9 references]

https://hal.inria.fr/inria-00558861
Contributor: Mohamed Slim Bouguerra
Submitted on: Monday, 24 January 2011 - 13:24:36
Last modified on: Thursday, 11 January 2018 - 06:22:02
Document(s) archived on: Friday, 2 December 2016 - 19:06:41

File

trystram_fault_tolerance.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: inria-00558861, version 1

Citation

Mohamed Slim Bouguerra, Denis Trystram, Frédéric Wagner. An optimal algorithm for scheduling checkpoints with variable costs. [Technical Report] 2010. 〈inria-00558861〉

Metrics

Record views: 1027

File downloads: 480