Checkpointing vs. migration for post-petascale supercomputers

Franck Cappello 1, 2, 3 Henri Casanova 4 Yves Robert 5, 6, *
* Corresponding author
1 GRAND-LARGE - Global parallel and distributed computing
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LIFL - Laboratoire d'Informatique Fondamentale de Lille, LRI - Laboratoire de Recherche en Informatique
6 GRAAL - Algorithms and Scheduling for Distributed Heterogeneous Platforms
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract: An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We also develop an analytical model of the performance of a standard periodic checkpoint fault-tolerant approach. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also find that standard non-prediction-based fault tolerance achieves poor scaling when compared to prediction-based failure avoidance, thereby demonstrating the importance of failure prediction capabilities. Finally, our results show that achieving good utilization in truly large-scale machines (e.g., 2^20 nodes) for parallel workloads will require more than the failure avoidance techniques evaluated in this work.
Document type:
Conference paper
ICPP'2010 - the 39th International Conference on Parallel Processing, 2010, San Diego, United States. IEEE Computer Society Press, 2010, 〈10.1109/ICPP.2010.26〉

https://hal.inria.fr/hal-00786377
Contributor: Equipe Roma
Submitted on: Friday, February 8, 2013 - 14:51:37
Last modified on: Friday, April 20, 2018 - 15:44:24

Citation

Franck Cappello, Henri Casanova, Yves Robert. Checkpointing vs. migration for post-petascale supercomputers. ICPP'2010 - the 39th International Conference on Parallel Processing, 2010, San Diego, United States. IEEE Computer Society Press, 2010, 〈10.1109/ICPP.2010.26〉. 〈hal-00786377〉
