Checkpointing vs. migration for post-petascale supercomputers

Franck Cappello 1, 2, 3 Henri Casanova 4 Yves Robert 5, 6, *
* Corresponding author
1 GRAND-LARGE - Global parallel and distributed computing
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LIFL - Laboratoire d'Informatique Fondamentale de Lille, LRI - Laboratoire de Recherche en Informatique
6 GRAAL - Algorithms and Scheduling for Distributed Heterogeneous Platforms
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract : An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We also develop an analytical model of the performance of a standard periodic checkpoint fault-tolerant approach. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also find that standard non-prediction-based fault tolerance achieves poor scaling when compared to prediction-based failure avoidance, thereby demonstrating the importance of failure prediction capabilities. Finally, our results show that achieving good utilization in truly large-scale machines (e.g., 2^20 nodes) for parallel workloads will require more than the failure avoidance techniques evaluated in this work.
Document type :
Conference papers

https://hal.inria.fr/hal-00786377
Contributor : Equipe Roma
Submitted on : Friday, February 8, 2013 - 2:51:37 PM
Last modification on : Thursday, February 21, 2019 - 10:52:50 AM

Citation

Franck Cappello, Henri Casanova, Yves Robert. Checkpointing vs. migration for post-petascale supercomputers. ICPP'2010 - the 39th International Conference on Parallel Processing, 2010, San Diego, United States. ⟨10.1109/ICPP.2010.26⟩. ⟨hal-00786377⟩
