Conference papers

Checkpointing vs. migration for post-petascale supercomputers

Franck Cappello 1, 2, 3 Henri Casanova 4 Yves Robert 5, 6, * 
* Corresponding author
1 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
6 GRAAL - Algorithms and Scheduling for Distributed Heterogeneous Platforms
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract : An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We also develop an analytical model of the performance of a standard periodic checkpoint fault-tolerant approach. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also find that standard non-prediction-based fault tolerance achieves poor scaling when compared to prediction-based failure avoidance, thereby demonstrating the importance of failure prediction capabilities. Finally, our results show that achieving good utilization in truly large-scale machines (e.g., 2^20 nodes) for parallel workloads will require more than the failure avoidance techniques evaluated in this work.
Document type : Conference papers
Contributor : Equipe Roma
Submitted on : Friday, February 8, 2013 - 2:51:37 PM
Last modification on : Friday, November 18, 2022 - 9:26:24 AM




Franck Cappello, Henri Casanova, Yves Robert. Checkpointing vs. migration for post-petascale supercomputers. ICPP'2010 - the 39th International Conference on Parallel Processing, 2010, San Diego, United States. ⟨10.1109/ICPP.2010.26⟩. ⟨hal-00786377⟩


