Reports (Research report)

Unified Model for Assessing Checkpointing Protocols at Extreme-Scale

George Bosilca 1 Aurélien Bouteiller 1 Elisabeth Brunet 2 Franck Cappello 3, 4, 5 Jack Dongarra 1 Amina Guermouche 6, 7 Thomas Hérault 1 Yves Robert 6, 7 Frédéric Vivien 6, 7 Dounia Zaidouni 6, 7 
3 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
6 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract : In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
Document type : Reports (Research report)
Contributor : Amina Guermouche
Submitted on : Monday, October 8, 2012 - 4:47:19 PM
Last modification on : Friday, November 18, 2022 - 9:23:54 AM
Long-term archiving on : Friday, December 16, 2016 - 9:58:01 PM


Files produced by the author(s)


  • HAL Id : hal-00696154, version 2


George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack Dongarra, et al. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. [Research Report] RR-7950, INRIA. 2012. ⟨hal-00696154v2⟩


