Journal articles

Unified Model for Assessing Checkpointing Protocols at Extreme-Scale

George Bosilca (1); Aurélien Bouteiller (1); Elisabeth Brunet (2); Franck Cappello (3, 4, 5, 6); Jack Dongarra (1); Amina Guermouche (7, 8); Thomas Herault (1); Yves Robert (7, 8); Frédéric Vivien (7, 8); Dounia Zaidouni (7, 8)
4 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
7 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract: In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault-tolerant protocols for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.

Cited literature: 36 references
Contributor: Equipe Roma
Submitted on: Saturday, November 23, 2013 - 2:45:38 AM
Last modification on: Friday, November 18, 2022 - 9:23:38 AM
Long-term archiving on: Monday, February 24, 2014 - 2:35:20 AM


Files produced by the author(s)



George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack Dongarra, et al. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. Concurrency and Computation: Practice and Experience, 2013, 26 (17), pp. 2727-2810. ⟨10.1002/cpe.3173⟩. ⟨hal-00908447⟩


