Unified Model for Assessing Checkpointing Protocols at Extreme-Scale

George Bosilca (1), Aurélien Bouteiller (1), Elisabeth Brunet (2), Franck Cappello (3, 4, 5, 6), Jack Dongarra (1), Amina Guermouche (7, 8), Thomas Hérault (1), Yves Robert (7, 8), Frédéric Vivien (7, 8), Dounia Zaidouni (7, 8)
4 GRAND-LARGE - Global parallel and distributed computing
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LIFL - Laboratoire d'Informatique Fondamentale de Lille, LRI - Laboratoire de Recherche en Informatique
7 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract : In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault-tolerant protocols for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline the comparative behavior of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.

Cited literature [36 references]

https://hal.inria.fr/hal-00908447
Contributor : Equipe Roma
Submitted on : Saturday, November 23, 2013 - 2:45:38 AM
Last modification on : Thursday, September 12, 2019 - 3:40:03 PM
Long-term archiving on : Monday, February 24, 2014 - 2:35:20 AM

Citation

George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack Dongarra, et al. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. Concurrency and Computation: Practice and Experience, Wiley, 2013, 26 (17), pp.2727-2810. ⟨10.1002/cpe.3173⟩. ⟨hal-00908447⟩
