Unified Model for Assessing Checkpointing Protocols at Extreme-Scale

George Bosilca 1 Aurélien Bouteiller 1 Elisabeth Brunet 2 Franck Cappello 3, 4, 5 Jack Dongarra 1 Amina Guermouche 6, 7 Thomas Hérault 1 Yves Robert 6, 7 Frédéric Vivien 6, 7 Dounia Zaidouni 6, 7
3 GRAND-LARGE - Global parallel and distributed computing
CNRS - Centre National de la Recherche Scientifique : UMR8623, Inria Saclay - Ile de France, UP11 - Université Paris-Sud - Paris 11, LIFL - Laboratoire d'Informatique Fondamentale de Lille, LRI - Laboratoire de Recherche en Informatique
6 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract : In this article, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.

Cited literature: 34 references

https://hal.inria.fr/hal-00696154
Contributor : Amina Guermouche
Submitted on : Monday, October 8, 2012 - 4:47:19 PM
Last modification on : Thursday, September 12, 2019 - 3:40:03 PM
Long-term archiving on : Friday, December 16, 2016 - 9:58:01 PM

File

RR-7950.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-00696154, version 2

Citation

George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack Dongarra, et al. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. [Research Report] RR-7950, INRIA. 2012. ⟨hal-00696154v2⟩
