Unified Model for Assessing Checkpointing Protocols at Extreme-Scale

George Bosilca 1 Aurélien Bouteiller 1 Élisabeth Brunet 2 Franck Cappello 3, 4, 5, 6 Jack Dongarra 1 Amina Guermouche 7, 8 Thomas Hérault 1 Yves Robert 7, 8 Frédéric Vivien 7, 8 Dounia Zaidouni 7, 8
4 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
7 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract: In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault-tolerant protocols for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available HPC platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation.
Document type:
Journal article
Concurrency and Computation: Practice and Experience, Wiley, 2013, 26 (17), pp.2727-2810. 〈10.1002/cpe.3173〉
Cited literature: 36 references

https://hal.inria.fr/hal-00908447
Contributor: Equipe Roma
Submitted on: Saturday, November 23, 2013 - 02:45:38
Last modified on: Friday, April 20, 2018 - 15:44:26
Document(s) archived on: Monday, February 24, 2014 - 02:35:20

File

concurrency-revised.pdf
Files produced by the author(s)

Citation

George Bosilca, Aurélien Bouteiller, Élisabeth Brunet, Franck Cappello, Jack Dongarra, et al.. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. Concurrency and Computation: Practice and Experience, Wiley, 2013, 26 (17), pp.2727-2810. 〈10.1002/cpe.3173〉. 〈hal-00908447〉
