A distributed and replicated service for checkpoint storage

Fatiha Bouabache 1, 2 Thomas Hérault 2, 3 Gilles Fedak 2, 4 Franck Cappello 1, 3
3 GRAND-LARGE - Global parallel and distributed computing
LRI - Laboratoire de Recherche en Informatique, LIFL - Laboratoire d'Informatique Fondamentale de Lille, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
4 AVALON - Algorithms and Software Architectures for Distributed and HPC Platforms
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract: As High Performance platforms (Clusters, Grids, etc.) continue to grow in size, the average time between failures decreases to a critical level. An efficient and reliable fault tolerance protocol therefore plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage: most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such a failure leads to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in that administrative domain. It is thus not safe to rely on the high MTBF of specific machines to store the checkpoint images. This paper introduces a new protocol that ensures the reliability of checkpoint storage even if one or more checkpoint servers fail. To provide this reliability, the protocol is based on a replication process. We evaluate our solution through simulations against several criteria: scalability, topology, and reliability of the nodes. We also compare two replication strategies to decide which one should be used in the implementation.
Document type:
Book chapter
Springer, pp. 295-306, 2008, 〈10.1007/978-0-387-78448-9_24〉

https://hal.inria.fr/hal-00689921
Contributor: Ist Rennes
Submitted on: Friday, 20 April 2012 - 15:21:24
Last modified on: Friday, 20 April 2018 - 15:44:26

Citation

Fatiha Bouabache, Thomas Hérault, Gilles Fedak, Franck Cappello. A distributed and replicated service for checkpoint storage. Springer, pp. 295-306, 2008, 〈10.1007/978-0-387-78448-9_24〉. 〈hal-00689921〉
