A Skeletal-Based Approach for the Development of Fault-Tolerant SPMD Applications

Constantinos Makassikis (1, 2), Virginie Galtier (2), Stéphane Vialle (1, 2)
1 ALGORILLE - Algorithms for the Grid, INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: Distributing applications over PC clusters to speed up or size up execution is now commonplace, yet efficiently tolerating faults in these systems remains a major issue. To ease the addition of checkpoint-based fault tolerance at the application level, we introduce a Model for Low-Overhead Tolerance of Faults (MoLOToF), which is based on structuring applications using fault-tolerant skeletons. MoLOToF also encourages collaboration with the programmer and the execution environment. The skeletons are adapted to specific parallelization paradigms and yield what can be called fault-tolerant algorithmic skeletons. Applying MoLOToF to the SPMD parallelization paradigm results in our proposed FT-SPMD framework. Experiments show that developing an application with the framework adds little complexity and that using the framework has a small impact on performance. Comparisons with existing system-level checkpoint solutions, namely LAM/MPI and DMTCP, show that FT-SPMD has a lower runtime overhead while being more robust when a higher level of fault tolerance is required.
Document type: Conference paper
The 11th International Conference on Parallel and Distributed Computing, Applications and Technologies - PDCAT 2010, Dec 2010, Wuhan, China. 2010, 〈10.1109/PDCAT.2010.89〉
https://hal.inria.fr/inria-00548953
Contributor: Constantinos Makassikis
Submitted on: Tuesday, December 21, 2010 - 08:19:13
Last modified on: Thursday, January 11, 2018 - 06:19:48

Citation

Constantinos Makassikis, Virginie Galtier, Stéphane Vialle. A Skeletal-Based Approach for the Development of Fault-Tolerant SPMD Applications. The 11th International Conference on Parallel and Distributed Computing, Applications and Technologies - PDCAT 2010, Dec 2010, Wuhan, China. 2010, 〈10.1109/PDCAT.2010.89〉. 〈inria-00548953〉
