Fault Tolerance in Cluster Federations with O2P-CF

Thomas Ropars 1, Christine Morin 1
1 PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract: Fault tolerance is one of the key issues for large-scale applications executed on high-performance computing systems. In a cluster federation, clusters are gathered to provide very large computing power. To work efficiently on such systems, network characteristics have to be taken into account: the latency between two nodes of different clusters is much higher than the latency between two nodes of the same cluster. In this paper, we present O2P-CF, a message logging protocol well suited to providing fault tolerance for message-passing applications executed on cluster federations. O2P-CF combines O2P, an extremely optimistic message logging protocol, with a pessimistic message logging protocol.
Document type:
Conference paper
Resilience 2008, Workshop on Resiliency in High Performance Computing, May 2008, Lyon, France. 2008
https://hal.inria.fr/inria-00424025
Contributor: Thomas Ropars
Submitted on: Tuesday, 13 October 2009 - 17:17:55
Last modified on: Wednesday, 16 May 2018 - 11:23:04

Identifiers

  • HAL Id: inria-00424025, version 1

Citation

Thomas Ropars, Christine Morin. Fault Tolerance in Cluster Federations with O2P-CF. Resilience 2008, Workshop on Resiliency in High Performance Computing, May 2008, Lyon, France. 2008. 〈inria-00424025〉
