Conference papers

Fault Tolerance in Cluster Federations with O2P-CF

Thomas Ropars (1), Christine Morin (1)
(1) PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract: Fault tolerance is one of the key issues for large-scale applications executed on high-performance computing systems. In a cluster federation, clusters are gathered to provide huge computing power. To work efficiently on such systems, network characteristics have to be taken into account: the latency between two nodes of different clusters is much higher than the latency between two nodes of the same cluster. In this paper, we present O2P-CF, a message logging protocol well suited to providing fault tolerance for message-passing applications executed on cluster federations. O2P-CF is based on the combination of O2P, an extremely optimistic message logging protocol, with a pessimistic message logging protocol.
Document type: Conference paper

https://hal.inria.fr/inria-00424025
Contributor: Thomas Ropars
Submitted on: Tuesday, October 13, 2009 - 5:17:55 PM
Last modification on: Friday, February 4, 2022 - 3:24:12 AM

Identifiers

  • HAL Id: inria-00424025, version 1

Citation

Thomas Ropars, Christine Morin. Fault Tolerance in Cluster Federations with O2P-CF. Resilience 2008, Workshop on Resiliency in High Performance Computing, May 2008, Lyon, France. ⟨inria-00424025⟩
