
Fault Tolerance in Cluster Federations with O2P-CF

Thomas Ropars 1 Christine Morin 1
1 PARIS - Programming distributed parallel systems for large scale numerical simulation
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, ENS Cachan - École normale supérieure - Cachan, Inria Rennes – Bretagne Atlantique
Abstract: Fault tolerance is one of the key issues for large-scale applications executed on high-performance computing systems. In a cluster federation, clusters are gathered to provide huge computing power. To work efficiently on such systems, network characteristics have to be taken into account: the latency between two nodes of different clusters is much higher than the latency between two nodes of the same cluster. In this paper, we present O2P-CF, a message logging protocol well suited to providing fault tolerance for message passing applications executed on cluster federations. O2P-CF is based on the combination of O2P, an extremely optimistic message logging protocol, with a pessimistic message logging protocol.
Document type: Conference papers
Contributor : Thomas Ropars
Submitted on : Tuesday, October 13, 2009 - 5:17:55 PM
Last modification on : Friday, July 10, 2020 - 4:20:27 PM


  • HAL Id : inria-00424025, version 1


Thomas Ropars, Christine Morin. Fault Tolerance in Cluster Federations with O2P-CF. Resilience 2008, Workshop on Resiliency in High Performance Computing, May 2008, Lyon, France. ⟨inria-00424025⟩


