Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications

Xavier Besseron 1 Thierry Gautier 1
1 MOAIS - PrograMming and scheduling design fOr Applications in Interactive Simulation
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract : Fault-tolerance protocols play an important role in today's long-running parallel scientific applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. In this paper we present our approach and preliminary results for a new checkpoint/rollback protocol based on a coordinated scheme. One feature of this protocol is that fault recovery requires only a partial restart of the other processes, thanks to the availability of an abstract representation of the execution. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared to the classical global rollback protocol.
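The idea of a partial restart can be illustrated with a toy model. The sketch below is not the protocol from the paper; it is a minimal illustration under stated assumptions: a 1D domain decomposition in which each process exchanges data only with its two neighbours at every iteration, a coordinated checkpoint taken at iteration 0, and no retention of intermediate results after the checkpoint. The function name and parameters are hypothetical. It walks the dependency graph backwards from the failed process's last task and collects only the tasks that must be re-executed.

```python
def tasks_to_reexecute(num_procs, iters_since_ckpt, failed):
    """Return the (process, iteration) tasks needed to rebuild the failed process.

    Toy model (an assumption, not the paper's protocol): task (p, t) reads the
    outputs of tasks (p-1, t-1), (p, t-1) and (p+1, t-1). Iteration 0 is the
    checkpointed state, so it never needs to be recomputed.
    """
    needed = set()
    stack = [(failed, iters_since_ckpt)]
    while stack:
        p, t = stack.pop()
        if t == 0 or (p, t) in needed:
            continue  # checkpointed state, or task already collected
        needed.add((p, t))
        # Follow the dependencies to the previous iteration of the neighbours.
        for q in (p - 1, p, p + 1):
            if 0 <= q < num_procs:
                stack.append((q, t - 1))
    return needed


# Hypothetical scenario: 8 processes, 3 iterations since the checkpoint,
# process 4 fails.
needed = tasks_to_reexecute(num_procs=8, iters_since_ckpt=3, failed=4)
involved = sorted({p for p, _ in needed})
print(len(needed), "tasks to re-execute, versus", 8 * 3, "for a global rollback")
print("processes involved:", involved)
```

In this scenario only 9 of the 24 tasks are re-executed and 5 of the 8 processes take part, whereas a global rollback restarts every process and repeats all 24 tasks. The saving shrinks as the number of iterations since the last checkpoint grows, because the dependency cone widens by one neighbour on each side per iteration.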
Document type :
Conference papers

https://hal.inria.fr/hal-00691997
Contributor : Ist Rennes
Submitted on : Friday, April 27, 2012 - 4:05:45 PM
Last modification on : Thursday, October 11, 2018 - 8:48:03 AM

Citation

Xavier Besseron, Thierry Gautier. Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications. MCO 2008 : Second International Conference on Modelling, Computation and Optimization in Information Systems and Management Sciences, Sep 2008, Metz, France. pp.497-506, ⟨10.1007/978-3-540-87477-5_53⟩. ⟨hal-00691997⟩
