Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper, Year: 2008

Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications

Abstract

Fault-tolerance protocols play an important role in today's long-running parallel scientific applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. In this paper we present our approach and preliminary results on a new checkpoint/rollback protocol based on a coordinated scheme. One feature of this protocol is that fault recovery requires only a partial restart of the other processes, thanks to the availability of an abstract representation of the execution. Simulations on a domain decomposition application show that both the amount of computation required to restart and the number of processes involved are reduced compared to the classical global rollback protocol.

Domains

Other [cs.OH]
No file deposited

Dates and versions

hal-00691997, version 1 (27-04-2012)

Identifiers

HAL Id: hal-00691997
DOI: 10.1007/978-3-540-87477-5_53

Cite

Xavier Besseron, Thierry Gautier. Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications. MCO 2008: Second International Conference on Modelling, Computation and Optimization in Information Systems and Management Sciences, Sep 2008, Metz, France. pp.497-506, ⟨10.1007/978-3-540-87477-5_53⟩. ⟨hal-00691997⟩
