Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications

Xavier Besseron 1 Thierry Gautier 1
1 MOAIS - PrograMming and scheduling design fOr Applications in Interactive Simulation
Inria Grenoble - Rhône-Alpes, LIG - Laboratoire d'Informatique de Grenoble
Abstract: Fault-tolerance protocols play an important role in today's long-running scientific parallel applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. In this paper we present our approach and preliminary results for a new checkpoint/rollback protocol based on a coordinated scheme. One feature of this protocol is that fault recovery requires only a partial restart of the other processes, thanks to the availability of an abstract representation of the execution. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared with the classical global rollback protocol.
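To make the contrast between global rollback and partial restart concrete, the following is a minimal toy model, not the protocol described in the paper. It assumes a one-dimensional chain of subdomains in which each process needs its neighbours' previous-step state to advance, a single coordinated checkpoint, and one failed process; the function name `replay_cost` and its parameters are hypothetical. Under those assumptions, a process at distance d from the failed one only has to replay the steps whose results can still reach the failed process before the failure point.

```python
def replay_cost(num_procs, failed, steps_since_checkpoint):
    """Toy comparison of recovery work on a 1D chain of subdomains.

    Assumptions (illustrative only, not taken from the paper):
    - every process saved a coordinated checkpoint at the same step;
    - computing step t on a subdomain needs the neighbours' state at t-1;
    - work is counted in subdomain steps that must be re-executed.

    Returns (global_cost, partial_cost, involved_processes).
    """
    # Global rollback: every process re-executes every step since the checkpoint.
    global_cost = num_procs * steps_since_checkpoint

    partial_cost = 0
    involved = 0
    for p in range(num_procs):
        distance = abs(p - failed)
        # A process at distance d only replays the steps whose output
        # can still influence the failed process: k - d of them.
        steps = max(0, steps_since_checkpoint - distance)
        if steps > 0:
            involved += 1
        partial_cost += steps
    return global_cost, partial_cost, involved


# 10 subdomains, process 4 fails 3 steps after the checkpoint.
print(replay_cost(10, 4, 3))  # (30, 9, 5)
```

In this toy setting the saving grows with the number of processes and shrinks as the interval since the last checkpoint grows, since the set of processes that must replay widens by one neighbour on each side per step.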
Document type:
Conference paper
Springer. MCO 2008 : Second International Conference on Modelling, Computation and Optimization in Information Systems and Management Sciences, Sep 2008, Metz, France. Springer, 14, pp.497-506, 2008, Communications in Computer and Information Science. 〈10.1007/978-3-540-87477-5_53〉

https://hal.inria.fr/hal-00691997
Contributor: Ist Rennes
Submitted on: Friday, 27 April 2012 - 16:05:45
Last modified on: Thursday, 11 October 2018 - 08:48:03

Citation

Xavier Besseron, Thierry Gautier. Optimised recovery with a coordinated checkpoint/rollback protocol for domain decomposition applications. Springer. MCO 2008 : Second International Conference on Modelling, Computation and Optimization in Information Systems and Management Sciences, Sep 2008, Metz, France. Springer, 14, pp.497-506, 2008, Communications in Computer and Information Science. 〈10.1007/978-3-540-87477-5_53〉. 〈hal-00691997〉
