A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing

Abstract: This paper presents a new checkpoint/recovery method for dataflow computations using work-stealing in heterogeneous environments such as those found in grid or cluster computing. By basing the state of the computation on a dynamic macro dataflow graph, the mechanisms are shown to provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, Systematic Event Logging and Theft-Induced Checkpointing. Both are efficient and extremely flexible under the system-state model, allowing recovery on different platforms and with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. It is shown that both methods have very small overhead and that trade-offs between checkpointing and recovery cost can be controlled.
Document type:
Conference paper
José C. Cunha and Pedro Medeiros. Euro-Par 2005, Aug 2005, Lisbon, Portugal. Springer, 3648, 2005, Lecture Notes in Computer Science. 〈10.1007/11549468_74〉

https://hal.inria.fr/hal-00685314
Contributor: Ist Rennes
Submitted on: Wednesday, April 4, 2012 - 16:59:22
Last modified on: Wednesday, April 11, 2018 - 01:53:59

Citation

Samir Jafar, Thierry Gautier, Axel W. Krings, Jean-Louis Roch. A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing. José C. Cunha and Pedro Medeiros. Euro-Par 2005, Aug 2005, Lisbon, Portugal. Springer, 3648, 2005, Lecture Notes in Computer Science. 〈10.1007/11549468_74〉. 〈hal-00685314〉
