Conference paper, Year: 2005

A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing

Abstract

This paper presents a new checkpoint/recovery method for dataflow computations that use work-stealing in heterogeneous environments such as those found in grid or cluster computing. By basing the state of the computation on a dynamic macro dataflow graph, the mechanisms are shown to provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, Systematic Event Logging and Theft-Induced Checkpointing; both are efficient and extremely flexible under the system-state model, allowing recovery on different platforms with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. The results show that both methods incur very small overhead and that the trade-off between checkpointing cost and recovery cost can be controlled.

Domains

Other [cs.OH]

Dates and versions

hal-00685314, version 1 (04-04-2012)

Cite

Samir Jafar, Thierry Gautier, Axel W. Krings, Jean-Louis Roch. A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing. Euro-Par 2005, Aug 2005, Lisbon, Portugal. ⟨10.1007/11549468_74⟩. ⟨hal-00685314⟩