A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing

Abstract : This paper presents a new checkpoint/recovery method for dataflow computations that use work-stealing in heterogeneous environments such as grids or clusters. By basing the state of the computation on a dynamic macro dataflow graph, the mechanisms are shown to provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, Systematic Event Logging and Theft-Induced Checkpointing; both are efficient and extremely flexible under the system-state model, allowing recovery on a different platform and with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. Both methods are shown to have very small overhead, and the trade-off between checkpointing cost and recovery cost can be controlled.
Document type : Conference papers

https://hal.inria.fr/hal-00685314
Contributor : Ist Rennes
Submitted on : Wednesday, April 4, 2012 - 4:59:22 PM
Last modification on : Thursday, October 11, 2018 - 8:48:03 AM

Citation

Samir Jafar, Thierry Gautier, Axel W. Krings, Jean-Louis Roch. A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing. Euro-Par 2005, Aug 2005, Lisbonne, Portugal. ⟨10.1007/11549468_74⟩. ⟨hal-00685314⟩
