A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing

Samir Jafar 1, Thierry Gautier 1, Axel W. Krings 2, Jean-Louis Roch 1
1 MOAIS - PrograMming and scheduling design fOr Applications in Interactive Simulation
Inria Grenoble - Rhône-Alpes, LIG [2007-2015] - Laboratoire d'Informatique de Grenoble [2007-2015]
Abstract : This paper presents a new checkpoint/recovery method for dataflow computations that use work-stealing in heterogeneous environments such as grids and clusters. The state of the computation is represented by a dynamic macro dataflow graph, and it is shown that the resulting mechanisms provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, Systematic Event Logging and Theft-Induced Checkpointing. Both are efficient and extremely flexible under the system-state model, allowing recovery on different platforms and with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. It is shown that both methods have very small overhead and that the trade-off between checkpointing cost and recovery cost can be controlled.
Document type : Conference papers

https://hal.inria.fr/hal-00685314
Contributor : Ist Rennes
Submitted on : Wednesday, April 4, 2012 - 4:59:22 PM
Last modification on : Friday, July 17, 2020 - 11:10:25 AM

Citation

Samir Jafar, Thierry Gautier, Axel W. Krings, Jean-Louis Roch. A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing. Euro-Par 2005, Aug 2005, Lisbon, Portugal. ⟨10.1007/11549468_74⟩. ⟨hal-00685314⟩
