Theft-Induced Checkpointing for Reconfigurable Dataflow Applications
Abstract
n this paper a new checkpoint/recovery protocol called theft-induced checkpointing is defined for dataflow computations in large heterogeneous environments. The protocol is especially useful in massively parallel multi-threaded computations as found in cluster or grid computing and utilizes the principle of work-stealing to distribute work. By basing the state of executions on a macro dataflow graph, the protocol shows extreme flexibility with respect to rollback. Specifically, it allows local rollback in dynamic heterogeneous systems, even under a different number of processors and processes. To maximize run-time efficiency, the overhead associated with checkpointing is shifted to the rollback operations whenever possible. Experimental results show the overhead induced is very small