Skip to Main content Skip to Navigation

Optimal Checkpointing Strategies for Iterative Applications

Abstract : This work provides an optimal checkpointing strategy to protect iterative applications from fail-stop errors. We consider a very general framework, where the application repeats the same execution pattern by executing consecutive iterations, and where each iteration is composed of several tasks. These tasks have different execution lengths and different checkpoint costs. Assume that there are $n$ tasks and that task $a_i$, where $0 ≤ i < n$, has execution time $t_i$ and checkpoint cost $C_i$. A naive strategy would checkpoint after each task. A strategy inspired by the Young/Daly formula would select the task $a_{min}$ with smallest checkpoint cost $C_{min}$ and would checkpoint after every $p^{th}$ instance of that task, leading to a checkpointing period $P_{YD} = pT$ where $\textit{T}{\sum_{i=0}^{n-1}a_i}$ is the time per iteration. One would choose the period so that $P_{YD} \approx \sqrt{ 2{\mu}C_\textrm{min}}$ to obey the Young/Daly formula, where $\mu$ is the application MTBF. Both the naive and Young/Daly strategies are suboptimal. Our main contribution is to show that the optimal checkpoint strategy is globally periodic, and to design a dynamic programming algorithm that computes the optimal checkpointing pattern. This pattern may well checkpoint many different tasks, and this across many different iterations. We show through simulations, both from synthetic and real-life application scenarios, that the optimal strategy significantly outperforms the naive and Young/Daly strategies.
Document type :
Complete list of metadatas

Cited literature [47 references]  Display  Hide  Download
Contributor : Equipe Roma <>
Submitted on : Tuesday, October 27, 2020 - 2:53:21 PM
Last modification on : Monday, November 16, 2020 - 9:56:04 AM


Files produced by the author(s)


  • HAL Id : hal-02980455, version 1


Yishu Du, Loris Marchal, Guillaume Pallez, Yves Robert. Optimal Checkpointing Strategies for Iterative Applications. [Research Report] RR-9371, Inria - Research Centre Grenoble – Rhône-Alpes. 2020. ⟨hal-02980455⟩



Record views


Files downloads