Conference papers

Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing

Abstract : Efficient checkpointing of distributed data structures periodically at key moments during runtime is a recurring fundamental pattern in a large number of use cases: fault tolerance based on checkpoint-restart, in-situ or post-analytics, reproducibility, adjoint computations, etc. In this context, multilevel checkpointing is a popular technique: distributed processes can write their shard of the data independently to fast local storage tiers, then flush asynchronously to a shared, slower tier of higher capacity. However, given the limited capacity of fast tiers (e.g., GPU memory) and the increasing checkpoint frequency, the processes often run out of space and need to fall back to blocking writes to the slow tiers. To mitigate this problem, compression is often applied in order to reduce the checkpoint sizes. Unfortunately, this reduction is not uniform: some processes will have spare capacity left on the fast tiers, while others still run out of space. In this paper, we study the problem of how to leverage this imbalance in order to reduce I/O overheads for multilevel checkpointing. To this end, we solve an optimization problem of how much data to send from each process that runs out of space to the processes that have spare capacity in order to minimize the amount of time spent blocking in I/O. We propose two algorithms: one based on a greedy approach and the other based on modified minimum cost flows. We evaluate our proposal using synthetic and real-life application traces. Our evaluation shows that both algorithms achieve significant improvements in checkpoint performance over traditional multilevel checkpointing.
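
To make the optimization problem in the abstract more concrete, the following is a minimal illustrative sketch of a greedy assignment, not the algorithm from the paper. It assumes every process has the same fast-tier capacity, ignores transfer costs and network topology, and simply pairs the processes with the largest overflow with the processes with the most spare capacity. The function name `greedy_offload` and its parameters are hypothetical.

```python
def greedy_offload(sizes, capacity):
    """Illustrative greedy sketch (an assumption, not the paper's algorithm).

    sizes:    compressed checkpoint size of each process (arbitrary units)
    capacity: fast-tier capacity, assumed identical for every process
    Returns (transfers, leftover), where transfers is a list of
    (source, destination, amount) and leftover is the total amount that
    still has to be written with blocking I/O to the slow tier.
    """
    # Processes whose checkpoint does not fit in their own fast tier.
    overflow = {i: s - capacity for i, s in enumerate(sizes) if s > capacity}
    # Processes that still have room on their fast tier.
    spare = {i: capacity - s for i, s in enumerate(sizes) if s < capacity}

    transfers = []
    # Serve the largest overflow first.
    for src in sorted(overflow, key=overflow.get, reverse=True):
        # Try the receivers with the most remaining spare capacity first.
        for dst in sorted(spare, key=spare.get, reverse=True):
            if overflow[src] == 0:
                break
            amount = min(overflow[src], spare[dst])
            if amount > 0:
                transfers.append((src, dst, amount))
                overflow[src] -= amount
                spare[dst] -= amount

    leftover = sum(overflow.values())
    return transfers, leftover


if __name__ == "__main__":
    # One process overflows by 10 units; its only peer has 2 units spare.
    print(greedy_offload([20, 8], 10))
```

In this toy model, whatever remains in `leftover` stands in for the data that would still block on the slow tier. A realistic scheduler would also need to account for the time it takes to move data between processes, which this sketch deliberately leaves out.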

https://hal.archives-ouvertes.fr/hal-03344362
Contributor : Bogdan Nicolae
Submitted on : Wednesday, September 15, 2021 - 2:29:26 AM
Last modification on : Monday, October 4, 2021 - 4:58:02 PM
Long-term archiving on : Thursday, December 16, 2021 - 6:06:34 PM

File

SimG__MASCOTS_2021.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03344362, version 1

Citation

Avinash Maurya, Bogdan Nicolae, M Mustafa Rafique, Thierry Tonellot, Franck Cappello. Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing. MASCOTS'21: 29th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, Nov 2021, Virtual, Portugal. ⟨hal-03344362⟩
