Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms

Abstract : In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. In addition to the I/O operations inherent in the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden: it increases I/O contention and degrades performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications which share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints for minimizing the global waste on the platform. Our results demonstrate that the optimal checkpoint interval as defined by Young/Daly, while providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can significantly improve overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
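For readers unfamiliar with the Young/Daly interval mentioned in the abstract, the following is a minimal sketch of the standard first-order approximation for a single application: with checkpoint cost C and mean time between failures mu, the period sqrt(2 * C * mu) minimizes the first-order waste C/T + T/(2 * mu). The function names and the numeric values are illustrative assumptions, not taken from the report, and the sketch deliberately ignores I/O contention between applications, which is precisely what the report argues the single-application formula does not capture.

```python
import math


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order Young/Daly period sqrt(2 * C * mu), in the units of the inputs.

    checkpoint_cost (C) and mtbf (mu) are hypothetical parameter names;
    both must be positive and expressed in the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost: C/T for checkpoints plus T/(2*mu) for rework.

    This is the textbook single-application approximation, valid when the
    period is small compared to the MTBF. It is not the platform-level
    model derived in the report.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Illustrative values only (seconds): a 50 s checkpoint and a 10,000 s MTBF.
C, MU = 50.0, 10_000.0
t_opt = young_daly_period(C, MU)
waste_at_opt = first_order_waste(t_opt, C, MU)
```

Checkpointing more or less often than `t_opt` increases the first-order waste in this isolated setting. When several applications share the same I/O bandwidth, the effective checkpoint cost of each one depends on what the others are doing, so this per-application optimum no longer guarantees a platform-wide optimum.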
Document type : Reports

Cited literature : 40 references

https://hal.inria.fr/hal-01621295
Contributor : Equipe Roma
Submitted on : Monday, October 23, 2017 - 10:57:20 AM
Last modification on : Wednesday, November 20, 2019 - 3:18:09 AM
Long-term archiving on : Wednesday, January 24, 2018 - 1:12:09 PM

File

rr9109.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01621295, version 1

Citation

Thomas Hérault, Yves Robert, Aurélien Bouteiller, Dorian Arnold, Kurt Ferreira, et al. Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms. [Research Report] RR-9109, INRIA. 2017, pp. 1-20. ⟨hal-01621295⟩

Metrics

Record views : 477
File downloads : 353