Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms

Abstract : In high-performance computing environments, input/output (I/O) from various sources often contends for scarce available bandwidth. In addition to the I/O operations inherent to the failure-free execution of an application, I/O from checkpoint/restart (CR) operations (used to ensure progress in the presence of failures) places an additional burden, as it increases I/O contention and leads to degraded performance. In this work, we consider a cooperative scheduling policy that optimizes the overall performance of concurrently executing CR-based applications that share valuable I/O resources. First, we provide a theoretical model and then derive a set of necessary constraints needed to minimize the global waste on the platform. Our results demonstrate that the optimal checkpoint interval, as defined by Young/Daly, despite providing a sensible metric for a single application, is not sufficient to optimally address resource contention at the platform scale. We therefore show that combining optimal checkpointing periods with I/O scheduling strategies can provide a significant improvement in overall application performance, thereby maximizing platform throughput. Overall, these results provide critical analysis and direct guidance on checkpointing large-scale workloads in the presence of competing I/O while minimizing the impact on application performance.
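
For readers unfamiliar with the Young/Daly interval mentioned in the abstract, the following is a minimal sketch of the classical single-application, first-order formulas. It is not the cooperative model of the paper. The function names, the parameter names, and the example values (a 60-second checkpoint, a one-day MTBF) are assumptions chosen for illustration only. The sketch assumes the usual first-order approximation in which the period is sqrt(2 * C * mu) and the waste is C/T + T/(2 * mu), ignoring downtime and recovery cost.

```python
import math


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order Young/Daly checkpointing period, sqrt(2 * C * mu).

    checkpoint_cost (C) and mtbf (mu) must use the same time unit.
    Both names are illustrative, not taken from the paper.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time wasted: C/T + T/(2 * mu).

    The first term is checkpoint overhead; the second is the expected
    re-execution after a failure. Downtime and recovery are ignored.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 60 s checkpoint, MTBF of one day (in seconds).
    C, mu = 60.0, 86400.0
    T = young_daly_period(C, mu)
    print(f"period: {T:.1f} s, waste: {first_order_waste(T, C, mu):.4f}")

    # If I/O contention inflated the effective checkpoint cost (an assumed
    # factor of 3 here), the same period would yield a higher waste.
    print(f"waste with 3x checkpoint cost: {first_order_waste(T, 3 * C, mu):.4f}")
```

The last line hints at why a per-application period alone may fall short on a shared platform: the checkpoint cost it treats as a constant depends on what other applications are doing with the I/O subsystem at the same time.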
Document type : Conference papers
Contributor : Equipe Roma
Submitted on : Wednesday, January 23, 2019 - 10:06:31 PM
Last modification on : Friday, September 30, 2022 - 4:12:21 AM


Files produced by the author(s)


  • HAL Id : hal-01968441, version 1



Thomas Herault, Yves Robert, Aurelien Bouteiller, Dorian Arnold, Kurt B Ferreira, et al. Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms. APDCM 2018 : 20th Workshop on Advances in Parallel and Distributed Computational Models, May 2018, Vancouver, Canada. ⟨hal-01968441⟩


