Conference papers

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

Abstract : Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is non-trivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about what local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
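As a reading aid, the difference between synchronous and asynchronous checkpointing described in the abstract can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the class `AsyncCheckpointer`, its methods, and the helper `run_demo` are hypothetical names invented here, two temporary directories stand in for node-local and external storage, and the sketch is not VeloC's API. It also does not implement the paper's model-based choice of local storage device.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


class AsyncCheckpointer:
    """Toy asynchronous checkpointer (hypothetical, not VeloC's API).

    The application blocks only for the write to local storage; a
    background thread later copies each file to external storage.
    """

    def __init__(self, local_dir, external_dir):
        self.local_dir = Path(local_dir)
        self.external_dir = Path(external_dir)
        self.errors = []  # flush failures are recorded, not raised
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def checkpoint(self, name, data):
        """Write bytes to local storage, queue the flush, and return."""
        local_path = self.local_dir / name
        local_path.write_bytes(data)
        self._pending.put(local_path)
        return local_path

    def _flush_loop(self):
        # Background flush: copy queued local files to external storage.
        while True:
            local_path = self._pending.get()
            try:
                shutil.copy2(local_path, self.external_dir / local_path.name)
            except OSError as exc:
                self.errors.append(exc)
            finally:
                self._pending.task_done()

    def wait(self):
        """Block until every queued checkpoint has been flushed."""
        self._pending.join()


def run_demo():
    """Checkpoint once, wait for the flush, return the flushed bytes."""
    with tempfile.TemporaryDirectory() as local, \
            tempfile.TemporaryDirectory() as external:
        ckpt = AsyncCheckpointer(local, external)
        ckpt.checkpoint("step_0001.bin", b"application state")
        ckpt.wait()
        return (Path(external) / "step_0001.bin").read_bytes()
```

In this sketch, `checkpoint` returns as soon as the local write completes, so the application overlaps computation with the flush; a synchronous variant would instead call `wait` after every checkpoint.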

Cited literature [25 references]
Contributor : Bogdan Nicolae
Submitted on : Monday, July 15, 2019 - 6:15:41 PM
Last modification on : Tuesday, July 16, 2019 - 2:22:49 PM


Files produced by the author(s)


  • HAL Id : hal-02184203, version 1


Bogdan Nicolae, Adam Moody, Elsa Gonsiorowski, Kathryn Mohror, Franck Cappello. VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale. IPDPS'19: The 2019 IEEE International Parallel and Distributed Processing Symposium, May 2019, Rio de Janeiro, Brazil. pp.911-920. ⟨hal-02184203⟩


