Conference papers

Towards High Performance Resilience using Performance Portable Abstractions

Abstract : In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels places a major development burden on HPC applications. To this end, performance portable abstractions such as those advocated by Kokkos, RAJA and HPX are becoming increasingly popular. At the same time, the unprecedented scalability requirements of such heterogeneous components mean higher failure rates, motivating the need for resilience in systems and applications. Unfortunately, state-of-the-art resilience techniques based on checkpoint/restart are lagging behind performance portability efforts: users still need to capture consistent states manually, which introduces the need for fine-tuning and customization. In this paper we aim to close this gap by introducing a set of abstractions that make it easier for application developers to reason about resilience. To this end, we extend the existing abstractions proposed by performance portability efforts towards resilience. By marking critical data structures that need to be checkpointed, one can enable an optimized runtime to automate checkpoint-restart using high-performance, scalable asynchronous techniques. We illustrate the feasibility of our proposal using a prototype that combines the Kokkos runtime (HPC performance portability) with the VELOC runtime (large-scale, low-overhead checkpoint-restart). Our experimental results show negligible performance overhead compared with a manually tuned implementation of checkpoint-restart, while requiring minimal changes in the application code.
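
Illustration : The idea of marking critical data structures and letting a runtime automate checkpoint-restart can be sketched in a few lines. The sketch below is only a conceptual, language-neutral illustration written in Python; it is not the authors' prototype and does not use the Kokkos or VELOC APIs. Every name in it (`CheckpointRegistry`, `protect`, `checkpoint`, `restart`, `run`) is hypothetical, and the background thread is merely a stand-in for an asynchronous checkpointing runtime.

```python
import os
import pickle
import tempfile
import threading


class CheckpointRegistry:
    """Hypothetical helper: tracks marked buffers and checkpoints them."""

    def __init__(self, directory):
        self.directory = directory
        self.protected = {}   # name -> list marked as critical state
        self._writer = None   # thread of the in-flight checkpoint, if any

    def protect(self, name, buffer):
        """Mark a list as critical state to include in every checkpoint."""
        self.protected[name] = buffer

    def _path(self, version):
        return os.path.join(self.directory, f"ckpt-{version}.pkl")

    def checkpoint(self, version):
        """Copy the marked buffers, then write the copy in the background."""
        self.wait()  # allow at most one checkpoint in flight
        snapshot = {name: list(buf) for name, buf in self.protected.items()}

        def write():
            tmp = self._path(version) + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(snapshot, f)
            os.replace(tmp, self._path(version))  # publish atomically

        self._writer = threading.Thread(target=write)
        self._writer.start()

    def wait(self):
        """Block until the in-flight checkpoint (if any) is on disk."""
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def latest_version(self):
        versions = [int(n[5:-4]) for n in os.listdir(self.directory)
                    if n.startswith("ckpt-") and n.endswith(".pkl")]
        return max(versions, default=None)

    def restart(self):
        """Restore marked buffers in place from the newest checkpoint."""
        self.wait()
        version = self.latest_version()
        if version is None:
            return None
        with open(self._path(version), "rb") as f:
            snapshot = pickle.load(f)
        for name, buf in self.protected.items():
            buf[:] = snapshot[name]
        return version


def run(directory, steps=10, interval=3, fail_at=None):
    """Toy iterative solver; the only resilience code is the marked lines."""
    state = [0.0] * 4
    reg = CheckpointRegistry(directory)
    reg.protect("state", state)                      # mark critical data
    restored = reg.restart()                         # resume if possible
    start = 0 if restored is None else restored + 1
    try:
        for step in range(start, steps):
            if step == fail_at:
                raise RuntimeError("simulated failure")
            for i in range(len(state)):
                state[i] += i + 1                    # the actual "compute"
            if step % interval == 0:
                reg.checkpoint(step)                 # overlaps with compute
    finally:
        reg.wait()  # flush so the simulated crash is deterministic
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        try:
            run(d, fail_at=5)
        except RuntimeError:
            pass
        print(run(d))  # resumes from the newest checkpoint
```

In this sketch the application marks `state` once and calls `checkpoint` at chosen iterations; the snapshot copy is taken synchronously while the write to storage proceeds in the background, which is one simple way to overlap checkpoint I/O with computation.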

https://hal.archives-ouvertes.fr/hal-03260432
Contributor : Bogdan Nicolae
Submitted on : Tuesday, June 15, 2021 - 4:01:06 AM
Last modification on : Tuesday, July 6, 2021 - 1:09:41 PM
Long-term archiving on : Thursday, September 16, 2021 - 6:06:43 PM

File

Euro-Par2021_PDF_70.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03260432, version 1

Citation

Nicolas Morales, Keita Teranishi, Bogdan Nicolae, Christian Trott, Franck Cappello. Towards High Performance Resilience using Performance Portable Abstractions. 27th International European Conference on Parallel and Distributed Computing, Sep 2021, Lisbon (on line), Portugal. ⟨hal-03260432⟩
