Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics
Conference paper, Year: 2023

Kevin Assogba
Bogdan Nicolae
Hubertus van Dam
M. Mustafa Rafique

Abstract

High-performance computing applications are increasingly integrating checkpointing libraries for reproducibility analytics. However, capturing an entire checkpoint history for a reproducibility study requires high-frequency checkpointing across thousands of processes, which is challenging. The resulting runtime overhead affects application performance and, when it introduces interleaving during floating-point calculations, the intermediate results as well. In this paper, we extend asynchronous multi-level checkpoint/restart to study the intermediate results generated by scientific workflows. We present an initial prototype of a framework that captures, caches, and compares checkpoint histories from different runs of a scientific application executed using identical input files. We also study the impact of our proposed approach by evaluating the reproducibility of classical molecular dynamics simulations executed using the NWChem software. Experimental results show that our proposed solution improves the checkpoint write bandwidth when capturing checkpoints for reproducibility analysis by a minimum of 30× and up to 211× compared to the default checkpointing approach in NWChem.
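To make the idea of comparing checkpoint histories concrete, the following is a minimal Python sketch, not the framework described in the paper. It assumes each checkpoint is a flat list of floats and each run is a list of such checkpoints; the function name `first_divergence` and the tolerance parameters are hypothetical and chosen only for illustration. It also shows how a change in the order of floating-point additions alone can make two otherwise identical runs differ.

```python
import math


def first_divergence(history_a, history_b, rel_tol=1e-9, abs_tol=0.0):
    """Return the index of the first checkpoint where two runs differ, or None.

    Hypothetical helper: each history is a list of checkpoints, and each
    checkpoint is a list of floats captured at the same iteration.
    """
    for step, (ckpt_a, ckpt_b) in enumerate(zip(history_a, history_b)):
        if len(ckpt_a) != len(ckpt_b):
            return step
        for x, y in zip(ckpt_a, ckpt_b):
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                return step
    # One run produced more checkpoints than the other.
    if len(history_a) != len(history_b):
        return min(len(history_a), len(history_b))
    return None


# Floating-point addition is not associative, so a different evaluation
# order can change an intermediate result even with identical inputs.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

run_a = [[1.0, 2.0], [1.5, left_to_right]]
run_b = [[1.0, 2.0], [1.5, right_to_left]]

exact = first_divergence(run_a, run_b, rel_tol=0.0)       # bitwise-style comparison
tolerant = first_divergence(run_a, run_b, rel_tol=1e-12)  # comparison within a tolerance
```

With an exact comparison the two histories diverge at the second checkpoint, while a small relative tolerance treats them as equivalent. Which tolerance is appropriate depends on the application and is left open here.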
Main file
2023_SuperCheck_RECUP.pdf (685.58 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04343694, version 1 (14-12-2023)

Licence

Attribution

Cite

Kevin Assogba, Bogdan Nicolae, Hubertus van Dam, M. Mustafa Rafique. Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics. SuperCheck'23: The 4th International Symposium on Checkpointing for Supercomputing (with SC'23), Nov 2023, Denver, United States. pp. 1748-1756, ⟨10.1145/3624062.3624256⟩. ⟨hal-04343694⟩