Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper. Year: 2023

Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

Abstract

Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many HPC workflows. This pattern introduces high I/O overheads and increases storage space utilization, especially for workflows that need to capture the evolution of data structures at high frequency as checkpoints. In this context, many applications, such as graph pattern matching, perform sparse updates to large data structures between checkpoints. For these applications, incremental checkpointing techniques that save only the differences from one checkpoint to the next can dramatically reduce checkpoint sizes, I/O bottlenecks, and storage space utilization. However, such techniques are not without challenges: it is non-trivial to transparently determine what data has changed since a previous checkpoint and to assemble the differences in a compact form that does not produce excessive metadata. State-of-the-art data reduction techniques (e.g., compression and de-duplication) have significant limitations when applied to modern HPC applications that leverage GPUs: they are slow at detecting the differences, generate a large amount of metadata to keep track of the differences, and ignore crucial spatiotemporal redundancy in checkpoint data. This paper addresses these challenges by proposing a Merkle tree-based incremental checkpointing method that exploits GPUs' high memory bandwidth and massive parallelism. Experimental results at scale show a significant reduction of the I/O overhead and space utilization of checkpointing compared with state-of-the-art incremental checkpointing and compression techniques.
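To make the idea of Merkle tree-based change detection concrete, the following is a minimal CPU-only sketch in Python. It is not the paper's GPU implementation: the chunk size, the use of SHA-256, and every function name (leaf_hashes, build_tree, changed_chunks, incremental_checkpoint, restore) are assumptions chosen for illustration. It only detects chunks that changed at the same offset between two equally sized checkpoints, and does not attempt the spatiotemporal de-duplication the abstract refers to.

```python
import hashlib

CHUNK = 4  # hypothetical chunk size in bytes, tiny so the example is easy to follow


def leaf_hashes(data: bytes, chunk: int = CHUNK) -> list:
    """Hash every fixed-size chunk of the buffer (the Merkle tree leaves)."""
    return [hashlib.sha256(data[i:i + chunk]).digest()
            for i in range(0, len(data), chunk)]


def build_tree(leaves: list) -> list:
    """Return the tree as a list of levels: leaves first, root last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        # Pair neighbours; an unpaired last node is hashed on its own.
        nxt = [hashlib.sha256(cur[i] + (cur[i + 1] if i + 1 < len(cur) else b"")).digest()
               for i in range(0, len(cur), 2)]
        levels.append(nxt)
    return levels


def changed_chunks(old: list, new: list) -> list:
    """Descend from the root, visiting only subtrees whose hashes differ."""
    if len(old[0]) != len(new[0]):
        return list(range(len(new[0])))  # size changed: treat everything as new
    if not new[0]:
        return []
    top = len(new) - 1
    frontier = [0] if old[top][0] != new[top][0] else []
    for level in range(top - 1, -1, -1):
        nxt = []
        for node in frontier:
            for child in (2 * node, 2 * node + 1):
                if child < len(new[level]) and old[level][child] != new[level][child]:
                    nxt.append(child)
        frontier = nxt
    return frontier


def incremental_checkpoint(prev_tree, data: bytes, chunk: int = CHUNK):
    """Return (new_tree, diff), where diff maps chunk index -> chunk bytes."""
    tree = build_tree(leaf_hashes(data, chunk))
    if prev_tree is None:
        idx = list(range(len(tree[0])))  # first checkpoint stores every chunk
    else:
        idx = changed_chunks(prev_tree, tree)
    return tree, {i: data[i * chunk:(i + 1) * chunk] for i in idx}


def restore(base: bytes, diff: dict, chunk: int = CHUNK) -> bytes:
    """Apply a diff on top of the previous checkpoint (same size assumed)."""
    buf = bytearray(base)
    for i, payload in diff.items():
        buf[i * chunk:i * chunk + len(payload)] = payload
    return bytes(buf)
```

In this sketch a sparse update touches only one root-to-leaf path, so the comparison skips every unchanged subtree and the stored difference contains a single chunk. The per-chunk hashing and the per-level reductions are independent of each other, which is the kind of work that maps naturally onto GPU parallelism.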
Main file
Paper_2023_IEEE_ICPP_IncrementalCheckpointing.pdf (1.03 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04173764, version 1 (30-07-2023)

License

Attribution


Cite

Nigel Tan, Jakob Luettgau, Jack Marquez, Keita Teranishi, Nicolas Morales, et al. Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication. ICPP'23: 52nd International Conference on Parallel Processing, Aug 2023, Salt Lake City, United States. ⟨10.1145/3605573.3605639⟩. ⟨hal-04173764⟩