VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale - Inria - Institut national de recherche en sciences et technologies du numérique
Conference paper, Year: 2019

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

Bogdan Nicolae
Adam Moody
Elsa Gonsiorowski
  • Role: Author
  • PersonId: 1050721
Kathryn Mohror
  • Role: Author
  • PersonId: 1050722
Franck Cappello
  • Role: Author
  • PersonId: 828491

Abstract

Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern of many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing can often lead to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core count per node and heterogeneity of both local and external storage, it is non-trivial to design efficient asynchronous checkpointing mechanisms due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood but highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about what local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
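The difference between synchronous and asynchronous checkpointing described in the abstract can be illustrated with a minimal sketch. It is not the VeloC API and does not reproduce the paper's performance model or monitoring; the class name `AsyncCheckpointer`, its methods, and the use of two temporary directories as stand-ins for node-local and external storage are all assumptions made for illustration. The application blocks only for the local write, while a background thread copies the file to the slower tier.

```python
import os
import queue
import shutil
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical toy checkpointer: blocking local write, background flush."""

    def __init__(self, local_dir, external_dir):
        self.local_dir = local_dir        # stand-in for fast node-local storage
        self.external_dir = external_dir  # stand-in for a parallel file system
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def checkpoint(self, name, data):
        """Write to local storage, then return; the flush happens later."""
        local_path = os.path.join(self.local_dir, name)
        with open(local_path, "wb") as f:
            f.write(data)
        self._pending.put(name)
        return local_path

    def _flush_loop(self):
        """Background thread: copy each local checkpoint to external storage."""
        while True:
            name = self._pending.get()
            try:
                if name is None:  # sentinel posted by close()
                    return
                shutil.copyfile(
                    os.path.join(self.local_dir, name),
                    os.path.join(self.external_dir, name),
                )
            finally:
                self._pending.task_done()

    def wait(self):
        """Block until every queued checkpoint has been flushed."""
        self._pending.join()

    def close(self):
        """Stop the background thread after the queue drains."""
        self._pending.put(None)
        self._worker.join()


def demo():
    """Take three checkpoints and return the files that reached external storage."""
    with tempfile.TemporaryDirectory() as local, \
            tempfile.TemporaryDirectory() as external:
        ckpt = AsyncCheckpointer(local, external)
        for step in range(3):
            ckpt.checkpoint(f"ckpt_{step}.bin", bytes([step]) * 1024)
        ckpt.wait()
        flushed = sorted(os.listdir(external))
        ckpt.close()
        return flushed
```

A synchronous variant would perform the copy inside `checkpoint` itself, so the application would pay the external-storage latency on every call. In this sketch that cost is paid only when `wait` is called.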
Main file
paper.pdf (527.76 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02184203, version 1 (15-07-2019)

Identifiers

  • HAL Id: hal-02184203, version 1

Cite

Bogdan Nicolae, Adam Moody, Elsa Gonsiorowski, Kathryn Mohror, Franck Cappello. VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale. IPDPS'19: The 2019 IEEE International Parallel and Distributed Processing Symposium, May 2019, Rio de Janeiro, Brazil. pp.911-920. ⟨hal-02184203⟩
