Conference paper. Year: 2016

On the correct application of AD checkpointing to adjoint MPI-parallel programs

Abstract

Checkpointing is a classical technique to mitigate the overhead of adjoint Algorithmic Differentiation (AD). In the context of source-transformation AD with the Store-All approach, checkpointing reduces the peak memory consumption of the adjoint, at the cost of duplicate runs of selected pieces of the code. Checkpointing is vital for codes with long run times, which is the case for most MPI-parallel applications. However, the presence of MPI communications seriously restricts the application of checkpointing. In most attempts to apply checkpointing to adjoint MPI codes (the "popular" approach), a number of restrictions apply to the form of communications that occur in the checkpointed piece of code. In many works, these restrictions are not explicit, and an application that does not respect them may produce erroneous code. We propose techniques to apply checkpointing to adjoint MPI codes that either do not assume these restrictions or make them explicit so that end users can verify their applicability. These techniques rely both on adapting the snapshot mechanism of checkpointing and on modifying the behavior of communication calls. One technique is based on logging the values received, so that the duplicated communications need not take place. Although this technique completely lifts the restrictions on checkpointing MPI codes, message logging makes it more costly than the popular approach. However, we can refine this technique to blend message logging and communication duplication whenever possible, so that the refined technique encompasses the popular approach. We provide elements of a proof of correctness of our refined technique, i.e. that it preserves the semantics of the adjoint code and that it does not introduce deadlocks.
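The message-logging idea mentioned in the abstract can be illustrated with a toy sketch. It is not the authors' implementation and uses no real MPI: the class `LoggingChannel`, its methods, and the function `checkpointed_piece` are hypothetical names chosen for illustration, and a simple in-memory queue stands in for incoming messages. The sketch only shows the principle that values received during the first run of a checkpointed piece are recorded, so that the duplicate run can read them from the log instead of communicating again.

```python
from collections import deque


class LoggingChannel:
    """Toy stand-in for a receive endpoint (hypothetical, not an MPI API)."""

    def __init__(self, incoming):
        self.incoming = deque(incoming)  # messages that arrive exactly once
        self.log = []                    # values recorded during the first run
        self.replay_pos = None           # None: live mode; int: replay mode

    def recv(self):
        if self.replay_pos is None:
            # Live mode: consume a real message and record its value.
            value = self.incoming.popleft()
            self.log.append(value)
            return value
        # Replay mode: read from the log, no communication takes place.
        value = self.log[self.replay_pos]
        self.replay_pos += 1
        return value

    def start_replay(self):
        self.replay_pos = 0


def checkpointed_piece(x, chan):
    """A piece of code that contains two receives."""
    a = chan.recv()
    b = chan.recv()
    return x * a + b


chan = LoggingChannel([3.0, 4.0])
snapshot = 2.0                                 # input state saved before the piece
first = checkpointed_piece(snapshot, chan)     # forward run, messages are logged
chan.start_replay()
second = checkpointed_piece(snapshot, chan)    # duplicate run restarted from the snapshot
```

In this sketch the duplicate run reproduces the first result even though the message queue is already empty, which is why logging removes the need to repeat the communication. The cost is the memory held by `log`, consistent with the abstract's remark that message logging is more expensive than the popular approach.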
Main file
MPIPaper.pdf (574.72 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-01413394 , version 1 (09-12-2016)

Identifiers

  • HAL Id : hal-01413394 , version 1

Cite

Ala Taftaf, Laurent Hascoët. On the correct application of AD checkpointing to adjoint MPI-parallel programs. VII European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2016), Jun 2016, Crete, Greece. ⟨hal-01413394⟩