On the correct application of AD checkpointing to adjoint MPI-parallel programs

Abstract: Checkpointing is a classical technique to mitigate the overhead of adjoint Algorithmic Differentiation (AD). In the context of source-transformation AD with the Store-All approach, checkpointing reduces the peak memory consumption of the adjoint, at the cost of duplicate runs of selected pieces of the code. Checkpointing is vital for codes with long run times, which is the case for most MPI-parallel applications. However, the presence of MPI communications seriously restricts the application of checkpointing. In most attempts to apply checkpointing to adjoint MPI codes (the "popular" approach), a number of restrictions apply to the form of communications that occur in the checkpointed piece of code. In many works, these restrictions are not made explicit, and an application that does not respect them may produce erroneous code. We propose techniques to apply checkpointing to adjoint MPI codes that either do not assume these restrictions or make them explicit so that end users can verify their applicability. These techniques rely both on adapting the snapshot mechanism of checkpointing and on modifying the behavior of communication calls. One technique is based on logging the values received, so that the duplicated communications need not take place. Although this technique completely lifts the restrictions on checkpointing MPI codes, message logging makes it more costly than the popular approach. However, we can refine this technique to blend message logging and communication duplication whenever possible, so that the refined technique encompasses the popular approach. We provide elements of a proof of correctness of our refined technique, i.e. that it preserves the semantics of the adjoint code and that it does not introduce deadlocks.
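
The message-logging idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and uses no real MPI: the `deque` stands in for incoming messages, and `LoggingReceiver`, `start_replay`, and `checkpointed_piece` are hypothetical names introduced only for illustration. The first run of the checkpointed piece logs every received value; the duplicate run reads from the log, so no communication is repeated.

```python
from collections import deque


class LoggingReceiver:
    """Hypothetical receive wrapper (not part of MPI or of any AD tool)."""

    def __init__(self, channel):
        self.channel = channel   # stands in for the incoming message queue
        self.log = []            # values received during the first run
        self.replaying = False
        self.cursor = 0

    def recv(self):
        if self.replaying:
            # Duplicate run: return the logged value, no communication.
            value = self.log[self.cursor]
            self.cursor += 1
            return value
        # First run: perform the receive and log the value.
        value = self.channel.popleft()
        self.log.append(value)
        return value

    def start_replay(self):
        self.replaying = True
        self.cursor = 0


def checkpointed_piece(x, receiver):
    """Toy checkpointed piece: combines local data with two received values."""
    a = receiver.recv()
    b = receiver.recv()
    return x * a + b


channel = deque([3.0, 4.0])      # messages sent once by a partner process
receiver = LoggingReceiver(channel)

snapshot = 2.0                   # input of the piece, saved before the first run
first = checkpointed_piece(snapshot, receiver)

receiver.start_replay()          # the partner does not resend anything
second = checkpointed_piece(snapshot, receiver)
```

In this sketch the duplicate run reproduces the first result even though the channel is already empty, at the price of storing the log. That storage is the extra cost the abstract attributes to message logging.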
Document type:
Conference paper
VII European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2016), Jun 2016, Crete, Greece. 2016, VII European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2016). 〈https://www.eccomas2016.org/〉


https://hal.inria.fr/hal-01413394
Contributor: Laurent Hascoet
Submitted on: Friday, December 9, 2016 - 17:00:14
Last modified on: Thursday, January 11, 2018 - 16:48:47
Document(s) archived on: Tuesday, March 28, 2017 - 00:14:41

File

MPIPaper.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-01413394, version 1

Citation

Ala Taftaf, Laurent Hascoët. On the correct application of AD checkpointing to adjoint MPI-parallel programs. VII European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2016), Jun 2016, Crete, Greece. 2016, VII European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2016). 〈https://www.eccomas2016.org/〉. 〈hal-01413394〉
