On the correct application of AD checkpointing to adjoint MPI-parallel programs

Abstract : Checkpointing is a classical technique to mitigate the overhead of adjoint Algorithmic Differentiation (AD). In the context of source-transformation AD with the Store-All approach, checkpointing reduces the peak memory consumption of the adjoint, at the cost of duplicate runs of selected pieces of the code. Checkpointing is vital for codes with long run times, which is the case for most MPI-parallel applications. However, the presence of MPI communications seriously restricts the application of checkpointing. In most attempts to apply checkpointing to adjoint MPI codes (the "popular" approach), a number of restrictions apply to the form of the communications that occur in the checkpointed piece of code. In many works, these restrictions are not made explicit, and an application that does not respect them may produce erroneous code. We propose techniques to apply checkpointing to adjoint MPI codes that either do not assume these restrictions or make them explicit, so that end users can verify their applicability. These techniques rely both on adapting the snapshot mechanism of checkpointing and on modifying the behavior of communication calls. One technique is based on logging the values received, so that the duplicated communications need not take place. Although this technique completely lifts the restrictions on checkpointing MPI codes, message logging makes it more costly than the popular approach. However, we can refine this technique to blend message logging and communication duplication wherever possible, so that the refined technique encompasses the popular approach. We provide elements of a proof of correctness of our refined technique, i.e. that it preserves the semantics of the adjoint code and does not introduce deadlocks.
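
The message-logging idea in the abstract can be illustrated with a toy, single-process sketch. It is not the authors' implementation and uses no real MPI library: `LoggingComm`, `checkpointed_piece`, and `run_example` are hypothetical names, and the peer process is simulated by a loopback queue (an assumption made only to keep the example self-contained). During the first run of the checkpointed piece, every received value is logged; during the duplicate run, sends are skipped and receives read from the log, so no communication is repeated.

```python
from collections import deque


class LoggingComm:
    """Toy stand-in for a communicator (hypothetical, not an MPI API)."""

    def __init__(self, channel):
        self.channel = channel   # queue standing in for the network
        self.log = []            # values received during the first run
        self.replaying = False   # True while the duplicate run executes
        self._cursor = 0         # position of the next logged value

    def send(self, value):
        if self.replaying:
            return               # duplicate run: the message is not resent
        self.channel.append(value)

    def recv(self):
        if self.replaying:
            value = self.log[self._cursor]   # reuse the logged value
            self._cursor += 1
            return value
        value = self.channel.popleft()
        self.log.append(value)               # log it for the duplicate run
        return value

    def start_replay(self):
        self.replaying = True
        self._cursor = 0


def checkpointed_piece(comm, x):
    """A piece of code containing one send and one receive."""
    comm.send(2.0 * x)
    y = comm.recv()
    return x * y


def run_example():
    channel = deque()
    comm = LoggingComm(channel)
    snapshot = 3.0               # input saved before the checkpointed piece
    # First run: real (here, loopback) communication; received values logged.
    first = checkpointed_piece(comm, snapshot)
    # Duplicate run from the snapshot: no communication takes place.
    comm.start_replay()
    second = checkpointed_piece(comm, snapshot)
    return first, second, len(channel)


if __name__ == "__main__":
    print(run_example())
```

In this sketch the duplicate run reproduces the result of the first run while leaving the channel untouched, which is the property the logging technique relies on. The price is the memory held by `log`, which reflects the extra cost the abstract attributes to message logging.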

https://hal.inria.fr/hal-01413394
Contributor : Laurent Hascoet
Submitted on : Friday, December 9, 2016 - 5:00:14 PM
Last modification on : Thursday, January 11, 2018 - 4:48:47 PM
Long-term archiving on : Tuesday, March 28, 2017 - 12:14:41 AM

Identifiers

  • HAL Id : hal-01413394, version 1

Citation

Ala Taftaf, Laurent Hascoët. On the correct application of AD checkpointing to adjoint MPI-parallel programs. VII European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2016), Jun 2016, Crete, Greece. ⟨hal-01413394⟩
