# On the correct application of AD checkpointing to adjoint MPI-parallel programs

Abstract : Checkpointing is a classical technique to mitigate the overhead of adjoint Algorithmic Differentiation (AD). In the context of source transformation AD with the Store-All approach, checkpointing reduces the peak memory consumption of the adjoint, at the cost of duplicate runs of selected pieces of the code. Checkpointing is vital for codes with long run times, which is the case for most MPI-parallel applications. However, the presence of MPI communications seriously restricts the application of checkpointing. In most attempts to apply checkpointing to adjoint MPI codes (the "popular" approach), a number of restrictions apply to the form of communications that occur in the checkpointed piece of code. In many works, these restrictions are not made explicit, and an application that does not respect them may produce erroneous code. We propose techniques to apply checkpointing to adjoint MPI codes that either do not assume these restrictions or make them explicit, so that end users can verify their applicability. These techniques rely both on adapting the snapshot mechanism of checkpointing and on modifying the behavior of communication calls. One technique is based on logging the values received, so that the duplicated communications need not take place. Although this technique completely lifts the restrictions on checkpointing MPI codes, message logging makes it more costly than the popular approach. However, we can refine this technique to blend message logging and communication duplication whenever possible, so that the refined technique encompasses the popular approach. We provide elements of a proof of correctness of our refined technique, i.e. that it preserves the semantics of the adjoint code and that it does not introduce deadlocks.
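As an illustration of the message-logging idea mentioned in the abstract, the following minimal Python sketch shows a receive operation that records values during the first run of a checkpointed piece of code and replays them during the duplicate run, so that no communication is repeated. It is not taken from the report: `LoggedChannel`, `start_replay`, and `checkpointed_piece` are hypothetical names, and a plain in-memory queue stands in for real MPI communication.

```python
from collections import deque


class LoggedChannel:
    """Hypothetical stand-in for an MPI receive endpoint (not a real MPI API)."""

    def __init__(self, incoming):
        self.incoming = deque(incoming)  # messages that "arrive" from a peer
        self.log = []                    # values received during the first run
        self.replay_pos = None           # None = recording, int = replaying

    def start_replay(self):
        """Switch to replay mode before the duplicate run."""
        self.replay_pos = 0

    def recv(self):
        if self.replay_pos is None:
            value = self.incoming.popleft()  # simulated communication
            self.log.append(value)           # log the received value
            return value
        value = self.log[self.replay_pos]    # read the log, no communication
        self.replay_pos += 1
        return value


def checkpointed_piece(channel, x):
    """Piece of code that is run twice: once forward, once before its adjoint."""
    a = channel.recv()
    b = channel.recv()
    return x * a + b


channel = LoggedChannel([2.0, 5.0])

# First (forward) run: messages are consumed and logged.
first = checkpointed_piece(channel, 3.0)
remaining_after_first = len(channel.incoming)

# Duplicate run: the same input is restored and receives are served from the log.
channel.start_replay()
second = checkpointed_piece(channel, 3.0)
```

In this sketch the duplicate run never touches the queue, which mirrors the claim that duplicated communications need not take place; the price is the memory held by `log`, which corresponds to the extra cost of message logging noted in the abstract.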
Document type : Reports

Cited literature [8 references]

https://hal.inria.fr/hal-01277449
Contributor : Ala Taftaf
Submitted on : Monday, February 22, 2016 - 3:00:59 PM
Last modification on : Tuesday, April 30, 2019 - 3:16:29 PM
Long-term archiving on : Sunday, November 13, 2016 - 12:48:06 AM

### File

RR-8864.pdf
Files produced by the author(s)

### Identifiers

• HAL Id : hal-01277449, version 1

### Citation

Laurent Hascoet, Ala Taftaf. On the correct application of AD checkpointing to adjoint MPI-parallel programs. [Research Report] RR-8864, Inria Sophia Antipolis. 2016. ⟨hal-01277449⟩
