On the correct application of AD checkpointing to adjoint MPI-parallel programs

Abstract : Checkpointing is a classical technique to mitigate the overhead of adjoint Algorithmic Differentiation (AD). In the context of source transformation AD with the Store-All approach, checkpointing reduces the peak memory consumption of the adjoint, at the cost of duplicate runs of selected pieces of the code. Checkpointing is vital for long-running codes, which include most MPI-parallel applications. However, the presence of MPI communications seriously restricts the application of checkpointing. In most attempts to apply checkpointing to adjoint MPI codes (the "popular" approach), a number of restrictions apply to the form of the communications that occur in the checkpointed piece of code. In many works, these restrictions are not made explicit, and an application that does not respect them may produce erroneous code. We propose techniques to apply checkpointing to adjoint MPI codes that either do not assume these restrictions or make them explicit, so that end users can verify their applicability. These techniques rely both on adapting the snapshot mechanism of checkpointing and on modifying the behavior of communication calls. One technique is based on logging the values received, so that the duplicated communications need not take place. Although this technique completely lifts the restrictions on checkpointing MPI codes, message logging makes it more costly than the popular approach. However, we can refine this technique to blend message logging and communication duplication wherever possible, so that the refined technique encompasses the popular approach. We provide elements of a proof of correctness of our refined technique, i.e. that it preserves the semantics of the adjoint code and that it does not introduce deadlocks.
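The message-logging idea in the abstract can be illustrated with a small sketch. It is not the report's implementation and uses no real MPI library: `Channel`, `LoggedComm`, `checkpointed_piece` and `run_demo` are hypothetical names, and the "communication" is a toy in-memory queue. It only shows the principle that the first run of a checkpointed piece logs the values it receives, while the duplicate run reads them back from the log and suppresses its sends, so the peer process does not have to take part again.

```python
from collections import deque


class Channel:
    """Toy stand-in for a one-way point-to-point channel (hypothetical, not MPI)."""

    def __init__(self):
        self.queue = deque()
        self.sends = 0  # counts how many messages were really sent

    def send(self, value):
        self.sends += 1
        self.queue.append(value)

    def recv(self):
        return self.queue.popleft()  # raises IndexError if nothing was sent


class LoggedComm:
    """Wraps an inbox and an outbox; logs received values for later replay."""

    def __init__(self, inbox, outbox):
        self.inbox = inbox
        self.outbox = outbox
        self.log = []
        self.replaying = False
        self.cursor = 0

    def recv(self):
        if self.replaying:
            # Duplicate run: read the logged value, no communication happens.
            value = self.log[self.cursor]
            self.cursor += 1
            return value
        value = self.inbox.recv()
        self.log.append(value)
        return value

    def send(self, value):
        # Duplicate run: the peer already got this message, so skip the send.
        if not self.replaying:
            self.outbox.send(value)

    def start_replay(self):
        self.replaying = True
        self.cursor = 0


def checkpointed_piece(x, comm):
    """Piece of code to be checkpointed: one receive, one send, local work."""
    y = comm.recv()
    comm.send(x * y)
    return x * y + 1.0


def run_demo():
    inbox, outbox = Channel(), Channel()
    inbox.send(3.0)                 # the peer sends exactly one message
    comm = LoggedComm(inbox, outbox)
    snapshot = 2.0                  # input state saved before the piece
    first = checkpointed_piece(snapshot, comm)   # first run, logs the receive
    comm.start_replay()
    second = checkpointed_piece(snapshot, comm)  # duplicate run from snapshot
    return first, second, outbox.sends
```

In this sketch the duplicate run succeeds even though the inbox is empty by then, which is the property that removes the restrictions on communications inside the checkpointed piece. The price is the memory held by `log`, which corresponds to the extra cost of message logging that the abstract mentions.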
Document type :
Reports

https://hal.inria.fr/hal-01277449
Contributor : Ala Taftaf
Submitted on : Monday, February 22, 2016 - 3:00:59 PM
Last modification on : Tuesday, April 30, 2019 - 3:16:29 PM
Long-term archiving on : Sunday, November 13, 2016 - 12:48:06 AM

File

RR-8864.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01277449, version 1

Citation

Laurent Hascoet, Ala Taftaf. On the correct application of AD checkpointing to adjoint MPI-parallel programs. [Research Report] RR-8864, Inria Sophia Antipolis. 2016. ⟨hal-01277449⟩
