Reports

When Amdahl Meets Young/Daly

Abstract : This paper investigates the optimal number of processors to execute a parallel job, whose speedup profile obeys Amdahl's law, on a large-scale platform subject to fail-stop and silent errors. We combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both error sources. We provide an exact formula to express the execution overhead incurred by a periodic checkpointing pattern of length $T$ and with $P$ processors, and we give first-order approximations for the optimal values $T^{*}$ and $P^{*}$ as a function of the individual processor failure rate $\lambda_{\mathrm{ind}}$. A striking result is that $P^{*}$ is of the order $\lambda_{\mathrm{ind}}^{-1/4}$ when the checkpointing cost grows linearly with the number of processors, and of the order $\lambda_{\mathrm{ind}}^{-1/3}$ when the checkpointing cost stays bounded for any $P$. We conduct an extensive set of simulations to support the theoretical study. The results confirm the accuracy of the first-order approximation under a wide range of parameter settings.
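For orientation, the two classical ingredients named in the title can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact overhead formula or the first-order optima derived in the report, which also account for silent errors and verification costs. The function names, the sequential fraction `alpha`, and the assumption that the platform failure rate is `p * lambda_ind` (independent, exponentially distributed fail-stop errors) are all introduced here for illustration only.

```python
import math


def amdahl_speedup(p, alpha):
    """Amdahl's law: speedup on p processors when a fraction alpha
    of the work is inherently sequential (alpha is an assumed name)."""
    return 1.0 / (alpha + (1.0 - alpha) / p)


def young_daly_period(checkpoint_cost, lambda_ind, p):
    """Classical Young/Daly first-order checkpointing period.

    Assumes fail-stop errors only and a platform failure rate of
    p * lambda_ind; the report's own T* additionally handles silent
    errors and verifications, which this sketch ignores.
    """
    platform_rate = p * lambda_ind
    return math.sqrt(2.0 * checkpoint_cost / platform_rate)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the functions.
    for p in (100, 1000, 10000):
        s = amdahl_speedup(p, alpha=0.01)
        t = young_daly_period(checkpoint_cost=60.0, lambda_ind=1e-6, p=p)
        print(f"P={p:>6}  speedup={s:8.2f}  period={t:10.2f} s")
```

The sketch shows the tension the abstract describes: adding processors raises the Amdahl speedup with diminishing returns, while the platform failure rate grows with $P$ and forces shorter checkpointing periods, so an optimal $P^{*}$ exists.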
Cited literature [38 references]

https://hal.inria.fr/hal-01280004
Contributor : Equipe Roma
Submitted on : Wednesday, July 6, 2016 - 5:46:13 PM
Last modification on : Wednesday, November 20, 2019 - 3:27:41 AM

File

RR-8871_update.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-01280004, version 4

Citation

Aurélien Cavelan, Jiafan Li, Yves Robert, Hongyang Sun. When Amdahl Meets Young/Daly. [Research Report] RR-8871, ENS Lyon, CNRS & INRIA. 2016. ⟨hal-01280004v4⟩
