Reservation and Checkpointing Strategies for Stochastic Jobs (Extended Version)

Ana Gainaru 1, Brice Goglin 2, Valentin Honoré 2, Guillaume Pallez 2, Padma Raghavan 1, Yves Robert 3, Hongyang Sun 1
2 TADAAM - Topology-Aware System-Scale Data Management for High-Performance Computing
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
3 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
Abstract: In this work, we are interested in scheduling and checkpointing stochastic jobs on a reservation-based platform. We assume that jobs can be interrupted at any time to take a checkpoint, and that job execution times follow a known probability distribution. The user has to determine a sequence of fixed-length reservation requests, and to decide whether to checkpoint the state of the execution, or not, at the end of each request. The execution of the job is successful only when it terminates within a request, otherwise it must be resubmitted, using the next request in the reservation sequence, and restarting execution from the last checkpointed state. The cost of each reservation depends on both its duration and on the actual utilization of the platform during that request, which includes a restart if some previous reservation was terminated with a checkpoint, and possibly a checkpoint at the end of the current request. The cost of a job is then the cumulated cost of all the reservations that were needed until its completion. Overall, the objective is to find a reservation sequence that minimizes the total expected cost to execute a job. We provide an optimal strategy for discrete probability distributions of job execution times, and we design fully polynomial-time approximation strategies for continuous distributions with bounded support. We experimentally evaluate these strategies for jobs following a wide range of usual probability distributions, as well as one distribution obtained from traces of a neuroscience application. We compare our strategies with standard approaches that use periodic-length reservations (the next reservation is longer than the previous one by a constant amount of time) and simple checkpointing strategies (either checkpoint all reservations, or none).
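To make the cost model in the abstract concrete, here is a minimal Python sketch that evaluates the expected cost of one given reservation sequence for a discrete distribution of execution times. It is an illustration only, not the report's algorithm. It assumes that each reservation costs `alpha * length + beta * used + gamma`, that checkpoint and restart times are constants (`ckpt`, `restart`), and that an unsuccessful reservation is charged for its full length. All function and parameter names are hypothetical and do not come from the report.

```python
from typing import List, Tuple


def job_cost(x: float, sequence: List[Tuple[float, bool]], ckpt: float,
             restart: float, alpha: float = 1.0, beta: float = 0.0,
             gamma: float = 0.0) -> float:
    """Cost of running a job of execution time x through a reservation sequence.

    sequence holds (reservation_length, checkpoint_at_end) pairs.
    The affine cost alpha*length + beta*used + gamma is an assumption.
    """
    saved = 0.0           # work preserved by the last checkpoint
    has_ckpt = False      # True once some reservation ended with a checkpoint
    total = 0.0
    for length, do_ckpt in sequence:
        restart_time = restart if has_ckpt else 0.0
        # Time left for actual computation in this reservation.
        compute = length - restart_time - (ckpt if do_ckpt else 0.0)
        if compute <= 0:
            raise ValueError("reservation too short for restart/checkpoint")
        if x <= saved + compute:
            # The job finishes here; only the time actually used counts as utilization.
            used = restart_time + (x - saved)
            return total + alpha * length + beta * used + gamma
        # The job does not finish: the whole reservation is consumed (assumption).
        total += alpha * length + beta * length + gamma
        if do_ckpt:
            saved += compute
            has_ckpt = True
        # Without a checkpoint, the work done in this reservation is lost.
    raise ValueError("sequence does not cover this execution time")


def expected_cost(dist: List[Tuple[float, float]],
                  sequence: List[Tuple[float, bool]], ckpt: float,
                  restart: float, alpha: float = 1.0, beta: float = 0.0,
                  gamma: float = 0.0) -> float:
    """Expected cost over a discrete distribution given as (value, probability) pairs."""
    return sum(p * job_cost(x, sequence, ckpt, restart, alpha, beta, gamma)
               for x, p in dist)


# Toy example: the job takes 2 or 6 time units with equal probability.
dist = [(2.0, 0.5), (6.0, 0.5)]
with_ckpt = expected_cost(dist, [(3.0, True), (6.0, False)], ckpt=1.0, restart=1.0)
no_ckpt = expected_cost(dist, [(2.0, False), (6.0, False)], ckpt=1.0, restart=1.0)
```

In this toy instance the sequence without checkpoints is cheaper, because the checkpoint and restart overheads outweigh the work they save. Other distributions and overheads can reverse that outcome, which is the trade-off the reservation strategies have to resolve.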

https://hal.inria.fr/hal-02328013
Contributor: Valentin Honoré
Submitted on: Friday, January 10, 2020 - 10:45:41 AM
Last modification on: Monday, November 16, 2020 - 9:56:04 AM
Long-term archiving on: Saturday, April 11, 2020 - 4:08:51 PM

File

research_report_hal_v2.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: hal-02328013, version 2

Citation

Ana Gainaru, Brice Goglin, Valentin Honoré, Guillaume Pallez, Padma Raghavan, et al. Reservation and Checkpointing Strategies for Stochastic Jobs (Extended Version). [Research Report] RR-9294, Inria & LaBRI, Univ. Bordeaux; Department of EECS, Vanderbilt University, Nashville, TN, USA; Laboratoire LIP, ENS Lyon & University of Tennessee Knoxville, Lyon, France. 2019. ⟨hal-02328013v2⟩
