Conference papers

CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters

Abstract : The increasing scale and complexity of scientific applications are rapidly transforming the ecosystem of tools, methods, and workflows adopted by the high-performance computing (HPC) community. Big data analytics and deep learning are gaining traction as essential components of this ecosystem in a variety of scenarios, such as steering of experimental instruments, acceleration of high-fidelity simulations through surrogate computations, and guided ensemble searches. In this context, the batch job model traditionally adopted by supercomputing infrastructures needs to be complemented with support for scheduling opportunistic on-demand analytics jobs, which raises the problem of preempting batch jobs efficiently with minimum loss of progress. In this paper, we design and implement a simulator, CoSim, that enables on-the-fly analysis of the trade-off between delaying the start of opportunistic on-demand jobs, which leads to longer analytics latency, and losing progress due to the preemption of batch jobs, which is necessary to make room for on-demand jobs. To this end, we propose an algorithm based on dynamic programming with predictable performance and scalability that enables supercomputing infrastructure schedulers to analyze this trade-off and make decisions in near real-time. Compared with other state-of-the-art approaches on traces of the Theta pre-Exascale machine, our approach is capable of finding the optimal solution while achieving high performance and scalability.
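The abstract does not spell out the dynamic programming formulation, so the following is only an illustrative sketch of the preemption side of the trade-off, not the algorithm from the paper. It assumes a simplified model in which each running batch job occupies a number of nodes and has a known loss of progress if preempted, and an on-demand job needs a given number of nodes freed. The function name `min_loss_preemption`, the `(nodes, loss)` job representation, and the sample values are all hypothetical.

```python
from typing import List, Tuple


def min_loss_preemption(
    jobs: List[Tuple[int, float]], needed: int
) -> Tuple[float, List[int]]:
    """Pick batch jobs to preempt so that at least `needed` nodes are freed
    with minimum total loss of progress (a covering-knapsack DP).

    jobs   -- hypothetical (nodes, loss) pairs, one per running batch job
    needed -- number of nodes the on-demand job still requires
    Returns (total loss, indices of jobs to preempt), or (inf, []) if the
    running jobs cannot free enough nodes.
    """
    if needed <= 0:
        return 0.0, []

    inf = float("inf")
    # best[k] holds the cheapest way found so far to free at least k nodes,
    # with k capped at `needed` (freeing more than needed is equally good).
    best: List[Tuple[float, List[int]]] = [(inf, [])] * (needed + 1)
    best[0] = (0.0, [])

    for idx, (nodes, loss) in enumerate(jobs):
        # Walk k downward so each job is preempted at most once.
        for k in range(needed, -1, -1):
            cur_loss, chosen = best[k]
            if cur_loss == inf:
                continue
            target = min(needed, k + nodes)
            if cur_loss + loss < best[target][0]:
                best[target] = (cur_loss + loss, chosen + [idx])

    return best[needed]


# Hypothetical example: four running batch jobs, five nodes required.
sample_jobs = [(4, 10.0), (2, 3.0), (3, 4.0), (1, 1.0)]
sample_loss, sample_choice = min_loss_preemption(sample_jobs, 5)
```

Under these assumptions the table has `needed + 1` entries and each job is visited once per entry, so the number of update steps is proportional to the number of jobs times the number of nodes required. A full scheduler would also have to weigh this loss against the latency cost of delaying the on-demand job, which this sketch leaves out.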

Contributor : Bogdan Nicolae
Submitted on : Friday, August 28, 2020 - 9:13:07 PM
Last modification on : Wednesday, November 3, 2021 - 7:06:32 AM
Long-term archiving on : Sunday, November 29, 2020 - 12:58:09 PM




  • HAL Id : hal-02925237, version 1


Avinash Maurya, Bogdan Nicolae, Ishan Guliani, M Mustafa Rafique. CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters. DS-RT'20: The 24th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, Sep 2020, Prague, Czech Republic. pp.167-174. ⟨hal-02925237⟩


