Topology-aware resource management for HPC applications

Abstract: The Resource and Job Management System (RJMS) is a crucial piece of system software in the HPC stack. It is responsible for efficiently delivering computing power to applications in supercomputing environments. Its main intelligence lies in resource selection techniques that find the most suitable resources on which to schedule users' jobs. Improper resource selection may lead to poor application performance and low global system utilization, along with increased system fragmentation and job starvation. These phenomena contribute to a higher total cost of ownership for the platform and should be minimized. This paper introduces a new method that takes into account both the topology of the machine and the application's characteristics to determine the best choice among the available nodes of the platform, based on their position within the network and on the application's communication pattern. To validate our approach, we integrate this algorithm as a plugin for Slurm, a popular and widespread HPC resource and job management system. We assess our plugin with different optimization schemes by comparing it with the default topology-aware Slurm algorithm, using both emulation and simulation of a large-scale platform, and by carrying out experiments on a real cluster. We show that transparently taking into account the job communication pattern and the topology allows for relevant performance gains.
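To make the idea of communication-aware node selection concrete, the following is a minimal sketch and not the algorithm from the paper or Slurm's own logic. It assumes a hypothetical two-level tree (the node and switch names, the `hops` distance, and the exhaustive search are all illustrative choices) and picks the assignment of processes to free nodes that minimizes traffic volume weighted by hop distance.

```python
from itertools import permutations

# Hypothetical two-level tree: each compute node hangs off one leaf switch,
# and all leaf switches hang off a single root switch (assumed topology).
NODE_TO_SWITCH = {"n0": "s0", "n1": "s0", "n2": "s1", "n3": "s1"}


def hops(a, b):
    """Number of network links between two nodes in the toy tree."""
    if a == b:
        return 0
    # Same leaf switch: node -> switch -> node. Otherwise go through the root.
    return 2 if NODE_TO_SWITCH[a] == NODE_TO_SWITCH[b] else 4


def placement_cost(placement, comm):
    """Sum of traffic volume times hop distance over all process pairs."""
    n = len(placement)
    return sum(
        comm[i][j] * hops(placement[i], placement[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def best_placement(free_nodes, comm):
    """Exhaustively try every assignment of processes to free nodes.

    Brute force is only viable for tiny jobs; it stands in here for
    whatever optimization scheme a real plugin would use.
    """
    n = len(comm)
    return min(permutations(free_nodes, n), key=lambda p: placement_cost(p, comm))


# Symmetric communication matrix for a 3-process job (made-up volumes):
# processes 0 and 2 exchange far more data than any other pair.
comm = [
    [0, 1, 10],
    [1, 0, 1],
    [10, 1, 0],
]
free = ["n0", "n2", "n3"]  # n1 is assumed to be busy with another job

best = best_placement(free, comm)
print(best, placement_cost(best, comm))
```

In this toy setting the two heavily communicating processes end up under the same leaf switch, while the lightly communicating one takes the remaining node on the other switch.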
Document type:
Conference paper
ICDCN 2017, Jan 2017, Hyderabad, India. 〈10.1145/3007748.3007768〉

Cited literature: 24 references
Contributor: Adèle Villiermet
Submitted on: Tuesday, December 13, 2016 - 18:05:39
Last modified on: Thursday, January 11, 2018 - 06:27:21





Yiannis Georgiou, Emmanuel Jeannot, Guillaume Mercier, Adèle Villiermet. Topology-aware resource management for HPC applications. ICDCN 2017, Jan 2017, Hyderabad, India. 〈10.1145/3007748.3007768〉. 〈hal-01414196〉


