Performance Analysis and Optimality Results for Data-Locality Aware Tasks Scheduling with Replicated Inputs

Olivier Beaumont 1, 2 Thomas Lambert 3, 4 Loris Marchal 5, 6, 7 Bastien Thomas 8
2 HiePACS - High-End Parallel Algorithms for Challenging Numerical Simulations
LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest
4 STACK - Software Stack for Massively Geo-Distributed Infrastructures
Inria Rennes – Bretagne Atlantique , LS2N - Laboratoire des Sciences du Numérique de Nantes
5 ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
Inria Grenoble - Rhône-Alpes, LIP - Laboratoire de l'Informatique du Parallélisme
8 SUMO - SUpervision of large MOdular and distributed systems
Inria Rennes – Bretagne Atlantique , IRISA-D4 - LANGAGE ET GÉNIE LOGICIEL
Abstract : Replication of data files, as automatically performed by Distributed File Systems such as HDFS, is known to have a crucial impact on data locality in addition to system fault tolerance. Indeed, intuitively, having more replicas of the same input file gives more opportunities for this task to be processed locally, i.e. without any input file transfer. Given the practical importance of this problem, a vast literature has been proposed to schedule tasks, based on a random placement of replicated input files. Our goal in this paper is to study the performance of these algorithms, both in terms of makespan minimization (minimize the completion time of the last task when non-local processing is forbidden) and communication minimization (minimize the number of non-local tasks when no idle time on resources is allowed). In the case of homogenous tasks, we are able to prove, using models based on ”balls into bins” and ”power of two choices” problems, that the well known good behavior of classical strategies can be theoretically grounded. Going further, we even establish that it is possible, using semi-matchings theory, to find the optimal solution in very small time. We also use known graph-orientation results to prove that this optimal solution is indeed near-perfect with strong probability. In the more general case of heterogeneous tasks, we propose heuristics solutions both in the clairvoyant and non-clairvoyant cases (i.e. task length is known in advance or not), and we evaluate them through simulations, using actual traces of a Hadoop cluster.
Complete list of metadatas

Cited literature [32 references]  Display  Hide  Download

https://hal.inria.fr/hal-02275473
Contributor : Thomas Lambert <>
Submitted on : Tuesday, September 3, 2019 - 10:08:07 AM
Last modification on : Thursday, October 3, 2019 - 2:29:59 PM

Identifiers

Citation

Olivier Beaumont, Thomas Lambert, Loris Marchal, Bastien Thomas. Performance Analysis and Optimality Results for Data-Locality Aware Tasks Scheduling with Replicated Inputs. Future Generation Computer Systems, Elsevier, 2019, pp.1-28. ⟨10.1016/j.future.2019.08.024⟩. ⟨hal-02275473⟩

Share

Metrics

Record views

72

Files downloads

533