Performance Analysis and Optimality Results for Data-Locality Aware Tasks Scheduling with Replicated Inputs - Archive ouverte HAL Access content directly
Journal Articles Future Generation Computer Systems Year : 2020

Performance Analysis and Optimality Results for Data-Locality Aware Tasks Scheduling with Replicated Inputs

(1, 2) , (3, 4) , (5, 6, 7) , (8)
1
2
3
4
5
6
7
8

Abstract

Replication of data files, as automatically performed by Distributed File Systems such as HDFS, is known to have a crucial impact on data locality in addition to system fault tolerance. Indeed, intuitively, having more replicas of the same input file gives more opportunities for this task to be processed locally, i.e. without any input file transfer. Given the practical importance of this problem, a vast literature has been proposed to schedule tasks, based on a random placement of replicated input files. Our goal in this paper is to study the performance of these algorithms, both in terms of makespan minimization (minimize the completion time of the last task when non-local processing is forbidden) and communication minimization (minimize the number of non-local tasks when no idle time on resources is allowed). In the case of homogenous tasks, we are able to prove, using models based on ”balls into bins” and ”power of two choices” problems, that the well known good behavior of classical strategies can be theoretically grounded. Going further, we even establish that it is possible, using semi-matchings theory, to find the optimal solution in very small time. We also use known graph-orientation results to prove that this optimal solution is indeed near-perfect with strong probability. In the more general case of heterogeneous tasks, we propose heuristics solutions both in the clairvoyant and non-clairvoyant cases (i.e. task length is known in advance or not), and we evaluate them through simulations, using actual traces of a Hadoop cluster.
Fichier principal
Vignette du fichier
paper_revision (1).pdf (1.24 Mo) Télécharger le fichier
Loading...

Dates and versions

hal-02275473 , version 1 (03-09-2019)

Identifiers

Cite

Olivier Beaumont, Thomas Lambert, Loris Marchal, Bastien Thomas. Performance Analysis and Optimality Results for Data-Locality Aware Tasks Scheduling with Replicated Inputs. Future Generation Computer Systems, 2020, 111, pp.582-598. ⟨10.1016/j.future.2019.08.024⟩. ⟨hal-02275473⟩
170 View
312 Download

Altmetric

Share

Gmail Facebook Twitter LinkedIn More