Skip to Main content Skip to Navigation
Journal articles

Performance Analysis and Optimality Results for Data-Locality Aware Tasks Scheduling with Replicated Inputs

Abstract : Replication of data files, as automatically performed by Distributed File Systems such as HDFS, is known to have a crucial impact on data locality in addition to system fault tolerance. Indeed, intuitively, having more replicas of the same input file gives more opportunities for this task to be processed locally, i.e. without any input file transfer. Given the practical importance of this problem, a vast literature has been proposed to schedule tasks, based on a random placement of replicated input files. Our goal in this paper is to study the performance of these algorithms, both in terms of makespan minimization (minimize the completion time of the last task when non-local processing is forbidden) and communication minimization (minimize the number of non-local tasks when no idle time on resources is allowed). In the case of homogenous tasks, we are able to prove, using models based on ”balls into bins” and ”power of two choices” problems, that the well known good behavior of classical strategies can be theoretically grounded. Going further, we even establish that it is possible, using semi-matchings theory, to find the optimal solution in very small time. We also use known graph-orientation results to prove that this optimal solution is indeed near-perfect with strong probability. In the more general case of heterogeneous tasks, we propose heuristics solutions both in the clairvoyant and non-clairvoyant cases (i.e. task length is known in advance or not), and we evaluate them through simulations, using actual traces of a Hadoop cluster.
Complete list of metadatas

Cited literature [32 references]  Display  Hide  Download
Contributor : Thomas Lambert <>
Submitted on : Tuesday, September 3, 2019 - 10:08:07 AM
Last modification on : Wednesday, June 24, 2020 - 4:19:53 PM
Document(s) archivé(s) le : Thursday, January 9, 2020 - 4:13:07 AM



Olivier Beaumont, Thomas Lambert, Loris Marchal, Bastien Thomas. Performance Analysis and Optimality Results for Data-Locality Aware Tasks Scheduling with Replicated Inputs. Future Generation Computer Systems, Elsevier, 2019, pp.1-28. ⟨10.1016/j.future.2019.08.024⟩. ⟨hal-02275473⟩



Record views


Files downloads