Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics - Inria - Institut national de recherche en sciences et technologies du numérique Accéder directement au contenu
Article Dans Une Revue IEEE Transactions on Parallel and Distributed Systems Année : 2017

Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics

Résumé

Big data analytics is an indispensable tool in transforming science, engineering, medicine, health-care, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations that has a major impact on the overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for the purpose of prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses the two aforementioned dimensions by dynamically adapting the prefetching to the computation. We implemented this novel strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we run extensive experiments on an HPC cluster with large core count per node. Compared with the default Spark shuffle strategy, our proposal shows: up to 40% better performance with 50% less memory utilization for buffering and excellent weak scalability.
Fichier principal
Vignette du fichier
tpds.pdf (2.66 Mo) Télécharger le fichier
Origine : Fichiers produits par l'(les) auteur(s)
Loading...

Dates et versions

hal-01531374 , version 1 (01-06-2017)

Identifiants

Citer

Bogdan Nicolae, Carlos Costa, Claudia Misale, Kostas Katrinis, Yoonho Park. Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics. IEEE Transactions on Parallel and Distributed Systems, 2017, 28 (6), pp.1663 - 1674. ⟨10.1109/TPDS.2016.2627558⟩. ⟨hal-01531374⟩
132 Consultations
504 Téléchargements

Altmetric

Partager

Gmail Facebook X LinkedIn More