Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics

Abstract: Big data analytics is an indispensable tool for transforming science, engineering, medicine, healthcare, finance, and ultimately business itself. With the explosion of data sizes and the need for shorter time-to-solution, in-memory platforms such as Apache Spark are gaining popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations and has a major impact on overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data to fetch and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses both issues by dynamically adapting the prefetching to the computation. We implemented this strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we ran extensive experiments on an HPC cluster with a large core count per node. Compared with the default Spark shuffle strategy, our proposal shows up to 40% better performance with 50% less memory utilization for buffering, as well as excellent weak scalability.
Document type:
Journal article
IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers, 2017, 28 (6), pp.1663 - 1674. 〈10.1109/TPDS.2016.2627558〉

https://hal.inria.fr/hal-01531374
Contributor: Bogdan Nicolae
Submitted on: Thursday, June 1, 2017 - 16:01:24
Last modified on: Thursday, June 1, 2017 - 16:29:28
Document(s) archived on: Wednesday, September 6, 2017 - 19:11:43

File

tpds.pdf
Files produced by the author(s)

Citation

Bogdan Nicolae, Carlos Costa, Claudia Misale, Kostas Katrinis, Yoonho Park. Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics. IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers, 2017, 28 (6), pp.1663 - 1674. 〈10.1109/TPDS.2016.2627558〉. 〈hal-01531374〉
