Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks

Ovidiu-Cristian Marcu; Alexandru Costan; Gabriel Antoniu; María S. Pérez-Hernández

doi:10.1109/cluster.2016.22

Conference paper (Year: 2016)

Ovidiu-Cristian Marcu

Role: Author
PersonId: 986445

Scalable Storage for Clouds and Beyond

Alexandru Costan

Role: Author
PersonId: 9361
IdHAL: alexandru-costan
ORCID: 0000-0003-3111-6308
IdRef: 220478279

Institut National des Sciences Appliquées - Rennes

Scalable Storage for Clouds and Beyond

Gabriel Antoniu

Role: Author
PersonId: 746326
IdHAL: gabriel-antoniu
ORCID: 0000-0001-6525-3736
IdRef: 095615296

Scalable Storage for Clouds and Beyond

María S. Pérez-Hernández

Role: Author

Universidad Politécnica de Madrid

Abstract

Big Data analytics has recently gained popularity as a tool to process large amounts of data on demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph patterns. Making the most of these frameworks is challenging because efficient execution relies strongly on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most studies benchmark the platforms against Hadoop as a baseline, a rather unfair comparison considering the fundamentally different design principles. This paper aims to redress this by directly evaluating the performance of Spark and Flink against each other. Our goal is to identify and explain the impact of the different architectural choices and parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operators' execution plan with the resource usage. We use this methodology to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that neither of the two frameworks outperforms the other for all data types, sizes and job patterns. We provide a fine-grained characterization of the cases in which each framework is superior, and we highlight how this performance correlates with the operators, with resource usage and with the specifics of the internal framework design.

Keywords

Spark; Big Data; performance evaluation; Flink

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

clusterFS.pdf (782.38 KB)

Origin: Files produced by the author(s)

Contributor: Ovidiu-Cristian Marcu

https://inria.hal.science/hal-01347638

Submitted on: Saturday, August 6, 2016, 16:42:15

Last modified on: Friday, March 24, 2023, 14:53:02

Dates and versions

hal-01347638, version 1 (21-07-2016)

hal-01347638, version 2 (06-08-2016)

Identifiers

HAL Id: hal-01347638, version 2
DOI: 10.1109/cluster.2016.22

Cite

Ovidiu-Cristian Marcu, Alexandru Costan, Gabriel Antoniu, María S. Pérez-Hernández. Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks. Cluster 2016 - The IEEE 2016 International Conference on Cluster Computing, Sep 2016, Taipei, Taiwan. ⟨10.1109/cluster.2016.22⟩. ⟨hal-01347638v2⟩

Collections

INSERM INSTITUT-TELECOM UNIV-RENNES1 CNRS INRIA INSA-RENNES IRISA GRID5000 IRISA-INSA-R CENTRALESUPELEC IRISA-D1 INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES INSA-GROUPE SILECS ANR UR1-MATH-NUM

1300 views

15869 downloads
