OverFlow: Multi-Site Aware Big Data Management for Scientific Workflows on Clouds

Radu Tudoran (1, 2), Alexandru Costan (2), Gabriel Antoniu (2)
(2) KerData - Scalable Storage for Clouds and Beyond
Inria Rennes – Bretagne Atlantique, IRISA-D1 - Large Scale Systems
Abstract: The global deployment of cloud datacenters is enabling large-scale scientific workflows to improve performance and deliver fast responses. This unprecedented geographical distribution of the computation is accompanied by an increase in the scale of the data handled by such applications, bringing new challenges related to efficient data management across sites. High throughput, low latency and cost-related trade-offs are just a few of the concerns for both cloud providers and users when it comes to handling data across datacenters. Existing solutions are limited to cloud-provided storage, which offers low performance based on rigid cost schemes. In turn, workflow engines need to improvise substitutes, achieving performance at the cost of complex system configurations, maintenance overheads, reduced reliability and reusability. In this paper, we introduce OverFlow, a uniform data management system for scientific workflows running across geographically distributed sites, aiming to reap economic benefits from this geo-diversity. Our solution is environment-aware, as it monitors and models the global cloud infrastructure, offering high and predictable data handling performance for transfer cost and time, within and across sites. OverFlow proposes a set of pluggable services, grouped in a data scientist cloud kit. They enable applications to monitor the underlying infrastructure, to exploit smart data compression, deduplication and geo-replication, to evaluate data management costs, to set a trade-off between money and time, and to optimize the transfer strategy accordingly. The system was validated on the Microsoft Azure cloud across its 6 EU and US datacenters. The experiments were conducted on hundreds of nodes using synthetic benchmarks and real-life bioinformatics applications (A-Brain, BLAST). The results show that our system is able to accurately model cloud performance and to leverage this for efficient data dissemination, reducing monetary costs and transfer time by up to a factor of 3.
Document type: Journal article
IEEE Transactions on Cloud Computing, 2016, 〈10.1109/TCC.2015.2440254〉
Cited literature: 27 references

https://hal.inria.fr/hal-01239128
Contributor: Alexandru Costan
Submitted on: Monday, 7 December 2015 - 14:25:44
Last modified on: Friday, 18 May 2018 - 13:38:02
Document(s) archived on: Saturday, 29 April 2017 - 09:57:04

File

bare_jrnl_compsoc.pdf
Files produced by the author(s)

License

Public domain

Citation

Radu Tudoran, Alexandru Costan, Gabriel Antoniu. OverFlow: Multi-Site Aware Big Data Management for Scientific Workflows on Clouds. IEEE Transactions on Cloud Computing, 2016, 〈10.1109/TCC.2015.2440254〉. 〈hal-01239128〉

Metrics

Record views: 1118
File downloads: 746