Optimizing Intermediate Data Management in MapReduce Computations - Inria - Institut national de recherche en sciences et technologies du numérique
Conference Papers Year : 2011

Optimizing Intermediate Data Management in MapReduce Computations

Abstract

Many cloud computations process large datasets. Programming paradigms have been proposed for designing this type of application, so as to take advantage of the vast processing and storage capacity the cloud offers while providing the user with a clean and easy-to-use interface. Among these programming models, we consider the MapReduce paradigm and its reference implementation, the Hadoop framework. We focus on intermediate data, that is, data produced and transferred between the two stages of the computation (map and reduce). The goal of this paper is to propose a storage mechanism for intermediate data that optimizes the execution of MapReduce applications in the presence of failures while keeping the impact on job completion time to a minimum. To meet this goal, we rely on a fault-tolerant, concurrency-optimized data storage layer based on the BlobSeer data management service. We modify the Hadoop MapReduce framework to store the intermediate data in this layer (acting as a BlobSeer-based distributed file system) rather than on the local storage of the mappers, as in the vanilla version of Hadoop. To validate this work, we perform experiments on a large number of nodes of the Grid'5000 testbed. We demonstrate that our approach not only provides intermediate data availability in case of failures, but also handles read/write accesses efficiently, so that the overall job completion time is substantially improved.
Main file
main.pdf (879.41 KB)
Origin : Files produced by the author(s)

Dates and versions

inria-00574351 , version 1 (01-04-2011)

Identifiers

  • HAL Id : inria-00574351 , version 1

Cite

Diana Moise, Thi-Thu-Lan Trieu, Gabriel Antoniu, Luc Bougé. Optimizing Intermediate Data Management in MapReduce Computations. CloudCP 2011 -- 1st International Workshop on Cloud Computing Platforms, Held in conjunction with the ACM SIGOPS Eurosys 11 conference, Apr 2011, Salzburg, Austria. ⟨inria-00574351⟩