
Optimizing Intermediate Data Management in MapReduce Computations

Abstract : Many cloud computations process large datasets. Programming paradigms have been proposed for designing this type of application, so as to take advantage of the vast processing and storage capacity the cloud offers while providing the user with a clean, easy-to-use interface. Among these programming models, we consider the MapReduce paradigm and its reference implementation, the Hadoop framework. We focus on intermediate data, that is, the data produced and transferred between the two stages of the computation (map and reduce). The goal of this paper is to propose a storage mechanism for intermediate data that optimizes the execution of MapReduce applications in the presence of failures while keeping the impact on job completion time to a minimum. To meet this goal, we rely on a fault-tolerant, concurrency-optimized data storage layer based on the BlobSeer data management service. We modify the Hadoop MapReduce framework to store the intermediate data in this layer (acting as a BlobSeer-based distributed file system) rather than on the local storage of the mappers, as in the vanilla version of Hadoop. To validate this work, we perform experiments on a large number of nodes of the Grid'5000 testbed. We demonstrate that our approach not only ensures intermediate data availability in case of failures, but also handles read/write accesses efficiently, so that the overall job completion time is substantially improved.
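
The difference between node-local and shared intermediate storage described in the abstract can be illustrated with a toy word-count job. The following is only a minimal sketch under stated assumptions: it uses neither the Hadoop nor the BlobSeer API, and every name in it (`IntermediateStore`, `run_job`, `map_words`, the `node0`-style node labels) is hypothetical. It simply counts how many map executions are needed when the node holding a map output fails before the reduce phase.

```python
from collections import defaultdict


class IntermediateStore:
    """Hypothetical store for map outputs, keyed by map task id.

    survives_node_failure=False models node-local storage;
    survives_node_failure=True models a shared, fault-tolerant layer.
    """

    def __init__(self, survives_node_failure):
        self.survives_node_failure = survives_node_failure
        self._data = {}  # task_id -> (node, list of (key, value) pairs)

    def write(self, task_id, node, pairs):
        self._data[task_id] = (node, list(pairs))

    def fail_node(self, node):
        # Node-local storage loses everything the failed node wrote.
        if not self.survives_node_failure:
            self._data = {
                task: entry
                for task, entry in self._data.items()
                if entry[0] != node
            }

    def read(self, task_id):
        entry = self._data.get(task_id)
        return None if entry is None else entry[1]


def map_words(line):
    """Map function: emit (word, 1) for every word in the input split."""
    return [(word, 1) for word in line.split()]


def run_job(splits, store, failed_node=None):
    """Run word count; return (counts, number of map executions)."""
    map_runs = 0
    # Map phase: task i runs on hypothetical node "node<i>".
    for task_id, line in enumerate(splits):
        store.write(task_id, f"node{task_id}", map_words(line))
        map_runs += 1

    # Optional failure between the map and reduce phases.
    if failed_node is not None:
        store.fail_node(failed_node)

    # Reduce phase: fetch each map output, re-running the map if it was lost.
    counts = defaultdict(int)
    for task_id, line in enumerate(splits):
        pairs = store.read(task_id)
        if pairs is None:
            pairs = map_words(line)
            map_runs += 1
        for key, value in pairs:
            counts[key] += value
    return dict(counts), map_runs


splits = ["a b a", "b c"]
local_counts, local_runs = run_job(
    splits, IntermediateStore(survives_node_failure=False), failed_node="node0"
)
shared_counts, shared_runs = run_job(
    splits, IntermediateStore(survives_node_failure=True), failed_node="node0"
)
```

In this sketch both runs produce the same word counts, but the node-local variant has to execute one map task a second time, whereas the shared variant does not. It says nothing about the read/write performance aspects the paper evaluates.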
Document type : Conference papers

https://hal.inria.fr/inria-00574351
Contributor : Diana Moise
Submitted on : Friday, April 1, 2011 - 5:47:25 PM
Last modification on : Friday, July 10, 2020 - 4:02:48 PM
Long-term archiving on : Saturday, July 2, 2011 - 2:28:31 AM

File

main.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : inria-00574351, version 1

Citation

Diana Moise, Thi-Thu-Lan Trieu, Gabriel Antoniu, Luc Bougé. Optimizing Intermediate Data Management in MapReduce Computations. CloudCP 2011 -- 1st International Workshop on Cloud Computing Platforms, Held in conjunction with the ACM SIGOPS Eurosys 11 conference, Apr 2011, Salzburg, Austria. ⟨inria-00574351⟩
