
Data Preparation as a Service Based on Apache Spark

Abstract: Data preparation is the process of collecting, cleaning, and consolidating raw datasets into cleaned data of a certain quality. It is an important part of almost every data analysis process, yet it remains tedious and time-consuming. The recent tendency to derive knowledge from very large datasets further increases the complexity of the process. Existing data preparation tools provide limited capabilities for processing such large volumes of data effectively. On the other hand, frameworks and software libraries that do address the requirements of big data require expert knowledge in various technical areas. In this paper, we propose a dynamic, service-based, scalable data preparation approach that aims to solve the challenges of data preparation at large scale while retaining the accessibility and flexibility provided by data preparation tools. Furthermore, we describe its implementation and its integration with Grafterizer, an existing framework for data preparation. Our solution is based on Apache Spark and exposes application programming interfaces (APIs) for integration with external tools. Finally, we present experimental results that demonstrate the improvements to the scalability of Grafterizer.
Document type: Conference papers

Cited literature [23 references]

https://hal.inria.fr/hal-01677626
Contributor: Hal Ifip
Submitted on: Monday, January 8, 2018 - 3:01:36 PM
Last modification on: Thursday, December 3, 2020 - 9:26:02 AM
Long-term archiving on: Thursday, May 3, 2018 - 4:33:42 PM

File

449571_1_En_10_Chapter.pdf
Files produced by the author(s)

Licence


Distributed under a Creative Commons Attribution 4.0 International License

Citation

Nivethika Mahasivam, Nikolay Nikolov, Dina Sukhobok, Dumitru Roman. Data Preparation as a Service Based on Apache Spark. 6th European Conference on Service-Oriented and Cloud Computing (ESOCC), Sep 2017, Oslo, Norway. pp.125-139, ⟨10.1007/978-3-319-67262-5_10⟩. ⟨hal-01677626⟩
