SDAC: Porting Scientific Data to Spark RDDs

Abstract. Scientific data processing has exposed a range of technical problems in industrial exploration and domain-specific applications because of its huge input volume and the diversity of its data formats. Meanwhile, Big Data analytic frameworks such as Hadoop and Spark lack native support for processing increasingly heterogeneous scientific data efficiently. In this paper, we introduce our work named SDAC (Scientific Data Auto Chunk), which ports various scientific data formats to RDDs to support parallel processing and analytics in the Apache Spark framework. By integrating an auto-chunk method for specifying task granularity, a better-planned theoretical pipeline can be derived to guide data partitioning and parallel I/O. We compare performance with H5Spark on 6 benchmarks in both standalone and distributed mode. Experimental results show that the SDAC module achieves an overall improvement of 2.1 times over H5Spark in standalone mode and 1.34 times in distributed mode.


Introduction
Science is increasingly becoming data-driven [1]. With the exponential growth of the data volume generated by scientific instruments and computer simulations, storage capacity, processing efficiency and analytical accuracy are becoming critical challenges. To address these issues, Big Data frameworks such as Hadoop and Spark have been considered. With the seamless integration of the MapReduce programming paradigm, rapidly manipulating large amounts of data in parallel becomes feasible. Meanwhile, a range of scientific data formats such as HDF5 (Hierarchical Data Format 5) [2] and NetCDF (Network Common Data Form) [3] have been put forward with the similar purpose of solving high-volume data storage and platform-independent processing, and they have proven themselves in domain-specific study and analytics. However, when parallel frameworks such as Spark are used to process scientific data, some technical shortcomings are exposed, such as the lack of methods to specify semantic indexing delimiters. As noted in [1], "Scientists need a way to use intelligent indices and data organizations to subset the search".
In this paper, we introduce our module SDAC (Scientific Data Auto Chunk) to bridge the gap between scientific data and Spark RDDs. To better fit the MapReduce paradigm in Spark, we propose an auto-chunk algorithm that improves the level of data parallelism by partitioning the data layout into predefined chunks. SDAC is available at http://github.com/TYoung1221/SDAC.

Overview of SDAC
In this paper, we design and implement a module named SDAC (Scientific Data Auto Chunk) to enable the processing of various scientific data formats atop the Spark framework. The architecture of the SDAC module is illustrated in Figure 1. To seamlessly integrate scientific data with Spark RDDs, we propose 3 components that implement the porting process:
1. SD Identifier: Recognizes the scientific data format and maps it to the corresponding read method.
2. Access Selector: Decides the data access strategy, either processing the file as a whole or processing it in parallel for better performance.
3. RDDs Generator: Implements the bridge that ports scientific data to RDDs by first parallelizing a collection covering the total number of chunks from the auto-chunk output.
This porting process is shown in Figure 1 (b). Note that the input scientific data has 3815 rows and 9000 columns and is in SGY format.
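The first two components can be illustrated with a minimal sketch. It is not SDAC's implementation: the extension table, the function names `identify_format` and `select_access`, and the 64 MiB threshold are all assumptions made for illustration, since the text does not say how the format is detected or how the access strategy is chosen.

```python
import os

# Hypothetical extension-to-format table. SDAC's real detection logic is not
# described in the text and may inspect file headers instead.
FORMAT_BY_EXTENSION = {
    ".h5": "HDF5",
    ".hdf5": "HDF5",
    ".nc": "NetCDF",
    ".bp": "ADIOS",
    ".sgy": "SGY",
    ".segy": "SGY",
    ".fits": "FITS",
}


def identify_format(path):
    """SD Identifier step: map a file name to a format label."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in FORMAT_BY_EXTENSION:
        raise ValueError("unsupported scientific data format: " + path)
    return FORMAT_BY_EXTENSION[extension]


def select_access(size_bytes, n_workers, threshold_bytes=64 * 1024 * 1024):
    """Access Selector step: return 'whole' or 'parallel'.

    The size threshold is an assumed heuristic, not a value from the paper:
    small inputs, or a single worker, are read in one piece.
    """
    if n_workers <= 1 or size_bytes < threshold_bytes:
        return "whole"
    return "parallel"
```

A format label returned by `identify_format` would then select the matching reader, and `select_access` would decide whether the auto-chunk step is needed at all.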
In SDAC, we follow the standard Spark way of generating RDDs by parallelizing the total number of chunks calculated by auto-chunk. The generated RDDs are then mapped to the corresponding chunks to generate sub-RDDs for distribution among the workers.
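The idea of parallelizing chunk numbers and then mapping each number to its chunk can be sketched locally without a Spark cluster. The helper names below (`chunk_bounds`, `read_chunk`, `port_to_chunks`) are hypothetical, the data is an in-memory list of rows standing in for a file, and chunking by whole rows is an assumption made to keep the example short.

```python
def chunk_bounds(chunk_id, n_rows, rows_per_chunk):
    """Return the [start, stop) row range covered by one chunk."""
    start = chunk_id * rows_per_chunk
    stop = min(start + rows_per_chunk, n_rows)
    return start, stop


def read_chunk(data, chunk_id, rows_per_chunk):
    """Stand-in for a format-specific reader: slice rows out of a table."""
    start, stop = chunk_bounds(chunk_id, len(data), rows_per_chunk)
    return data[start:stop]


def port_to_chunks(data, rows_per_chunk):
    """Local equivalent of parallelizing chunk ids and mapping each to a chunk."""
    n_chunks = -(-len(data) // rows_per_chunk)  # ceiling division
    chunk_ids = range(n_chunks)
    # In Spark, this list comprehension would be a map over an RDD of chunk ids.
    return [read_chunk(data, chunk_id, rows_per_chunk) for chunk_id in chunk_ids]
```

With a live `SparkContext` named `sc`, the same pattern would read roughly `sc.parallelize(range(n_chunks), n_chunks).map(lambda chunk_id: read_chunk(...))`, so that only small integers are shipped to the workers and each worker reads its own chunk.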

Auto-chunk Algorithm
We then propose an auto-chunk algorithm that calculates a better-planned task granularity to guide parallel operation. First, a dimension array specifying the chunk unit size is calculated from both the total size of the input multidimensional array and the available computing resources. Once the chunk dimension size is determined, a B-tree structure is generated that contains the chunk index and chunk offset of each chunk for parallel retrieval. The details of the auto-chunk algorithm are as follows:

Algorithm 1 Auto-chunk Algorithm
Precondition: n_w is the number of workers in Spark, n_d is the dimension of the scientific data.
1: function auto_chunk(n_w, n_d)
2: if ALL i ∈ n_d mod n_w = 0 then
3:
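Since only the opening of Algorithm 1 is available here, the following is a sketch under stated assumptions rather than the published algorithm. It reuses the divisibility test from Algorithm 1, but what each branch does is assumed: when every dimension is divisible by the worker count, each dimension is split evenly, and otherwise only the first dimension is split, rounding up. A plain dictionary stands in for the B-tree of chunk indices and offsets, and offsets are counted in elements in row-major order. The names `auto_chunk_shape` and `build_chunk_index` are hypothetical.

```python
from itertools import product


def auto_chunk_shape(n_workers, dims):
    """Pick a chunk shape for an array of shape `dims` (assumed heuristic)."""
    if n_workers < 1 or any(d < 1 for d in dims):
        raise ValueError("worker count and dimensions must be positive")
    # Divisibility test taken from Algorithm 1; the branch bodies are assumptions.
    if all(d % n_workers == 0 for d in dims):
        return tuple(d // n_workers for d in dims)
    # Assumed fallback: split only the first dimension, rounding up.
    first = -(-dims[0] // n_workers)
    return (first,) + tuple(dims[1:])


def build_chunk_index(dims, chunk_shape):
    """Map each chunk index to the row-major element offset of its first element.

    A dict is used here in place of the B-tree mentioned in the text.
    """
    # Number of chunks along each dimension (ceiling division).
    grid = [-(-d // c) for d, c in zip(dims, chunk_shape)]
    # Row-major strides of the full array, in elements.
    strides = [1] * len(dims)
    for axis in range(len(dims) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * dims[axis + 1]
    index = {}
    for chunk_id, coord in enumerate(product(*(range(g) for g in grid))):
        origin = [c * size for c, size in zip(coord, chunk_shape)]
        index[chunk_id] = sum(o * s for o, s in zip(origin, strides))
    return index
```

For the 3815 x 9000 input mentioned earlier and 8 workers, the divisibility test fails because 3815 is not a multiple of 8, so this sketch falls back to row blocks of 477 rows, which gives 8 chunks with a shorter final one.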

Experimental Evaluation
We evaluate SDAC against H5Spark [5] on 6 benchmarks. The benchmarks are run on a single machine with 4 AMD Opteron(tm) Processor 6380 CPUs with 64 cores, and on 8 worker nodes, each with an 8-core 2.10GHz Intel E5-2620v4, 16GB of RAM and 4 Spark executor threads. The evaluation involves 3 datasets, including 14.15GB of HDF5 data and 14.37GB of NetCDF data. Figure 2(a) shows the overall speedups relative to H5Spark in standalone mode: Max, Min and PKTM achieve speedups between 1.8x and 4.3x, while LR, Genetic and K-means achieve smaller speedups between 1.2x and 2.0x. Figure 2(b) shows the overall speedups over H5Spark in distributed mode: Max, Min and PKTM achieve speedups between 1.2x and 2.3x, while Genetic and K-means achieve relatively smaller speedups between 1.0x and 1.5x.

Conclusion
In this paper, we propose a lightweight module named SDAC (Scientific Data Auto Chunk) atop the Apache Spark framework to bridge the gap between heterogeneous scientific data and Spark RDDs. We describe our efforts to support scientific data formats such as HDF5, NetCDF, ADIOS, SGY and FITS. An auto-chunk algorithm is integrated to guide parallel I/O, offering a finer-grained strategy for determining task granularity by partitioning the input dataset into predefined chunks. We showcase the performance gains across 6 benchmarks compared with H5Spark in both standalone and distributed mode. As future work, we plan to exploit Spark GraphX and the machine learning library (MLlib) for a deeper study of scientific data analytics.