CSAS: Cost-Based Storage Auto-Selection, a Fine Grained Storage Selection Mechanism for Spark

To improve system performance, Spark keeps RDDs in memory for later access through its caching mechanism, and it provides a variety of storage levels for cached RDDs. However, the manual, RDD-grained storage level selection mechanism cannot adapt to the computing resources of each node. In this paper, we first present a fine-grained automatic storage level selection mechanism. We then assign a storage level to each partition based on a cost model that fully considers the status of system resources as well as compression and serialization costs. Experiments show that our approach offers up to a 77% performance improvement over the default storage level scheme provided by Spark.


Introduction
To balance volume and speed, Spark [1] provides five flags to mark a storage level, indicating whether to use disk, memory, offHeap, serialization, and replication. However, Spark's storage level selection mechanism has two problems. First, the storage level of an RDD is set manually by the programmer; the default storage level is MEMORY_ONLY, under which the RDD can only be cached in memory. Experiments show significant performance differences among storage levels: a reasonable storage level decision improves performance, whereas a wrong decision can lead to performance degradation or even failure. Second, the RDD-grained storage level selection mechanism may lead to lower resource utilization. In Spark, all partitions of a cached RDD use the same storage level, although different RDDs may use different storage levels. However, an RDD is divided into several partitions, which differ in size and are located on different executors. Some partitions of an RDD may be computed on executors that have enough free memory, while others remain pending on executors that do not.
In this paper, we propose a fine-grained storage level selection mechanism. A storage level is assigned to an RDD partition, not to the whole RDD, before the partition is cached. The storage level of an RDD partition is selected automatically based on a cost model that takes full account of the executor's memory and the various computing costs of the partition.

Overall Architecture
CSAS (Cost-based Storage Auto-Selection) selects a storage level for a partition, based on future costs, before the partition is cached. The overall architecture, shown in Fig. 1, consists of three components: (i) the Analyzer, which resides in the driver and analyzes the DAG structure of the application to obtain the RDDs that will be cached and their execution flows; (ii) Collectors, one in each executor, which collect real-time information during task execution, such as the creation time and (de)serialization time of each RDD partition; (iii) Storage-level Selectors, also one in each executor, which decide which storage level an RDD partition will use when it is cached.

Analyzer
As mentioned above, the Analyzer obtains the dependencies among RDDs, and the RDDs that will be cached, from the DAG constructed by the DAG Scheduler. There are two types of RDDs: cache RDDs and non-cache RDDs. Before an RDD can be computed, all RDDs it depends on must be ready; if any of them is absent, it must be created first. Non-cache RDDs are recomputed each time, whereas cache RDDs, after their first computation, are fetched from memory, disk, or both through the CacheManager according to their storage level. We therefore need to obtain the interdependencies among cache RDDs. In this paper, we use LCS's DFS algorithm [3] to get the ancestor cache RDDs of an RDD and the creation path of each cache RDD. Finally, all of this information is recorded and used by the Collector to compute the creation cost of an RDD.
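The lineage walk described above can be illustrated with a short Python sketch. It is a generic depth-first search and is not the LCS algorithm from [3]. The function name, the `parents` dictionary, and the example lineage are hypothetical. Starting from an RDD, it walks up the dependency graph, stops at the nearest cached ancestors, and records the uncached RDDs that would have to be recomputed along the creation path.

```python
def ancestor_cache_rdds(rdd, parents, cached):
    """Return (nearest cached ancestors, uncached RDDs on the creation path).

    parents: dict mapping an RDD name to the list of RDDs it depends on.
    cached:  set of RDD names marked for caching.
    """
    ancestors, path, seen = [], [], set()
    stack = list(parents.get(rdd, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in cached:
            # A cached ancestor is read back, so the walk stops here.
            ancestors.append(node)
        else:
            # An uncached ancestor must be recomputed, so keep walking up.
            path.append(node)
            stack.extend(parents.get(node, []))
    return ancestors, path


# Hypothetical WordCount-like lineage: textFile -> flatMap -> map -> reduceByKey
parents = {
    "reduceByKey": ["map"],
    "map": ["flatMap"],
    "flatMap": ["textFile"],
    "textFile": [],
}
cached = {"textFile", "flatMap"}
result = ancestor_cache_rdds("reduceByKey", parents, cached)
```

In this example, creating `reduceByKey` needs only the cached `flatMap` plus a recomputation of `map`; the walk never reaches `textFile`, because the cached `flatMap` cuts the path.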

Collector
A Collector collects the following information. Create cost: the time spent computing an RDD partition after all of its ancestor cache partitions are ready, denoted as $C_{create}$. (De)serialization cost: when the (de)serialization cost of an RDD partition is unknown, we estimate it from the size of the partition and the per-MB cost of the corresponding RDD [3], denoted as SPM for serialization and DSPM for deserialization. (De)compression cost is estimated in the same way.
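The per-MB estimate can be sketched in a few lines of Python. The text only states that an unknown cost is estimated from the partition size and the per-MB cost of its RDD. How that per-MB cost is derived is an assumption here: it is taken as total measured time divided by total size over the partitions already observed. Both function names are hypothetical.

```python
def per_mb_cost(samples):
    """Per-MB cost (e.g. SPM or DSPM) of an RDD.

    samples: list of (size_mb, measured_seconds) for partitions of the RDD
    whose cost has already been measured. Averaging by total size is an
    assumption of this sketch, not a detail given in the text.
    """
    total_size = sum(size for size, _ in samples)
    total_time = sum(seconds for _, seconds in samples)
    return total_time / total_size if total_size else 0.0


def estimate_cost(size_mb, cost_per_mb):
    """Estimated cost of a partition whose cost has not been measured."""
    return size_mb * cost_per_mb


# Two observed partitions of the same RDD, then an estimate for a third.
spm = per_mb_cost([(10, 0.5), (30, 1.5)])
estimated = estimate_cost(20, spm)
```

The same two functions would apply unchanged to deserialization, compression, and decompression, each with its own per-MB figure.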
Based on the above definitions, we can calculate the cache cost of an RDD partition in different scenarios, denoted as $C_{cache}$. To obtain the optimal storage level for each partition, we compute the cache cost of the partition in the Normal, Serialization, Disk, and Compression scenarios to determine each storage level flag of the partition.
As shown in Fig. 2, the RDD partitions in memory reach saturation at time t. From then on, pending RDD partitions must wait until some tasks finish and free enough space for them to run. The computation within a stage is therefore divided into two phases: a parallel computing phase and a sequence computing phase. In the parallel computing phase, operations on RDD partitions do not interfere with each other. In the sequence computing phase, by contrast, an operation on an RDD partition must be delayed until another computation finishes. The sequence computing phase begins once the first task to finish in the parallel computing phase has released its memory. Thus, the worst-case caching cost in stage i is given by Eq. (1), where n is the total number of RDD partitions that will be cached in the remainder of this stage; m is the number of RDD partitions computed in parallel; and $maxC^{k}_{cache}$ is one of the top (n − m)/m largest cache costs among the RDD partitions in the sequence computing phase. The second half of Eq. (1) is the sum of the top (n − m)/m cache costs among the RDD partitions in the sequence computing phase.
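A minimal Python sketch of the worst-case caching cost described above. The second term follows the text: the sum of the top (n − m)/m cache costs in the sequence computing phase. Two points are assumptions rather than statements from the paper: the first term is taken to be the largest cache cost among the m partitions in the parallel computing phase, and (n − m)/m is rounded up to an integer. The function name and its inputs are hypothetical.

```python
import math


def worst_case_cache_cost(parallel_costs, sequence_costs):
    """Worst-case caching cost of one stage under the stated assumptions.

    parallel_costs: cache costs of the m partitions computed in parallel.
    sequence_costs: cache costs of the n - m partitions that must wait.
    """
    m = len(parallel_costs)
    n = m + len(sequence_costs)
    if m == 0:
        raise ValueError("at least one partition must run in parallel")
    # Number of waiting rounds; rounding up is an assumption of this sketch.
    rounds = math.ceil((n - m) / m)
    # Worst case: every round is dominated by one of the largest waiting costs.
    top = sorted(sequence_costs, reverse=True)[:rounds]
    # First term (assumed): the slowest partition of the parallel phase.
    return max(parallel_costs) + sum(top)


# m = 2 partitions in parallel, n - m = 3 pending, so ceil(3 / 2) = 2 rounds.
cost = worst_case_cache_cost([2.0, 3.0], [1.0, 4.0, 2.5])
```

With these inputs the parallel phase contributes 3.0 and the two most expensive pending partitions contribute 4.0 + 2.5, for a total of 9.5.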

Storage-level Selectors
Before an RDD partition is cached, the storage level selector in its executor evaluates an appropriate storage level for the partition based on the worst-case caching cost of the various scenarios. As Algorithm 1 shows, the storage level selected for a cache RDD partition is determined by comparing the worst-case caching costs of the various scenarios under the current memory conditions. All costs used in Algorithm 1 are calculated based on Eq. (1).
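The comparison step can be illustrated with a simplified Python sketch. It is not Algorithm 1 from the paper: it assumes the selector picks the single scenario with the lowest worst-case caching cost among those whose memory footprint fits into the executor's free memory, and that the Disk scenario needs no memory. The function name, the scenario keys, and all numbers are hypothetical.

```python
def select_scenario(worst_costs, memory_needed_mb, free_memory_mb):
    """Pick the cheapest scenario that fits into the free executor memory.

    worst_costs:      dict scenario -> worst-case caching cost.
    memory_needed_mb: dict scenario -> memory the partition would occupy.
    free_memory_mb:   memory currently free on the executor.
    """
    feasible = {
        scenario: cost
        for scenario, cost in worst_costs.items()
        if memory_needed_mb[scenario] <= free_memory_mb
    }
    if not feasible:
        raise ValueError("no scenario fits into the available memory")
    return min(feasible, key=feasible.get)


# Hypothetical costs and footprints for one partition.
worst_costs = {"normal": 1.0, "serialization": 1.8, "compression": 2.4, "disk": 3.0}
memory_needed_mb = {"normal": 512, "serialization": 200, "compression": 120, "disk": 0}

plenty = select_scenario(worst_costs, memory_needed_mb, free_memory_mb=1024)
tight = select_scenario(worst_costs, memory_needed_mb, free_memory_mb=256)
full = select_scenario(worst_costs, memory_needed_mb, free_memory_mb=50)
```

The same partition ends up in a different scenario as free memory shrinks, which is the per-partition, per-executor behaviour that an RDD-grained setting cannot express.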

Performance Evaluations
The experimental platform is a cluster of three different nodes: one acts as both master and executor, and the other two act only as executors. We adopt HDFS for storage, and each partition has one replica. The datasets are generated by BigDataBench [4]. We use two benchmarks, WordCount and KMeans, in our experiments. For convenience of testing, we cache two RDDs for WordCount, textFileRDD and flatMapRDD; their sizes differ by a factor of about nine. All data are normalized to the CSAS execution time. Fig. 3 shows the performance difference between CSAS and native Spark under different cache storage levels. Execution time in native Spark varies greatly across storage levels, and CSAS reduces execution time by 66.7% compared to M M, which is the default scheme in Spark.