GraphScSh: Efficient I/O Scheduling and Graph Sharing for Concurrent Graph Processing

With the increasing need for analyzing graph data, graph systems have to deal efficiently with concurrent graph processing (CGP) jobs. However, existing platforms are inherently designed for a single job, so they incur a high cost when CGP jobs are executed. In this work, we observe that existing systems do not allow CGP jobs to share the graph structure data of each iteration, which introduces redundant accesses to the same graph. Moreover, real-world graphs have highly skewed power-law degree distributions. The gain from adding more external storage devices diminishes rapidly, which calls for careful scheduling to balance I/O pressure across the storage devices. Following this direction, we propose GraphScSh, which handles CGP jobs efficiently on a single machine and focuses on reducing I/O conflict and sharing graph structure data among CGP jobs. We apply a CGP balanced partition method to break graphs into multiple partitions that are stored in multiple external storage devices. Additionally, we present a CGP I/O scheduling method, so that I/O conflict can be reduced and graph data can be shared among multiple jobs. We have implemented GraphScSh in C++, and the experiments show that GraphScSh outperforms existing out-of-core systems by up to 82%.


Introduction
In the past decade, graph analysis has become important in a large variety of domains. Due to the increasing need to analyze graph structure data, it is common for Concurrent Graph Processing (CGP) jobs to be executed on the same processing platform in order to acquire different information from the same graph. For example, Facebook uses Apache Giraph [6] to execute various graph algorithms, such as variants of PageRank [12], SSSP [10], etc. Figure 1 depicts the number of CGP jobs over a large Chinese social network [17]. The stable distribution shows that at least two CGP jobs are executing simultaneously for more than 83.4% of the time. At peak time, over 20 CGP jobs are submitted to the same platform. Also, Figure 2 shows the usage of Chinese map Apps in a week of 2017. We can observe that each map App is used by each user more than five times within a week. In particular, Amap App [2] ranks first and handles over 10 billion route plannings every week; that is to say, it is used more than 60 thousand times per minute on average.
Existing processing systems can process a single graph job efficiently. They improve efficiency either by fully utilizing sequential memory bandwidth or by achieving better data locality and fewer redundant data accesses, as in GraphChi [8], X-Stream [13], GridGraph [20], Graphene [9], and PrefEdge [11]. However, these systems are usually designed for a single graph processing job and are much less efficient when executing multiple CGP jobs. The inefficiencies include I/O conflict and repeated access to the same graph structure data.

I/O Conflict:
When multiple CGP jobs are executed over the same graph, it is commonplace for these jobs to visit the same partition data, resulting in I/O conflict among the jobs. Fortunately, adding external storage devices can reduce this conflict by distributing the I/O of CGP jobs across multiple devices. However, graphs derived from real-world phenomena, like social networks and the web, typically have highly skewed power-law degree distributions [1], which implies that a small subset of vertices connects to a large fraction of the graph. Figure 3 shows the degree distribution of the graph from LiveJournal [14], which is a free online community with almost 10 million members. The highly skewed nature of such graphs challenges the above assumption and makes balancing the I/O load more difficult. Although using multiple storage devices reduces I/O conflict, this conflict is still the bottleneck of overall performance.

Data access problems:
Graph processing jobs usually operate on two types of data [9]: graph structure data and graph state data. The graph structure data mainly consists of vertices, edges, and the information associated with each edge. The graph state data, such as ranking scores for PageRank, is computed within each iteration and consumed in the next iteration. The graph structure data usually occupies a large share of memory, with proportions varying from 71% to 83% for different datasets [19]. However, existing graph platforms do not allow CGP jobs to share the graph structure data in memory, resulting in redundant access to the graph from external storage. Furthermore, existing out-of-core systems leverage various mechanisms to exploit sequential memory bandwidth and achieve better data locality, such as PSW in GraphChi, the edge-centric model in X-Stream, and 2-level hierarchical partitioning in GridGraph. Unfortunately, CGP jobs defeat these optimizations and significantly increase the overhead of random access.
In this paper, we propose GraphScSh, a graph processing system based on multiple external storage devices. Our design concentrates on reducing I/O conflict and sharing the graph structure data among CGP jobs. Specifically, the graph structure data is divided evenly across multiple external storage devices by the CGP balanced partition method. The subgraph of each partition matches the size of memory well, which reduces the overhead of frequent swap operations. Furthermore, we present a new CGP I/O scheduling method based on multiple external storage devices and graph sharing, so that I/O conflict can be reduced and the graph can be shared among multiple CGP jobs.
GraphScSh has been implemented in C++. To demonstrate the efficiency of our solutions, we conducted extensive experiments with GraphScSh and compared its performance with the state-of-the-art system GridGraph over different combinations of CGP jobs. The experiments show that GraphScSh outperforms GridGraph by up to 82% in overall performance.
The rest of this paper is organized as follows. The design details of GraphScSh are presented in Section 2, including the CGP balanced partition scheme and the CGP I/O scheduling method. Section 3 gives the implementation of GraphScSh, followed by the experimental evaluation in Section 4. We then describe related work in Section 5 and conclude in Section 6.

Our Proposed Approach
To reduce I/O conflict and redundant access to the graph, we propose GraphScSh, which is based on multiple external storage devices and is designed to share the graph structure data among CGP jobs. Existing partitioning methods are usually designed for a single job. When CGP jobs are executed, we cannot ensure that the partition sizes of all jobs match the size of memory, which results in frequent swap-in and swap-out operations. We propose a new partitioning method to process CGP jobs, as shown in Figure 4.

CGP Balanced Partition
The graph is divided into n partitions, and each partition includes a vertex set and an edge set. Within a vertex set, the vertex ids are contiguous. The edge set of a partition consists of all edges whose source vertex is in the partition's vertex set. When GraphScSh executes graph algorithms, the size of each partition depends on both the memory configuration and the number of CGP jobs, so that the data of each vertex set fits into memory. Additionally, GraphScSh leverages multiple external storage devices to store the graph data. For load balance, different partitions are stored on different storage devices and every partition holds the same number of edges. The position disk_id of each partition in the external storage is determined by partition_id and disk_num, where partition_id is the id of the graph partition and disk_num is the number of external storage devices.
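The partitioning and placement rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not GraphScSh's code: the names `Partition`, `balanced_partition`, and `disk_of` are hypothetical, and round-robin placement (`partition_id % disk_num`) is an assumed rule that uses only the two quantities named above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor: a contiguous vertex range plus its out-edge count.
struct Partition {
    std::uint32_t first_vertex;  // inclusive
    std::uint32_t last_vertex;   // exclusive
    std::uint64_t edge_count;    // edges whose source lies in the range
};

// Split vertices [0, out_degree.size()) into at most n contiguous ranges with
// roughly equal edge counts. out_degree[v] is the out-degree of vertex v.
std::vector<Partition> balanced_partition(const std::vector<std::uint64_t>& out_degree,
                                          std::size_t n) {
    assert(n > 0);
    std::uint64_t total = 0;
    for (std::uint64_t d : out_degree) total += d;
    const std::uint64_t target = (total + n - 1) / n;  // ceil(total / n)

    std::vector<Partition> parts;
    std::uint32_t begin = 0;
    std::uint64_t acc = 0;
    for (std::uint32_t v = 0; v < out_degree.size(); ++v) {
        acc += out_degree[v];
        const bool is_last = (v + 1 == out_degree.size());
        // Close the range once it reaches the target, keeping one slot for the tail.
        if ((acc >= target && parts.size() + 1 < n) || is_last) {
            parts.push_back({begin, v + 1, acc});
            begin = v + 1;
            acc = 0;
        }
    }
    return parts;
}

// Assumed round-robin placement of partitions onto storage devices.
std::size_t disk_of(std::size_t partition_id, std::size_t disk_num) {
    assert(disk_num > 0);
    return partition_id % disk_num;
}
```

Because vertex ranges must stay contiguous, edge counts are only approximately equal on skewed graphs; a single high-degree vertex can push one partition above the target.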

CGP I/O Scheduling
Based on the above partitioning method, we break the graph structure data evenly into multiple partitions, which are stored in multiple external storage devices.
To reduce I/O conflict and share the graph among CGP jobs, we propose a CGP I/O scheduling method based on the CGP balanced partition method. The scheduling method includes two strategies, one for load balance and one for graph sharing. First, for load balance, we count the number of jobs running on each external storage device and select the device with the fewest jobs as the target. During the execution of CGP jobs, the system records the partition_id that each job visits. The position of each graph partition is computed from the mapping between partitions and external storage devices. For example, suppose n jobs are executed, where m of the other jobs visit the first external storage device and (n − 1 − m) access the second. If m > (n − 1)/2, the second device is selected as the target; otherwise, the first is selected. In general, if there are k external storage devices and the numbers of jobs on them are $n_1, n_2, \ldots, n_k$, the device with the fewest jobs is selected.
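The load-balance strategy amounts to choosing the storage device with the smallest job count. The sketch below illustrates that rule; the function name `pick_least_loaded` is hypothetical, and ties are resolved in favour of the lower index, which matches the two-device example above, where the first device is chosen unless m > (n − 1)/2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// jobs_per_storage[i] = number of jobs currently reading from storage i.
// Returns the index of the storage with the fewest jobs (lowest index on ties).
std::size_t pick_least_loaded(const std::vector<std::size_t>& jobs_per_storage) {
    assert(!jobs_per_storage.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < jobs_per_storage.size(); ++i) {
        if (jobs_per_storage[i] < jobs_per_storage[best]) best = i;
    }
    return best;
}
```

With two devices and job counts m and n − 1 − m, the function returns the second device exactly when m is strictly larger than n − 1 − m.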
Second, we leverage a synchronization (sync) field to reduce the total number of I/O operations as much as possible when sharing the same graph, as Figure 5 shows. The sync field mainly records information about the mapping from graph partitions to memory, including the mapping address mmap_addr [18], the number of edges edge_num, and the file descriptor fd. In addition, the field includes the total number of jobs unit_num, which determines whether the mapping of the partition should be removed. Specifically, the system uses unit_num to decide whether the partition data has already been mapped into memory. If unit_num = 0, the partition is not being visited by any job and must be loaded into memory through mapping. Otherwise, the partition has already been loaded into memory by another job, and the current job accesses the partition through the address stored in the field.
Suppose that the number of external storage devices is k and the concurrent graph job is A. The I/O scheduling of CGP jobs contains the following steps:
- According to the synchronization information of CGP jobs and the mapping information between partitions and disks, the system counts the number of jobs executing on each external storage device as $n_1, n_2, \ldots, n_k$, respectively.
- According to the same synchronization and mapping information, the system counts the partitions in each external storage device as $s_1, s_2, \ldots, s_k$, respectively, and records their partition_id values.
- The system sorts the external storage devices according to the values of $n_1, n_2, \ldots, n_k$. The corresponding device ids are then added to the set U in ascending order of the number of jobs on each device.
- The system examines each external storage device in U one by one. If the set $s_i$ of device i contains a partition that has not been accessed, device i is selected as the target.
- If the partition data in memory has already been processed by job A, A visits each storage device in U to find data that it has not yet used. If such data exists, the corresponding external storage device becomes the target and the current iteration ends.
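The reference-counting role of the sync field can be sketched as below. The field names mmap_addr, edge_num, fd, and unit_num come from the text; the struct layout, the `SharedPartition` class, and the injected `MapFn`/`UnmapFn` callbacks are hypothetical stand-ins, for instance for thin wrappers around POSIX `mmap` and `munmap`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>

// Fields named after the text; the layout itself is an assumption.
struct SyncField {
    void*         mmap_addr = nullptr;  // address of the mapped partition
    std::uint64_t edge_num  = 0;        // number of edges in the partition
    int           fd        = -1;       // descriptor of the partition file
    std::size_t   unit_num  = 0;        // jobs currently using the mapping
};

// Hypothetical wrapper: map on first use, share afterwards, unmap on last release.
class SharedPartition {
public:
    using MapFn   = std::function<void*(int)>;   // maps the file behind fd
    using UnmapFn = std::function<void(void*)>;  // removes the mapping

    SharedPartition(int fd, std::uint64_t edge_num, MapFn map, UnmapFn unmap)
        : map_(std::move(map)), unmap_(std::move(unmap)) {
        sync_.fd = fd;
        sync_.edge_num = edge_num;
    }

    // unit_num == 0: no job holds the partition, so it is mapped now.
    // Otherwise the existing mapping is reused and no extra I/O is issued.
    void* acquire() {
        std::lock_guard<std::mutex> lock(mu_);
        if (sync_.unit_num == 0) sync_.mmap_addr = map_(sync_.fd);
        ++sync_.unit_num;
        return sync_.mmap_addr;
    }

    // The last job to finish with the partition removes the mapping.
    void release() {
        std::lock_guard<std::mutex> lock(mu_);
        assert(sync_.unit_num > 0);
        if (--sync_.unit_num == 0) {
            unmap_(sync_.mmap_addr);
            sync_.mmap_addr = nullptr;
        }
    }

    std::size_t users() const {
        std::lock_guard<std::mutex> lock(mu_);
        return sync_.unit_num;
    }

private:
    SyncField sync_;
    MapFn map_;
    UnmapFn unmap_;
    mutable std::mutex mu_;
};
```

Injecting the map and unmap operations keeps the sketch independent of any particular file layout; how GraphScSh actually synchronizes access to the field is not specified in the text.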
Assume that the total execution time of a graph job is $T$, its computation time is $T_c$, and its I/O wait time is $T_w$. When $N$ jobs are executed on the same graph, the computation times of the jobs are $T_{C1}, T_{C2}, \ldots, T_{CN}$, respectively, and the I/O wait time of each job is $T_w$. The total execution time of existing systems can be described as

$$T_{total} = T_{C-MAX} + N \cdot T_w,$$

where $T_{C-MAX} = \max(T_{C1}, T_{C2}, \ldots, T_{CN})$. Suppose that the number of external storage devices is $D$. With load balancing, the I/O pressure is spread evenly across the devices, so the number of jobs running on each device is $N/D$ and the total I/O wait time drops from $N \cdot T_w$ to $(N/D) \cdot T_w$. The new total execution time can be described as

$$T_{new} = T_{C-MAX} + \frac{N}{D} \cdot T_w.$$

We can see that, in theory, the new I/O scheduling reduces the I/O wait time of the existing methods by up to $(N - N/D) \cdot T_w$. For example, with $N = 4$ and $D = 2$, the I/O wait term drops from $4 T_w$ to $2 T_w$.

Implementation
We have implemented GraphScSh in C++. Figure 6 illustrates the modules of GraphScSh, including graph management, mapping management, data structure, operation module, and graph algorithms. We mainly focus on two parts in this section: the operation module and the graph algorithms.

Operation Module
The function of this module is achieved by the Scatter and Gather operations. In the Scatter phase, the module accesses the graph in a streaming way through the function get_next_edge() and generates update information according to the state data. In the Gather phase, it reads the update data and updates the state data. Traversal is the kernel operation and is implemented by the function get_next_edge(). First, the function determines whether the current partition has been fully visited. If not, the next edge is accessed. Otherwise, get_next_edge() checks whether all partitions of this iteration have been visited. If so, the next iteration starts. If not, the function FindNextPartition() is called to find the next partition to visit. The implementation details of FindNextPartition() are described in Algorithm 1.
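The control flow of get_next_edge() described above can be sketched as follows. This is an in-memory illustration only: `EdgeCursor` and its members are hypothetical names, partitions are plain vectors rather than memory-mapped files, and the partition chooser is a simple placeholder (first unvisited partition) where the scheduling-aware FindNextPartition() would be called.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Edge { std::uint32_t src, dst; };

// Hypothetical in-memory stand-in for one job's view of the partitioned graph.
class EdgeCursor {
public:
    explicit EdgeCursor(const std::vector<std::vector<Edge>>& partitions)
        : parts_(partitions),
          visited_(partitions.size(), false),
          current_(partitions.size()),  // no partition selected yet
          offset_(0) {}

    // Writes the next edge of the current iteration into `out` and returns true.
    // Returns false once every partition has been visited; the cursor is then
    // reset so that the following call starts the next iteration.
    bool get_next_edge(Edge& out) {
        for (;;) {
            // Case 1: the current partition still has unread edges.
            if (current_ < parts_.size() && offset_ < parts_[current_].size()) {
                out = parts_[current_][offset_++];
                return true;
            }
            // Case 2: all partitions were visited, so this iteration is over.
            const std::size_t next = find_next_partition();
            if (next == parts_.size()) {
                reset();
                return false;
            }
            // Case 3: switch to another partition that is still unvisited.
            visited_[next] = true;
            current_ = next;
            offset_ = 0;
        }
    }

private:
    // Placeholder policy: the first unvisited partition.
    std::size_t find_next_partition() const {
        for (std::size_t i = 0; i < visited_.size(); ++i) {
            if (!visited_[i]) return i;
        }
        return parts_.size();
    }

    void reset() {
        visited_.assign(visited_.size(), false);
        current_ = parts_.size();
        offset_ = 0;
    }

    const std::vector<std::vector<Edge>>& parts_;  // must outlive the cursor
    std::vector<bool> visited_;
    std::size_t current_;
    std::size_t offset_;
};
```

A Scatter phase would call `get_next_edge` in a loop until it returns false, at which point every edge of every partition has been streamed exactly once.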

Algorithm 1 FindNextPartition
Input: The partition set of the graph, unaccess_partition; the set of external storage devices, U; the visited partition sets of the graph, $s_1, s_2, \ldots, s_k$.
Output: The next partition to be visited, partition_index.
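One way to realize the inputs and output listed for Algorithm 1 is sketched below. It is an interpretation, not the paper's algorithm: the function signature is hypothetical, U is rebuilt here from per-device job counts, each $s_i$ is treated as the list of partition ids held by device i, and -1 is an assumed sentinel for "no partition left".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <set>
#include <vector>

// Returns the next partition for a job, or -1 when nothing is left to visit.
//   unaccess_partition : partitions this job has not processed yet
//   jobs_per_storage   : n_1..n_k, jobs currently running on each storage
//   storage_partitions : s_1..s_k, partition ids held by each storage
int find_next_partition(const std::set<int>& unaccess_partition,
                        const std::vector<std::size_t>& jobs_per_storage,
                        const std::vector<std::vector<int>>& storage_partitions) {
    assert(jobs_per_storage.size() == storage_partitions.size());

    // Build U: storage ids ordered by ascending job count (stable on ties).
    std::vector<std::size_t> U(jobs_per_storage.size());
    std::iota(U.begin(), U.end(), std::size_t{0});
    std::stable_sort(U.begin(), U.end(), [&](std::size_t a, std::size_t b) {
        return jobs_per_storage[a] < jobs_per_storage[b];
    });

    // Take the first storage in U that still holds an unvisited partition.
    for (std::size_t storage : U) {
        for (int partition_id : storage_partitions[storage]) {
            if (unaccess_partition.count(partition_id) != 0) return partition_id;
        }
    }
    return -1;
}
```

The sketch covers only the load-balance ordering; a fuller version would first prefer a partition that another job has already mapped into memory, so that it can be shared without new I/O.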

Implementations of Graph Algorithm
We define Graph as the base class, which provides a programming interface for graph algorithms. Class Graph defines five virtual functions: initUnit() for initialization, output() for outputting the result, reset() for cleaning up after one iteration, Scatter(), and Gather(). The function initUnit() performs the initialization work of a graph algorithm, for example, computing the out-degree of each vertex in PageRank. The function reset() resets the partition sets that workers have visited and the number of partitions accessed on each external storage device. Algorithms 2 and 3 give examples of how to implement graph algorithms on GraphScSh, which uses the edge-centric Scatter-Gather model to run graph algorithms.
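The shape of such a base class can be sketched as follows. The five virtual function names come from the text; their signatures, the `run` driver, and the `CallTrace` subclass are hypothetical and exist only to show the order in which an iteration would invoke them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Base interface with the five virtual functions named in the text.
// Signatures are assumptions; the real class likely carries graph state.
class Graph {
public:
    virtual ~Graph() = default;
    virtual void initUnit() = 0;  // one-time setup, e.g. out-degrees for PageRank
    virtual void output() = 0;    // emit the final result
    virtual void reset() = 0;     // clean up after one iteration
    virtual void Scatter() = 0;   // stream edges and generate updates
    virtual void Gather() = 0;    // apply updates to the state data

    // Hypothetical driver showing the call order of one job.
    void run(std::size_t iterations) {
        initUnit();
        for (std::size_t i = 0; i < iterations; ++i) {
            Scatter();
            Gather();
            reset();
        }
        output();
    }
};

// Toy algorithm that only records which hooks were called, in order.
class CallTrace : public Graph {
public:
    std::string trace;
    void initUnit() override { trace += 'I'; }
    void output() override { trace += 'O'; }
    void reset() override { trace += 'R'; }
    void Scatter() override { trace += 'S'; }
    void Gather() override { trace += 'G'; }
};
```

A concrete algorithm such as PageRank would override the same five hooks and keep its rank and out-degree arrays as member data.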

Algorithm 2 PageRank Scatter
for each edge e of graph do
    update_t upt;
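Since only the opening lines of Algorithm 2 are available here, the following is a generic edge-centric PageRank written in the same Scatter-Gather style, not a reconstruction of the paper's listing. The type name `update_t` comes from the fragment above; its fields, the function names, and the damping factor of 0.85 are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Edge { std::uint32_t src, dst; };

// Fields are assumed; the fragment above only names the type.
struct update_t {
    std::uint32_t target;  // destination vertex of the update
    double value;          // rank contribution sent along the edge
};

// Scatter: stream every edge and emit rank[src] / out_degree[src] to dst.
std::vector<update_t> pagerank_scatter(const std::vector<Edge>& edges,
                                       const std::vector<double>& rank,
                                       const std::vector<std::uint32_t>& out_degree) {
    std::vector<update_t> updates;
    updates.reserve(edges.size());
    for (const Edge& e : edges) {
        assert(out_degree[e.src] > 0);  // e itself is an out-edge of e.src
        update_t upt;
        upt.target = e.dst;
        upt.value = rank[e.src] / out_degree[e.src];
        updates.push_back(upt);
    }
    return updates;
}

// Gather: sum the contributions per vertex and apply the damping factor.
void pagerank_gather(const std::vector<update_t>& updates,
                     std::vector<double>& rank,
                     double damping = 0.85) {
    std::vector<double> sum(rank.size(), 0.0);
    for (const update_t& u : updates) sum[u.target] += u.value;
    const double base = (1.0 - damping) / static_cast<double>(rank.size());
    for (std::size_t v = 0; v < rank.size(); ++v) {
        rank[v] = base + damping * sum[v];
    }
}
```

On a directed 3-cycle with uniform initial ranks, one Scatter followed by one Gather leaves every rank at 1/3, which is a convenient sanity check.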

Experiment Environment and Datasets
The hardware platform used in our experiments is a single machine with a 6-core 1.60 GHz Intel(R) Xeon(R) E5-2603 CPU, 8 GB of memory, and two SSDs with 300 GB. The program is compiled with g++ version 11.0.

Comparison with GridGraph
To compare the performance of GridGraph and GraphScSh, we simultaneously submit multiple jobs to each system. The number of partitions in GraphScSh is set to the same value as in GridGraph, and different datasets have different numbers of partitions.
The execution times of various graph processing algorithms are shown in Table 2. To compare the systems fairly, each set of CGP jobs consists of graph algorithms with the same convergence speed, run on different datasets. For completeness, the experiments are conducted under different degrees of parallelism (DOP) [16].
Twitter: First, for the Twitter dataset, we evaluate the total execution time and the speed-up ratio of various CGP jobs with the DOP set to 2, 3, and 4, as shown in Figures 7(a), 7(b), and 7(c), respectively. In general, for different combinations of CGP jobs, the execution time of GraphScSh is less than that of GridGraph, and the speed-up ratio grows as the DOP increases. Under the same DOP but different combinations, the longer the execution time of the CGP jobs, the more GraphScSh outperforms GridGraph. Specifically, on the Twitter dataset, the combinations 2WCC, 3WCC, and 4WCC are accelerated by 56.93%, 65.75%, and 70.8%, respectively. This is because executing CGP jobs concurrently on GridGraph causes severe I/O conflict.
RMAT26: Next, we execute different combinations of CGP jobs on RMAT26 to compare GridGraph and GraphScSh, as Figures 8(b) and 8(c) show. When the DOP is 3 or 4, the performance of GraphScSh is better than that of GridGraph. In particular, as the DOP increases, the speed-up ratio grows gradually. For example, GraphScSh outperforms GridGraph by 34%, 40.5%, and 45.6% under the combinations 2BFS, 3BFS, and 4BFS, respectively.
ER26: Finally, from Figures 9(a), 9(b), and 9(c), we can observe that the total execution time of GraphScSh is much less than that of GridGraph on the ER26 dataset. For example, for the combinations 2PR, 3PR, and 4PR, GraphScSh outperforms GridGraph by 64.67%, 76.03%, and 82%, respectively. Under the same DOP, the execution time of GraphScSh varies less across different combinations of CGP jobs than that of GridGraph. This also means that GraphScSh with GSSC and MSGL is well suited to CGP jobs.

Related Work
With the explosion of graph scale, many graph processing systems have been created to achieve high efficiency for graph analysis. They improve efficiency either through a prefetcher for graph algorithms or by fully utilizing sequential memory bandwidth.
PrefEdge [11] is a prefetcher for graph algorithms that parallelises requests to derive maximum throughput from SSDs. PrefEdge combines a judicious distribution of graph state between main memory and SSDs with an innovative read-ahead algorithm to prefetch needed data in parallel. GraphChi [8] is a disk-based system for computing efficiently on graphs with billions of edges. By using a novel parallel sliding windows method, GraphChi is able to execute several advanced data mining, graph mining, and machine learning algorithms on very large graphs, using just a single consumer-level computer. X-Stream [13] is novel in using an edge-centric rather than a vertex-centric implementation of the scatter-gather model, and in streaming completely unordered edge lists rather than performing random access. GridGraph [20] is an out-of-core graph engine that uses a grid representation for large-scale graphs, partitioning vertices and edges into 1D chunks and 2D blocks, respectively, which can be produced efficiently through lightweight range-based shuffling.
Unfortunately, when CGP jobs are executed on the systems above, they incur a high extra cost (e.g., inefficient memory use and high fault tolerance cost). Following this observation, Seraph [17] is designed to handle CGP jobs based on a decoupled data model, which allows multiple concurrent jobs to share graph structure data in memory [19]. CGraph [19] proposes a correlations-aware execution model, based on the observation that there are strong spatial and temporal correlations among the data accesses issued by different CGP jobs, because these concurrently running jobs usually need to repeatedly traverse the shared graph structure for the iterative processing of each vertex. Together with a core-subgraph based scheduling algorithm, CGraph enables CGP jobs to efficiently share the graph structure data in memory, and the accesses to it, by fully exploiting such correlations.

Conclusion
This paper introduces GraphScSh, a large-scale graph processing system that supports CGP jobs running on a single machine with multiple external storage devices. GraphScSh adopts a CGP balanced partition method to break graphs into multiple partitions that are stored in multiple external storage devices. In addition, we present a CGP I/O scheduling method, so that I/O conflict can be reduced and the same graph can be shared among multiple CGP jobs. Experimental results show that our approach significantly outperforms existing out-of-core systems when running CGP jobs. In the future, we will further optimize our solution with a snapshot mechanism for efficient graph processing.

Table 1. Data Set Properties

Table 2. Execution Time of Algorithms on GridGraph (s)