A Fine-Grained Performance Bottleneck Analysis Method for HDFS

The performance of HDFS has always been a major concern because of its wide adoption in both production and research environments. However, a fine-grained performance analysis tool that can effectively identify bottlenecks and provide useful guidance for performance optimization is still missing. In this paper, we propose a fine-grained performance bottleneck analysis tool, which extends HTrace with fine-grained instrumentation points that are missing in the official Hadoop distribution. In addition, we propose an effective trace merging method that improves the understandability of our analysis. We analyze the performance of HDFS under different kinds of workloads and obtain previously undiscovered insights.


Introduction
Distributed file systems are widely used in various computing domains such as supercomputing and big data analytics. However, diagnosing performance issues of distributed file systems is still a challenging task, because a performance bottleneck may come from any component of the system, or even from the interaction between components. Therefore, effective performance analysis tools for distributed file systems such as the Hadoop Distributed File System (HDFS) are of vital importance. Currently, much research focuses on end-to-end performance analysis frameworks such as Dapper [5], Magpie, Stardust [6], X-Trace [1], and HTrace [2]. These frameworks capture the information flow of each request through the distributed file system and then derive performance information for each component and for the interactions between components.
Among the above performance analysis tools, HTrace has been merged into the Hadoop release to provide useful performance data. However, the default HTrace instrumentation within Hadoop has the following limitations for fine-grained performance analysis. Firstly, default Hadoop provides very few instrumentation points and captures little detailed information. The major components of HDFS [4], such as the Namenode and the Datanodes, and their interactions are not instrumented. For example, we cannot conclude whether Namenode bookkeeping is the bottleneck, because Hadoop's official implementation has not instrumented the Namenode. Secondly, the default instrumentation in Hadoop cannot obtain detailed parameter information for the instrumented function calls. To analyze the performance of a distributed file system well, we need to know not only the time series of each function call but also the number of bytes processed by each function, so that potential performance bottlenecks can be identified. Lastly, the instrumentation information provided in default Hadoop is difficult to retrieve and visualize. For example, hundreds of megabytes of trace files are generated in just a few minutes, making it hard to locate and diagnose performance issues.
Therefore, this paper focuses on the performance analysis of HDFS by extending HTrace to provide fine-grained instrumentation. In addition, to solve the trace explosion problem, we propose a trace compression method that merges the traces of repeated function calls and keeps only representative statistics during instrumentation. Finally, through experiments on representative big data workloads, we obtain some useful insights.

Fine-grained Instrumentation
The instrumentation in Hadoop's official distribution mainly covers the delay perceived by the client or by a Datanode. However, HDFS involves more complex interactions than those between the Datanodes and the client node. Moreover, the default instrumentation cannot distinguish network delay from local file I/O delay. For this reason, we instrument additional performance-related code blocks, which mainly reside in the Namenode and the Datanodes. Our purpose is to obtain fine-grained measurements of Namenode performance, Datanode local I/O performance, Datanode network performance, and Datanode-to-Datanode data exchange performance. Besides recording function call duration, our instrumentation also encodes important function arguments into the traces, such as the size of the data processed, the block id, and the filename, so that we can monitor the data processing rate and the occurrence of I/O errors. One of the biggest challenges of our instrumentation is that Java has many polymorphic functions. In this case, we instrument every variant of the function and merge them in the trace processing procedure.
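The idea of recording arguments alongside durations can be illustrated with a minimal sketch. It is written in Python for brevity and is not the HTrace API or our Java instrumentation: the `traced` helper, the in-memory `TRACES` list, and the `write_packet` function are all hypothetical names introduced only for this example.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory trace store (assumption); a real tracer would
# hand finished spans to a span receiver instead.
TRACES = []

@contextmanager
def traced(name, **annotations):
    """Record the duration of a code block together with selected arguments."""
    span = {"name": name, "annotations": dict(annotations)}
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        # Keep the failure in the trace so I/O errors can be counted later.
        span["annotations"]["error"] = repr(exc)
        raise
    finally:
        span["duration_s"] = time.perf_counter() - start
        TRACES.append(span)

def throughput_bytes_per_s(span):
    """Data processing rate derived from the annotated size and the duration."""
    size = span["annotations"].get("bytes")
    if size is None or span["duration_s"] <= 0:
        return None
    return size / span["duration_s"]

def write_packet(block_id, filename, payload):
    # Hypothetical stand-in for a Datanode-side write path.
    with traced("writePacket", block_id=block_id, filename=filename,
                bytes=len(payload)):
        return len(payload)
```

With the byte count stored next to the duration, a processing rate can be computed per span after the fact, and spans carrying an `error` annotation can be filtered to count I/O failures.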

Trace Compression
Running HDFS generates a huge volume of traces. In our experiments, more than 1 GB of traces is generated after several minutes of Spark execution. Traditional HDFS performance analysis tools neglect this fact and rely on human labor to find the bottleneck in a large amount of data.
We present an effective method for compressing traces. We observe that, before compression, the traces contain many repeated function calls. For example, a receiveBlock call usually contains hundreds of receivePacket calls. In this case, we merge the repeated receivePacket calls and extract only a few representative statistics from the merged calls. The number of call trees is reduced by more than 90% after trace compression. Formally, we perform a breadth-first traversal from bottom to top inside each call tree and merge the subtrees with the same structure. After compressing inside every call tree, we merge the call trees that have the same structure.
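The two compression steps can be sketched as follows. This is a simplified illustration under stated assumptions, not our implementation: it uses a post-order recursion instead of the bottom-up breadth-first traversal, it treats two subtrees as having the same structure when their function names and compressed children match, and it keeps count, total, minimum, and maximum duration as the representative statistics. The `CallNode` class and all function names are hypothetical.

```python
class CallNode:
    """One function call in a call tree, or a merged group of such calls."""

    def __init__(self, name, duration, children=None):
        self.name = name
        self.children = list(children or [])
        self.count = 1                       # number of calls represented
        self.total = self.min = self.max = duration

def signature(node):
    """Structural key: function name plus the keys of the children, in order."""
    return (node.name, tuple(signature(c) for c in node.children))

def absorb(target, other):
    """Fold `other` into `target`; both must have the same signature."""
    target.count += other.count
    target.total += other.total
    target.min = min(target.min, other.min)
    target.max = max(target.max, other.max)
    for t_child, o_child in zip(target.children, other.children):
        absorb(t_child, o_child)

def merge_same_structure(nodes):
    """Merge nodes that share a signature, keeping first-seen order."""
    merged = {}
    for node in nodes:
        key = signature(node)
        if key in merged:
            absorb(merged[key], node)
        else:
            merged[key] = node
    return list(merged.values())

def compress(node):
    """Step 1: compress inside one call tree, children before parents."""
    node.children = merge_same_structure([compress(c) for c in node.children])
    return node

def compress_forest(roots):
    """Step 2: merge whole call trees that have the same compressed structure."""
    return merge_same_structure([compress(r) for r in roots])
```

Because children are compressed before their parent's signature is computed, two receiveBlock calls with different numbers of receivePacket children end up with the same structure and are merged. One simplification to be aware of: merging siblings by signature discards the original call order among them.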

Experiment Setup
Our experiments are conducted on a seven-node cluster with one master node (the Namenode in HDFS), five slave nodes (the Datanodes in HDFS), and one client node. The master node and the slave nodes are each equipped with a Xeon E5-5620 and 16 GB of memory. To achieve higher throughput, we use an Intel Xeon Phi (Knights Landing) for workload generation. Its many cores and large memory capacity enable the Phi to start many HDFS clients simultaneously. The implementation is shown in Figure 1. Traces are written to local files, which are then collected and stored in a database.

Performance Bottleneck Analysis
Across Workloads - We choose the tiny-sized workload inputs from HiBench. For machine learning workloads, the data is iterated over many times, which generates large traces, so we use sampling (with a sample rate of 0.05) to reduce the trace size. For the Wordcount workload, the largest delay is caused by FileSystem#createFileSystem, which takes 90.21s in total. The second largest delay is caused by DFSOutputStream#close, which takes 10.54s in total. Local I/O plays a negligible role here: the delay of a Datanode flushing its buffer into the local file system is too small to measure. We can therefore conclude that a faster storage medium would not speed up the application significantly. The bottleneck is in the client node, and the process of initializing the FileSystem object has large potential for optimization. We reach a similar conclusion for the Sort, Terasort, Pagerank, LogisticRegression, and Nweight workloads. The Bayes workload differs from the above workloads. Its largest delay is caused by BlockSender#sendBlock. Reading from the local file system causes a 3.10s delay, and reading from a remote Datanode causes a 0.49s delay.
Impact of File Size - We use the Wordcount workload to explore the impact of file size on HDFS performance. We use the tiny, small, large, and huge workload sizes, which contain 32000, 320000000, 3200000000, and 32000000000 respectively. Because of the trace size explosion, we use sample rates of 1, 0.01, 0.001, and 0.0001 respectively. As the data size increases, the impact of FileSystem#createFileSystem becomes weaker. In the tiny-sized workload, this operation causes a total delay of 91.92s, compared with an application time of 28s (the total exceeds the application time because we add up the delays from different Datanodes).
In the small-sized workload, it takes 1.55s, compared with an application time of 32s. In the larger workloads, no call to it was captured by sampling. So for small workloads, the file system creation process is an important bottleneck.
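Why an operation can disappear from the traces at low sample rates can be seen from a short calculation. The sketch below assumes that each trace is kept independently with probability equal to the sample rate, and it uses a purely hypothetical number of occurrences; neither assumption is taken from our measurements.

```python
import random

def miss_probability(sample_rate, occurrences):
    """Probability that none of `occurrences` independent traces is kept."""
    return (1.0 - sample_rate) ** occurrences

def sampled_count(sample_rate, occurrences, rng):
    """Simulate independent per-trace sampling and count the kept traces."""
    return sum(1 for _ in range(occurrences) if rng.random() < sample_rate)

if __name__ == "__main__":
    hypothetical_occurrences = 100  # assumption, for illustration only
    for rate in (1, 0.01, 0.001, 0.0001):
        p = miss_probability(rate, hypothetical_occurrences)
        print(f"sample rate {rate}: miss probability {p:.3f}")
```

For an operation that occurs 100 times in this hypothetical setting, the chance of capturing no instance at all is roughly 37% at a rate of 0.01, 90% at 0.001, and 99% at 0.0001. An operation whose number of calls does not grow with the data size is therefore increasingly likely to be missed as the sample rate is lowered.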
Impact of Request Frequency - In [3], the authors directly model real request patterns from AliCloud in terms of IOPS, inter-arrival time, session size, and read request size. However, AliCloud is a very large cluster containing tens of thousands of nodes. For our small cluster, we multiply the IOPS by different factors α but retain the distributions of the model. As the request frequency increases, we can explore which part of HDFS faces the request pressure, as shown in Table 1.
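One way to scale the request rate by α while keeping the shape of the arrival distribution is to divide every inter-arrival time by α. The sketch below illustrates this reading only; the exponential base distribution and its rate are placeholders chosen for the example and are not the model from [3], and the function names are hypothetical.

```python
import random
import statistics

def scale_inter_arrivals(inter_arrivals, alpha):
    """Divide each inter-arrival time by alpha, which multiplies the rate by alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return [t / alpha for t in inter_arrivals]

def mean_rate(inter_arrivals):
    """Average number of requests per unit of time."""
    return len(inter_arrivals) / sum(inter_arrivals)

def coefficient_of_variation(values):
    """Scale-free measure of spread: standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

if __name__ == "__main__":
    rng = random.Random(42)
    # Placeholder base workload (assumption): exponential inter-arrival times.
    base = [rng.expovariate(10.0) for _ in range(1000)]
    for alpha in (0.5, 1.0, 2.0, 4.0):
        scaled = scale_inter_arrivals(base, alpha)
        print(alpha, mean_rate(scaled), coefficient_of_variation(scaled))
```

Dividing by α changes the mean request rate by exactly that factor, while scale-free properties such as the coefficient of variation are unchanged, so the burstiness of the original pattern is preserved.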
The request delay is mainly caused by the sendBlock operation. However, the average delay of this operation is decreasing. Although FileSystem#createFileSystem plays an important role in request delay, its duration has little to do with request frequency. The delay of sendBlock first increases and then decreases as the request frequency increases. At low frequency, HDFS is in a cold-start state, so the delay is relatively large. When the request frequency is very high, resource contention becomes more severe. The BlockSender#sendPacket and FSNamesystem#getBlockLocations (the Namenode looking up the block locations of a given file) operations show the same trend. Contrary to common expectation, the bottleneck under frequent requests is in neither the Namenode nor the Datanodes. Thus, optimizing the handling of concurrent requests on the client node is more important.

Conclusion
In this paper, we propose an extension to HTrace to support fine-grained performance bottleneck analysis for HDFS. In addition, we propose a trace compression method that merges repeated function calls for efficient performance analysis. We have also conducted a series of experiments to explore the bottlenecks under different workloads and obtained useful insights.