Distributed NVRAM Cache – Optimization and Evaluation with Power of Adjacency Matrix

Abstract. In this paper we build on our previously proposed MPI I/O NVRAM distributed cache for high performance computing. It incorporates NVRAM devices in each cluster node, used as an intermediate cache layer between an application and a file for fast read/write operations, supported through wrappers of MPI I/O functions. In this paper we propose optimizations of the solution, including handling of write requests in a synchronous mode as well as additional modes that prevent data preloading from a file and synchronization on file close when the solution is used as a temporary cache only. Furthermore, we evaluate the solution with a real application that computes powers of an adjacency matrix of a graph in parallel. We demonstrate the superiority of our solution over a regular MPI I/O implementation for various powers and numbers of graph nodes. Finally, we present good scalability of the solution for more than 600 processes running on a large HPC cluster.


Introduction
High performance computing (HPC) has always been driven by the pursuit of better hardware. At the beginning this mainly meant building larger clusters with better CPUs. A breakthrough came with manycore processors, such as graphics processing units (GPUs) used for general purpose processing (e.g. NVIDIA CUDA, OpenCL), coprocessors designed especially for computations (e.g. Intel Xeon Phi, the Epiphany architecture), or even specially designed CPUs (e.g. the Sunway processors used in Sunway TaihuLight, the most powerful supercomputer on the TOP500 list, June 2016 edition).
Apart from the processing unit, another essential component of computer architecture is memory. Here the situation is quite different. Typical improvement of RAM comes from the higher capacity and better parameters (e.g. higher frequency, lower latency) of next generation Double Data Rate (DDR) devices, while progress in storage is mostly associated with faster SSDs. So far, no hardware device that would extend memory properties has gained significant popularity.
Universal memory, for now only hypothetical, is one of the possible candidates for a breakthrough in memory technology. Combining the advantages of RAM (byte-level access, high speed and bandwidth, low latency) with the advantages of SSDs (large capacity and persistence) should markedly increase the performance of many computer systems. Although not available on the market yet, many technologies are being researched to create a practical device, and several of them seem promising [20]. Moreover, recent press reports suggest that we can expect devices with parameters between NAND-based memory (used in SSDs) and DRAM soon [9][10]. To describe such memory in this paper we use the term NVRAM (non-volatile random access memory), keeping in mind that this memory is expected to offer byte-level access.
A full replacement of main memory and storage by a single universal memory would probably trigger the need to redesign the architecture of computing systems at multiple levels, but such a huge change cannot be expected to happen at once. Instead, we assume that the first NVRAM devices will be used as complementary memory together with RAM and storage. For that reason, we decided to focus on the possibilities offered by incorporating supplementary NVRAM devices into HPC platforms.
In 2016 we proposed the idea of an NVRAM distributed cache located as an additional layer between a file system and a parallel distributed application [16]. The extension was transparent to the developer because of its compatibility with the well-known Message Passing Interface (MPI) I/O API [18]. The motivation for this solution was to improve performance and make the development process easier. Initial testing with a set of benchmarks gave promising results. Within this paper we present further research on our MPI I/O NVRAM distributed cache. The research includes performance optimizations as well as evaluation of the solution with a real-life application: computing a power of a graph adjacency matrix. Such an application could be used e.g. for social network analysis or for calculating shortest path lengths between multiple nodes simultaneously.

Related work
In 2009, Kryder and Kim presented a set of thirteen emerging non-volatile memory technologies that had the potential to replace NAND Flash by 2020 [14]. A report published by Wong and Salahuddin in 2015 agrees on the candidates for universal memory technology, but does not try to predict when real devices will appear on the market [20]. Scientists conducting research on magnetoresistive RAM (MRAM) [1][4], spin-transfer-torque MRAM (STT-MRAM) [15][19], or phase change memory (PCM) [2][23] suggest that we can expect it soon.
The 3D XPoint technology, announced by Intel and Micron in 2015 [9][10], is probably the closest to being used in a production environment; the first 3D XPoint based Intel Optane devices are expected in 2016 [11]. A comparison between an NVMe NAND SSD (currently the fastest kind of SSD) and a prototype device based on 3D XPoint gave promising results: the time to access a 4 kB block from an application was reduced almost 7 times [6]. Unfortunately, at the moment of writing this paper we do not have any further Intel Optane performance results.
Emerging memory technologies have triggered research on architectures, algorithms, and applications that would benefit from the properties of the new hardware. NVMcached, a key-value cache for byte-addressable NVRAM, is an example: designing the system with the new memory type in mind allowed performance to be improved by up to 2.85 times [21]. Other examples are the log-structured file system NOVA [22] and our idea for checkpointing in NVRAM using the MPI one-sided API [5].
There are also NVRAM-based solutions related to speeding up I/O operations in HPC. Two papers concern Active NVRAM, a device that, apart from the memory component, includes a low-power CPU [13][12]. Although the proposed architecture has potential benefits, computing units are not expected in the first production devices. Another set of solutions uses SSD devices. Systems like S4D-Cache [7] or SLA-Cache [8] significantly improve the parameters of a PFS; however, we believe that the differences between a typical SSD and byte-addressable NVRAM call for more dedicated solutions.
3 Proposed solution

NVRAM distributed cache architecture
This section is a short summary of our previous research on the NVRAM distributed cache. In a typical MPI application, MPI I/O is used to communicate with a parallel file system (PFS). Figure 1 shows the difference between the classical approach and our solution. Instead of calling an MPI I/O implementation directly, an application uses the NVRAM cache routines. This is transparent to the programmer, as the cache API is the same as the MPI I/O API. The NVRAM cache communicates with the PFS through MPI I/O, which gives instant support for many file systems.
We assume that each node is equipped with its own NVRAM device and participates in the distributed cache. On each node a single thread, called a cache manager, is spawned; it is responsible for creating and managing its part of the cache. When a file is opened, it is split into equally sized parts, one for each cache manager. The cache manager is then responsible for prefetching its whole part of the file (in the file opening phase), serving read, write and sync requests from the application, and flushing all of the data to the PFS (in the file closing phase). The main advantages of the extension, in comparison to typical solutions, are: low latency, achieved by serving requests as fast as possible and omitting complicated data rearrangement algorithms; fully decentralized management with no communication between cache managers, since cache parts are assigned at the beginning and each application process knows exactly where the data is; and minimal metadata, with no cache blocks, no fetched flags (all data is prefetched) and no dirty flags (all data is treated as dirty). Clearly, a significant overhead is introduced by prefetching the whole file and flushing it back at the end. The solution is therefore aimed at applications that access small chunks of data (gaining from the byte addressability of NVRAM) at scattered file locations (with no drawback from omitting staging algorithms). As shown in previous papers [16,17], for long-running and data-intensive HPC applications that operate on small data parts our solution performs better than unmodified MPI I/O. In this paper we want to evaluate it with an application that does not strictly meet those criteria. Another important issue is the persistence of NVRAM. We have shown that our solution can be used to prevent data from being damaged: a consistent state of the file can be recreated from the cache [17]. Although this allows recovery from only several failure types, its low overhead and ease of programming make it complementary, or in some cases even competitive, to checkpointing.
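As an illustration of the decentralized design, the sketch below (our own simplified assumption about the partitioning, not code taken from the extension) shows how any process can compute, with no communication, which cache manager owns a given file offset when the file is split into equally sized contiguous parts:

/* Hypothetical sketch: map a file offset to the cache manager owning it,
   assuming the file is split into num_managers equally sized contiguous
   parts and the last part absorbs any remainder. */
#include <stdint.h>
#include <stdio.h>

static int owner_of_offset(uint64_t offset, uint64_t file_size, int num_managers)
{
    uint64_t part_size = file_size / num_managers;   /* size of one cache part */
    int owner = (int)(offset / part_size);
    if (owner >= num_managers)                       /* offsets in the remainder */
        owner = num_managers - 1;
    return owner;
}

int main(void)
{
    /* Example: a 1 GiB file cached by 4 managers; offset 900 MiB belongs to manager 3. */
    printf("%d\n", owner_of_offset(900ULL * 1024 * 1024, 1ULL << 30, 4));
    return 0;
}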

Extension optimizations
The most significant performance optimization applies to serving write requests. In the first version of the extension, a cache manager responded to the process asynchronously, before the data was actually written to the device. Data consistency was provided by the sequential processing of the cache manager thread. Although such a strategy reduced latency slightly when the number of requests was kept at a low level, it caused a performance drop with more data-intensive applications: in the most popular MPI implementations, write requests started queuing in the cache manager, which caused unpredictable and large growth of latency for successive requests. Changing the communication to a synchronous version, in which the response is sent only after all processing is done, fixed the performance drop.
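To make the change concrete, below is a minimal sketch of a cache manager serving a single write request in the synchronous variant. The message tags, request layout and the write_to_nvram helper are our own assumptions for illustration, not the actual protocol of the extension; the key point is that the acknowledgement is sent only after the data has been stored, so a writing process blocks until its request is fully processed instead of flooding the manager's queue.

/* Hypothetical sketch of synchronous write handling in a cache manager.
   Tags, message layout and write_to_nvram() are illustrative assumptions. */
#include <mpi.h>

#define TAG_WRITE_REQ 1
#define TAG_WRITE_ACK 2

void write_to_nvram(long offset, const char *data, int len); /* assumed helper */

void serve_one_write_request(char *buf, int max_len)
{
    MPI_Status status;
    long offset;
    int len, ack = 0;

    /* Receive the target file offset, then the data to be cached. */
    MPI_Recv(&offset, 1, MPI_LONG, MPI_ANY_SOURCE, TAG_WRITE_REQ,
             MPI_COMM_WORLD, &status);
    MPI_Recv(buf, max_len, MPI_CHAR, status.MPI_SOURCE, TAG_WRITE_REQ,
             MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_CHAR, &len);

    /* Store the data in the local NVRAM part of the cache first ... */
    write_to_nvram(offset, buf, len);

    /* ... and acknowledge only afterwards, so the writer cannot outrun the manager. */
    MPI_Send(&ack, 1, MPI_INT, status.MPI_SOURCE, TAG_WRITE_ACK, MPI_COMM_WORLD);
}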
Another optimization removes unnecessary overhead when opening and closing a file. The introduced improvements rely on better support for three MPI_File_open access modes:
- MPI_MODE_CREATE: if the file does not exist, skip prefetching data,
- MPI_MODE_RDONLY: skip synchronization at file closing,
- MPI_MODE_DELETE_ON_CLOSE: skip synchronization at file closing.
In special cases, such as treating the file as a huge piece of distributed shared memory, most of the cache overhead related to file opening and closing is thus avoided.
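For instance, an application that uses the cached file only as a temporary, distributed scratch area could open it as follows (the file name is illustrative); with these flags the cache neither preloads data from the PFS on open nor synchronizes it back on close:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    /* MPI_MODE_CREATE lets the cache skip prefetching (the file is new),
       MPI_MODE_DELETE_ON_CLOSE lets it skip synchronization with the PFS on close. */
    MPI_File_open(MPI_COMM_WORLD, "scratch.tmp",
                  MPI_MODE_CREATE | MPI_MODE_RDWR | MPI_MODE_DELETE_ON_CLOSE,
                  MPI_INFO_NULL, &fh);

    /* ... reads and writes served from the NVRAM cache ... */

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}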

Graph processing application
Graph theory has many real-world applications, such as obtaining social network properties (e.g. Facebook, Twitter), processing maps and locations (e.g. Google Maps, GPS-based navigation systems), optimizing the layout of connections (e.g. designing cellular network layouts) or preparing recommendations (e.g. Netflix, Google PageRank). Many of the algorithms behind these applications are computationally demanding and, without further optimization, will not be able to handle the increasing volume of data. In this paper we show how to extend a selected algorithm with our MPI I/O NVRAM distributed cache. As the exemplary algorithm we have chosen a transformation that provides multiple graph properties, among others: the number of walks of length n connecting two vertices, shortest path lengths, and the number of triangles in the graph.
Many different data structures can be used to represent graphs. The selected problem requires frequently checking whether two vertices are adjacent, so a representation that minimizes the complexity of this operation is beneficial. The complexity of such a query on an adjacency matrix is O(1), and the disadvantages of an adjacency matrix are irrelevant in the context of the selected algorithm: slow addition or removal of vertices is negligible because the size of the graph is constant, and large memory consumption is unimportant since the NVRAM distributed cache provides storage limited only by the sum of all NVRAM capacities in the cluster. Our implementation therefore uses the adjacency matrix graph representation.
With an adjacency matrix, searching for walks of a particular length between vertices can be done using matrix multiplication. Assuming A is the adjacency matrix, in the matrix A^n each element a_{n(i,j)} represents the number of walks of length n connecting vertex i with vertex j. The idea can be illustrated with exemplary matrices A, A^2 and A^3, from which one can read, for instance: there are 2 walks of length 2 connecting vertex 4 with vertex 3 (a_{2(4,3)} = 2), the shortest path from vertex 3 to vertex 1 has length 2 (the smallest n with a_{n(3,1)} > 0 is 2), and the number of triangles in the graph is 2 (one sixth of the trace of A^3). To compute the power of a matrix efficiently, this application uses the communication-avoiding Cannon's algorithm [3].
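As a small self-contained illustration (our own generic example, not the matrices from the paper's original figure), take the triangle graph on three vertices; each entry of A^n counts the walks of length n between the corresponding vertices:

$$
A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},\qquad
A^2 = \begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix},\qquad
A^3 = \begin{pmatrix} 2 & 3 & 3 \\ 3 & 2 & 3 \\ 3 & 3 & 2 \end{pmatrix}.
$$

Here, for example, $a_{2(1,1)} = 2$ because there are two walks of length 2 from vertex 1 back to itself (through vertex 2 or vertex 3), and $\operatorname{tr}(A^3)/6 = 6/6 = 1$ is the single triangle of this graph.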

Experiments
The MPI I/O NVRAM extension is designed for applications with specific properties. To benefit most from the extension, a data-intensive application should access small data chunks at scattered locations and run long enough to compensate for the overhead of initialization and deinitialization. The implementation of an algorithm for powers of a graph adjacency matrix is an attempt to validate the extension with an application that does not strictly possess these properties: it accesses larger data chunks at once, from neighboring locations, which is especially convenient for the parallel file systems we want to compete with. Within this section we aim to show that the proposed MPI I/O NVRAM distributed cache is beneficial for a wide range of applications by presenting a case study of an application that does not meet the cache's requirements.

Testbed environment
The extension was tested on two clusters, Lap06 and K2, described in detail in Tables 1 and 2. Each node in Lap06 is equipped with an NVRAM hardware simulation platform whose timing parameters are set to pessimistic values; lower access times would yield even better results for the NVRAM version. In contrast, K2 simulates NVRAM using tmpfs, but its size allows scalability to be measured.

Performance tests
Calculating the power of a matrix has different real-world applications, including graph processing. For some of them storing the final matrix is crucial (e.g. searching for the number of walks of a particular length), while others use it only as intermediate values (e.g. searching for the number of triangles). From the perspective of MPI I/O, if the application does not need the final matrix stored on disk, MPI_MODE_DELETE_ON_CLOSE can be used to trigger additional optimizations. For that reason, most test cases are split into groups according to the delete on close mode (on and off). As the execution time does not depend on graph properties other than its size, for each test case we generated n nodes and connected each pair of nodes with a probability of 2%. All scenarios apply to low graph powers, so each value of the adjacency matrix is stored in a 1-byte cell, which results in a final file of size n^2 bytes.
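For instance, for the 20 000-node graphs used in the experiments below, the final file occupies 20 000^2 = 4 x 10^8 bytes, i.e. roughly 400 MB, spread across the cache parts.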
Performance optimizations. As stated in Section 3, handling write requests in a synchronous mode was the most noticeable improvement over the previously proposed extension. A comparison of the synchronous and asynchronous modes for the graph processing application is presented in Figure 2. The results show a 30% reduction of the application's execution time for small computed graph powers; moreover, the performance gain grows as the power increases.
Results with delete on close mode off. According to Figure 3, the execution time of the application grows linearly with the power. The greater the power of the matrix, the greater the execution time and the greater the performance gain from the NVRAM distributed cache. For example, with a power of 2 the proposed extension is less than 20% faster than unmodified MPI I/O, while for larger powers it is more than 40% faster. The increasing performance gain is caused by the overhead of initialization and deinitialization, which is independent of the power.
Figures 4 and 5 present the exponential growth of execution time for different input graph sizes. The plots prepared for a power of 2 show that, for small powers with the delete on close mode off, the proposed extension is beneficial only for small input data. The chart with results for a power of 8 shows that, for higher powers, the NVRAM distributed cache is superior for every input size.

Results with delete on close mode on. The delete on close mode allows the synchronization phase between the NVRAM distributed cache and the parallel file system to be omitted. This reduces the execution time of an application, which is especially important for smaller powers. Figure 6 shows that the proposed solution performs significantly better than regular MPI I/O for powers starting with 2, while Figure 7 shows that the extension gives a performance gain for both small and large input sizes.
Scalability. Figure 8 illustrates the good scalability of the algorithm as well as of the proposed NVRAM cache. With an increasing cluster size, the average I/O-related load of a single node remains constant, because a higher number of I/O requests from the computing nodes is handled by a higher number of cache managers. A typical PFS environment is usually not as flexible and, in case of an unexpectedly heavy load, requires hardware reconfiguration.

Summary and future work
The motivation for this research is the expected incorporation of emerging memory technologies into production devices in the near future, as justified in the introduction and related work. Within this paper we described an optimized version of an MPI I/O distributed cache supported by byte-addressable NVRAM. We then focused on its evaluation with an application that computes powers of an adjacency matrix using Cannon's algorithm. The presented experimental results show that, for the tested application, our solution performed better than regular MPI I/O. Our future plans include further optimization of the extension and its evaluation with a wider range of applications. Moreover, we want to focus on a mechanism that would allow the benefit from our extension to be predicted without actually running the application.

Fig. 1. Exemplary architecture of the solution for cluster nodes; components within a single node are enclosed in the dashed bracket. The NVRAM cache layer, marked in gray, illustrates the difference between the classical architecture and the one extended with the NVRAM cache. The number of nodes in the solution is not limited.

Fig. 2. Power of adjacency matrix: comparison between synchronous and asynchronous write. Values for a fixed graph size (10 000 nodes) and different powers. Lap06 cluster.

Fig. 3. Power of adjacency matrix: comparison between unmodified MPI I/O and the proposed extension. Values for a fixed graph size (20 000 nodes) and different powers. Lap06 cluster. Delete on close mode off.

Fig. 4. Power of adjacency matrix: comparison between unmodified MPI I/O and the proposed extension. Values for a fixed power (2) and different graph sizes. Lap06 cluster. Delete on close mode off.

Fig. 5. Power of adjacency matrix: comparison between unmodified MPI I/O and the proposed extension. Values for a fixed power (8) and different graph sizes. Lap06 cluster. Delete on close mode off.

Fig. 6. Power of adjacency matrix: comparison between unmodified MPI I/O and the proposed extension. Values for a fixed graph size (20 000 nodes) and different powers. Lap06 cluster. Delete on close mode on.

Fig. 7. Power of adjacency matrix: comparison between unmodified MPI I/O and the proposed extension. Values for a fixed power (2) and different graph sizes. Lap06 cluster. Delete on close mode on.

Table 1. Lap06 and K2 clusters: hardware and software configuration.

Table 2. NVRAM simulation platform parameters in the Lap06 cluster.