Engine Using Intel Many Integrated Core Architecture

This paper describes the acceleration of the most computationally intensive kernels of the Blender rendering engine, Blender Cycles, using the Intel Many Integrated Core (MIC) architecture. The proposed parallelization, which uses OpenMP technology, also improves the performance of the rendering engine when running on multi-core CPUs and multi-socket servers. Although GPU acceleration is already implemented in Cycles, its functionality is limited. Our proposed implementation for the MIC architecture supports all features of the engine with improved performance. The paper presents a performance evaluation for three architectures: a multi-socket server, a server with a MIC accelerator (Intel Xeon Phi 5110P) and a server with a GPU accelerator (NVIDIA Tesla K20m).


Introduction
Progress in High-Performance Computing (HPC) plays an important role in science and engineering. Computationally intensive simulations have become an essential part of the research and development of new technologies. Many research groups in the area of computer graphics deal with problems related to the extremely time-consuming process of image synthesis of virtual scenes, also called rendering (e.g., Shrek Forever 3D required 50 million CPU rendering hours [CGW]).
Besides using HPC clusters to speed up computationally intensive tasks, hardware accelerators are extensively used as well. They can further increase the computing power and efficiency of HPC clusters. In general, two types of hardware accelerators can be used: GPU accelerators and MIC coprocessors.
The new Intel MIC coprocessor is composed of up to 61 cores that are low-power, in terms of both energy consumption and per-core performance, when compared to multi-core CPUs. It can be used as a standalone Linux machine or as an accelerator to the main CPU. The peak performance of the top-of-the-line Xeon Phi is over 1.1 TFLOPS (10^12 floating-point operations per second) in double precision and over 2.2 TFLOPS in single precision. The MIC architecture can be programmed using both shared memory models, such as OpenMP or OpenCL (the latter provides compatibility with codes developed for GPUs), and distributed memory models such as MPI.
The implementation presented in this paper has been developed and tested on the Anselm Bullx cluster at the IT4Innovations National Supercomputing Centre in Ostrava, Czech Republic. The Anselm cluster is equipped with both Intel Xeon Phi 5110P and NVIDIA Tesla K20m accelerators [AHW]. For production runs, once the algorithm is fully optimized, the new IT4Innovations Salomon system will be used. Salomon will be equipped with 432 nodes, each with two Intel Xeon Phi 7120P coprocessors [SHW].

Rendering Equation and Monte Carlo Path-Tracing Method
In 1986 Kajiya first introduced the rendering equation in computer graphics [KAJ]. One of its common forms is

L_o(x, ω_o) = L_e(x, ω_o) + ∫_Ω f_r(x, ω_i, ω_o) L_i(x, ω_i) (ω_i · n) dω_i,   (1)

where ω_o is the direction of the outgoing ray, ω_i is the direction of the incoming ray, L_o is the spectral radiance leaving point x in direction ω_o, L_e is the spectral radiance emitted at point x in direction ω_o, Ω is the unit hemisphere centered at x and oriented along the surface normal n, over which we integrate, L_i is the spectral radiance arriving at x from direction ω_i, f_r(x, ω_i, ω_o) is the bidirectional reflectance distribution function (BRDF) at point x from direction ω_i to direction ω_o, and ω_i · n is the cosine of the angle between ω_i and the surface normal.
The rendering equation is fundamental to all image-synthesis algorithms based on the ray-tracing principle, such as Path-Tracing. Solving the rendering equation is in general computationally expensive. The most common solution methods are based on a numerical estimation of the integral in (1). One of the most commonly used methods for the numerical solution of equation (1) is the Monte Carlo (MC) method. More information about this method can be found in the dissertation thesis of Lafortune [LAF]. MC is also employed in other areas besides computer graphics, e.g. in statistics [GRE].
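The principle can be illustrated with a minimal, hypothetical sketch (not Blender code): a Monte Carlo estimate of a one-dimensional integral is simply the average of the integrand evaluated at N random points.

```cpp
#include <random>

// Monte Carlo estimate of the integral of f over [0, 1]:
// average N evaluations of f at uniformly random points.
double mc_integrate(double (*f)(double), int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += f(u(gen));   // sample the integrand at a random point
    return sum / n;         // the MC estimate of the integral
}

// Example integrand: the integral of x^2 over [0, 1] is 1/3.
double sq(double x) { return x * x; }
```

Cycles applies the same principle to equation (1), except that the integrand is the product of the BRDF and incoming radiance over the hemisphere Ω.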

Quasi-Monte Carlo and Sobol's Sequence
One of the drawbacks of MC is its slow convergence, which is O(1/√N), where N is the number of simulations. For this reason, techniques that speed up the convergence and the efficiency of the whole computation have been developed.
A deterministic form of the Monte Carlo method is called quasi-Monte Carlo (QMC), and its convergence rate is O((log N)^s / N), where s is the dimension of the integral. In order for O((log N)^s / N) to be smaller than O(1/√N), s needs to be small and N needs to be large [LEM]. In the QMC method, pseudo-random numbers are replaced by quasi-random numbers generated by deterministic algorithms. A typical property of such numbers is that they fill the unit square more uniformly. This property is called low discrepancy (LD). More details can be found in [MOR], [NIE].
Let x_1, . . . , x_N be a sequence of points in the s-dimensional unit cube [0, 1]^s; then the approximation is expressed as

∫_[0,1]^s f(u) du ≈ (1/N) ∑_{i=1}^{N} f(x_i).

Well-known types of LD sequences are Halton's, Faure's, Sobol's, Wozniak's, Hammersley's and Niederreiter's sequences. Blender uses Sobol's sequences [JO3], [JO8], since they fit its needs: they run well on the CPU and on accelerators, support high path depth and allow adaptive sampling.
Due to these properties, Sobol's sequences can be used for progressive sampling. Unlike Halton's sequence, which can also be used for progressive sampling, Sobol's sequences are not correlated in higher dimensions and therefore do not need to be scrambled. The algorithm for generating Sobol's sequences is explained in [BRA].
Any number from any dimension can be queried without per-path precomputation. Each dimension has its own sequence, and when rendering the i-th pass, Blender takes element i from the sequence. A sequence is defined by a 32 × 32 binary matrix, and getting the i-th element of the sequence corresponds to multiplying the matrix by i. With binary operations this ends up being quite fast.
These matrices are not as simple to compute, but the data to generate them up to dimension 21201 is available online. Blender currently uses 4 dimensions initially to sample the subpixel location and lens, and 8 numbers per bounce, which limits the maximum path depth to 2649 [BLS].
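The matrix-times-i product over GF(2) described above can be sketched as follows. This is an illustrative stand-alone example, not Blender's actual implementation; the single matrix shown is for the first Sobol dimension (which reduces to the van der Corput sequence in base 2), whereas Blender loads precomputed matrices for up to 21201 dimensions.

```cpp
#include <cstdint>

// The 32x32 binary matrix is stored as 32 columns ("direction numbers");
// multiplying the matrix by i over GF(2) reduces to XOR-ing the columns
// selected by the set bits of i.
static uint32_t sobol_matrix[32];

// Identity-like matrix of dimension 1: v_j = 2^(31-j), i.e. bit reversal.
void init_dim1() {
    for (int j = 0; j < 32; ++j)
        sobol_matrix[j] = 1u << (31 - j);
}

// Return the i-th element of the sequence, mapped into [0, 1).
float sobol_sample(uint32_t i) {
    uint32_t result = 0;
    for (int j = 0; i != 0; ++j, i >>= 1)
        if (i & 1)
            result ^= sobol_matrix[j];       // matrix-vector product over GF(2)
    return result * (1.0f / 4294967296.0f);  // divide by 2^32
}
```

Because any element i in any dimension is computed directly from the matrix, samples can be queried in any order, which is exactly what enables progressive sampling without precomputation.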

Implementation and Parallelization for MIC Accelerator
The implementation presented in this paper is based on the source code of Blender version 2.73 obtained from [BLE]. The code of Blender is compiled using the GNU compiler gcc/4.9.0, and the library running on the MIC coprocessor was compiled using the intel/14.0.1 compiler. In order to enable the newly developed OpenMP and MIC accelerated rendering engines, a new OMP computing device has been added to the GUI settings of Blender.

Parallelization for Multi-core CPUs with Shared Memory
The core of the Cycles engine implements the quasi-Monte Carlo method with Sobol's sequence. The rendered scene is represented as a C++ global variable. If GPU or MIC acceleration is used, this global variable has to be transferred to the accelerator before rendering (solving the rendering equation (1)) is started. The synthesized image of size x_r × y_r is decomposed into tiles of size x_t × y_t (see Fig. 1). The way the rendering algorithm is executed inside a tile differs for each computing device. We compare the performance of the following computing devices: CPU (POSIX threads for CPU only; this is the original computing device used by Blender Cycles), OpenMP (for CPU and MIC; the computing device newly developed by our group), OpenCL (for CPU, MIC and GPU) and CUDA (for GPU only).
Original Implementation. The original computing device from Blender Cycles uses POSIX threads for parallelization. Parallelization is done in the following way: the synthesized image of resolution x_r × y_r is decomposed into tiles of size x_t × y_t. Each tile is then computed by one POSIX thread on one CPU core. The situation is shown in Fig. 1.

Fig. 1. The decomposition of the synthesized image with resolution x_r × y_r into tiles of size x_t × y_t by the original implementation. One tile of x_t × y_t pixels is computed by one POSIX thread on one CPU core. This is an example of CPU16 (see Section 3).
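A minimal sketch of this one-thread-per-tile decomposition, assuming illustrative names (Tile, render_image); the real Cycles code manages tile queues and thread pools differently, and here the "rendering" merely counts pixels.

```cpp
#include <pthread.h>
#include <algorithm>
#include <atomic>
#include <vector>

std::atomic<long> pixels_done{0};   // stands in for writing the output image

struct Tile { int x0, y0, xt, yt; };   // tile origin and size

// One POSIX thread renders one tile (here: just counts its pixels).
static void *render_tile(void *arg) {
    Tile *t = static_cast<Tile *>(arg);
    pixels_done += (long)t->xt * t->yt;
    return nullptr;
}

// Decompose the xr x yr image into xt x yt tiles, one pthread per tile.
void render_image(int xr, int yr, int xt, int yt) {
    std::vector<Tile> tiles;
    for (int y = 0; y < yr; y += yt)
        for (int x = 0; x < xr; x += xt)
            tiles.push_back({x, y, std::min(xt, xr - x), std::min(yt, yr - y)});

    std::vector<pthread_t> threads(tiles.size());
    for (size_t i = 0; i < tiles.size(); ++i)
        pthread_create(&threads[i], nullptr, render_tile, &tiles[i]);
    for (pthread_t &th : threads)
        pthread_join(th, nullptr);
}
```

Note that with this scheme the degree of parallelism is bounded by the number of tiles, which is why the tile-size choice discussed in Section 3 matters.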

OpenMP Parallelization of Intra Tile Rendering for CPU and MIC Architecture
The newly proposed OpenMP parallelization is implemented in the OMP computing device. A hierarchical approach is used where each POSIX thread forked by the Cycles renderer is further parallelized using OpenMP threads. In order to provide enough work for each core of the MIC coprocessor, and to fully utilize the CPU hardware, we need to decompose the larger tile x_t × y_t into smaller sub-tiles x_o × y_o (this is an example of OMP15MIC or OMP8CPU2MIC, see Section 3). OpenMP parallelizes the loop over the sub-tiles of a tile in such a way that a single OpenMP thread processes a single sub-tile. An example is shown in Fig. 2.

Fig. 2. The decomposition of the synthesized image with resolution x_r × y_r into tiles of size x_t × y_t and further into sub-tiles of size x_o × y_o by the OpenMP implementation. One tile of x_t × y_t pixels is computed by eight OpenMP threads using eight cores; each sub-tile x_o × y_o is computed by one OpenMP thread. This is an example of OMP8CPU2 (see Section 3).

The computation time of each sub-tile differs. To achieve effective load balancing we adjust the workload distribution by setting the OpenMP runtime scheduler to schedule(dynamic, 1), and the values of x_o and y_o have to be set to a small number (e.g., x_o = 32 and y_o = 32). This setup produces the most efficient work distribution among processor cores and therefore minimizes processing time.
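The intra-tile loop can be sketched as follows. All names are illustrative stand-ins (render_subtile here only counts pixels), but the schedule(dynamic, 1) clause is the one described above: each idle OpenMP thread grabs the next unprocessed sub-tile.

```cpp
#include <algorithm>
#include <atomic>

std::atomic<long> subtile_pixels{0};

// Stand-in for path-tracing the w x h pixels of one sub-tile.
static void render_subtile(int x0, int y0, int w, int h) {
    subtile_pixels += (long)w * h;
}

// Split the xt x yt tile into xo x yo sub-tiles; one OpenMP thread
// processes one sub-tile at a time, dynamically scheduled.
void render_tile_omp(int xt, int yt, int xo, int yo) {
    int nx = (xt + xo - 1) / xo;   // sub-tile grid width
    int ny = (yt + yo - 1) / yo;   // sub-tile grid height
    #pragma omp parallel for schedule(dynamic, 1)
    for (int s = 0; s < nx * ny; ++s) {
        int sx = (s % nx) * xo;
        int sy = (s / nx) * yo;
        render_subtile(sx, sy, std::min(xo, xt - sx), std::min(yo, yt - sy));
    }
}
```

With x_o = y_o = 32 a 1024 × 1024 tile yields 1024 sub-tiles, which is enough work items to keep all 244 hardware threads of a Xeon Phi busy.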
As of now, POSIX threads are used to control the tiles, but in future work we would like to use MPI processing for computer clusters, so that a single image can be processed by multiple computers or computing nodes of an HPC cluster. In this scenario the POSIX threads will be substituted by MPI processes.
The acceleration of the rendering process using Intel Xeon Phi accelerators is built on a basis similar to the approach for multi-core CPUs. In this case one POSIX thread is used to control a single MIC accelerator. This means that one accelerator works on a single tile and multiple accelerators can be used in parallel to render multiple tiles.
The architecture of the Intel Xeon Phi is significantly different from that of the regular Xeon processors. It is equipped with 61 processing cores, where each core can run up to 4 threads, which gives 244 threads in total. This means that in order to fully utilize the hardware and to provide enough work for each core to enable load balancing, each tile has to be significantly larger than in the case of CPU processing (see Fig. 3). In addition, the computation of each pixel takes a different time. To enable load balancing using the OpenMP runtime engine we have to set the scheduler to schedule(dynamic, 1).
Using a coprocessor with separate memory also requires data transfer between the CPU memory and the memory of the coprocessor. Blender uses a complex C++ structure that represents the entire scene (KernelGlobals); in order to transfer the data, it has to be retyped to a binary array (mic_alloc_kg()). When the computation ends, the allocated memory on the coprocessor is freed (mic_free_kg()).
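A hedged sketch of this transfer, with an illustrative struct layout and helper names (the real KernelGlobals definition is far more complex). The offload pragma follows Intel's Language Extensions for Offload and is simply ignored by other compilers, in which case the block runs on the host.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct KernelGlobals {   // stand-in for Blender's scene structure
    float cam[16];
    int num_triangles;
};

// "Retype" the scene structure into a plain byte array so it can be
// moved to the coprocessor with a bitwise copy.
std::vector<uint8_t> flatten(const KernelGlobals &kg) {
    std::vector<uint8_t> buf(sizeof(KernelGlobals));
    std::memcpy(buf.data(), &kg, buf.size());
    return buf;
}

void mic_alloc_kg(const KernelGlobals &kg) {
    std::vector<uint8_t> buf = flatten(kg);
    uint8_t *data = buf.data();
    size_t n = buf.size();
    // With the Intel compiler this clause copies the array to MIC device 0;
    // other compilers warn about an unknown pragma and run this on the host.
    #pragma offload target(mic:0) in(data : length(n))
    {
        (void)data; (void)n;   // kernel launch would go here
    }
}
```

The matching mic_free_kg() would release the device-side buffer once rendering of the tile finishes.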

Parallelization by OpenCL and CUDA
The OpenCL computing device can be used for multi-core CPUs as well as for MIC or GPU accelerators. The scene decomposition is similar to the OpenMP processing for MIC. Only one POSIX thread with a large tile is used for optimal performance, due to the previously discussed reasons. The parallelization is based on task parallelism, where a separate task is created for computing each pixel (see Fig. 4). The original code had to be modified in order to run on Intel Xeon Phi devices. Unfortunately, rendering with textures did not work on the MIC coprocessor, so shading and advanced shading had to be disabled. We do not intend to fix this malfunction, because OpenCL is not a target platform for us (OpenCL is limited in the same way as CUDA and is hard to use for development and optimization).
As the OpenCL and CUDA programming models are very similar, so is the decomposition. There is again one POSIX thread per accelerator for the main computation. On the GPU, the rendering kernel uses a single CUDA thread to render a single pixel of a tile. The GPU needs thousands of threads for good performance; this is the reason why we need a large tile (see Section 3).
The GPU acceleration is part of the original render engine. Compared to our proposed approach it has limited functionality: the maximum number of individual textures is limited, Open Shading Language (OSL) is not supported, smoke/fire rendering is not supported, the GPU has smaller memory than the MIC, and the GPU does not support cooperative rendering with the CPU. We need all the features of the CPU implementation; that is the reason why combining the CPU and GPU is not useful for us.

Benchmark
In this paper two scenes are used for all benchmarks: one scene with and one without textures. The first scene, with a Tatra T87, has 1.2 million triangles and uses HDRI lighting (see Fig. 5). The other scene, with bones, was generated from CT images and has 7.7 million triangles (see Fig. 5). It does not use textures. This scene is used to evaluate the performance of the OpenCL implementation for Xeon Phi, since that implementation does not support textures.
The benchmark was run on a single computing node of the Anselm supercomputer equipped with two Intel Sandy Bridge E5-2470 CPUs (used by the CPUn and OMPn engines, where n is the number of cores used) and one Intel Xeon Phi 5110P coprocessor (MIC) or one NVIDIA Tesla K20m accelerator (GPU).
In the next test we exploited the CPU-MIC hybrid system (OMP15MIC). Two large tiles are computed at the same time (2 POSIX threads were created, one for the CPU and one for the MIC). The first tile is computed by the CPU using 15 OpenMP threads and the other tile is computed by the MIC coprocessor (1 core is used to manage the MIC); see the results in Tables 1 and 4.
Another test we performed was a combination of OMP8 and CPU2. This means that the computation of two tiles was parallelized (2 POSIX threads were created) and each tile was computed by 8 OpenMP threads; see the results in Tables 1 and 4. We also combined CPU2 (2 tiles, each with one POSIX thread), OMP8 (each POSIX thread spawns 8 OpenMP threads) and MIC; see the results in Tables 1 and 4. This combination is designed for systems with multiple MIC accelerators (this will be the architecture of Salomon, see Section 5).
The division of the image into tiles has to be made carefully. A large runtime can be observed when CPU16 is used with a tile size of 1024×1024: in this case 12 of the 16 cores are idle (see Tables 1 and 4).

Benchmark 1: Tatra T87
For this scene a resolution of 2048 × 2048 px was used. First we compared the computation for different tile sizes at the same resolution (see Table 1). In the next two tests the resolution and the number of samples were changed (see Tables 2 and 3).

Benchmark 2: Bones
The original OpenCL implementation does not work on Intel Xeon Phi: shading and advanced shading have to be disabled. For this reason we created a new scene just for OpenCL testing. Table 4 compares runtimes (in minutes) for different numbers of tiles. Table 5 shows the results obtained using OpenCL.

Conclusion
Both benchmarks show that the best performance is obtained when the combination of CPU and MIC coprocessor (OMP15MIC) is used. We can see from the results that the MIC coprocessor behaves like a six-core CPU. On the other hand, the MIC accelerator adds only a 1.37× speedup over the CPU implementation, which is less than expected; we would expect at least the same performance boost as in the case of GPU accelerators. The reason why the Intel MIC does not provide the expected performance boost is insufficient vectorization in the code that evaluates the rendering equation (the 512-bit vector registers, each able to hold 16 floats, are currently wasted). We also show that the combination of full CPU utilization and MIC (OMP15MIC) has advantages over GPU parallelization:
- GPU parallelization does not use all CPU cores.
- The GPU does not offer all features of our MIC implementation, which has a feature set identical to the original CPU implementation.
- The combination of 2 POSIX threads, each running on one socket, and 2 MIC accelerators is designed for systems with multiple MIC accelerators (this will be the architecture of Salomon, see Section 5).

Future Work
In future work we will focus on vectorization of the code to improve its performance on Intel Xeon Phi devices. We will also modify the existing implementation of the Path-Tracing method using MPI technology. Our new benchmarks will target the new Salomon supercomputer, which will be equipped with two Intel Xeon Phi coprocessors per node for better performance.