Optimizing OpenCL Implementation of Deep Convolutional Neural Network on FPGA



Introduction
Deep Convolutional Neural Networks (CNNs) bring significant accuracy gains to many fields of computer vision, such as image classification, object detection, and object tracking. While deeper structures produce higher accuracy, they demand more computing resources than today's CPUs can provide. As a result, graphics processing units (GPUs) have become the mainstream platform for implementing CNNs [1]. However, GPUs are power-hungry and inefficient in their use of computational resources. For example, a CNN implementation based on the vendor-recommended cuDNN library can only achieve about 1/3 of the peak performance of the GPU [1]. Hardware accelerators offer an alternative path towards a significant boost in both performance and energy efficiency.
Usually, hardware accelerators are based on ASICs [2] or FPGAs [3,4]. ASIC-based accelerators provide the highest performance and energy efficiency but incur huge development costs. Owing to their reconfigurable nature, FPGA-based accelerators are more economical in terms of development expense.
For years, FPGA developers have suffered from hard-to-use RTL (Register Transfer Level) programming languages such as VHDL and Verilog HDL, which make programmability a major issue for FPGAs. Thus, FPGA vendors have begun to provide high-level synthesis (HLS) tools such as the OpenCL framework [5] to enable programming FPGAs in high-level languages.
Although developers can easily port code originally designed for CPUs/GPUs to FPGAs with the OpenCL framework, it is still challenging to make OpenCL code execute efficiently on FPGAs. The same code may exhibit different performance on different platforms due to architecture-dependent execution behavior. Therefore, developers should take the FPGA architecture into account when optimizing OpenCL code.
In this paper, we investigate in depth how to optimize OpenCL code on FPGA platforms. We propose a CNN accelerator implemented in OpenCL that achieves state-of-the-art performance and performance density.
The key contributions of our work are summarized as follows:
- We make a detailed analysis of running a CPU/GPU-oriented OpenCL implementation of CNN on an FPGA. We explore the memory access behavior of the OpenCL FPGA implementation and point out the bottleneck of the code.
- Based on the analysis results, we propose an optimized OpenCL implementation of a CNN accelerator, focusing on efficient external memory access.
- We implement our design on an Altera Stratix V FPGA. A performance of 137 Gop/s is achieved using a 16-bit fixed-point data type, a speed-up of 4.76x over the original version. To the best of our knowledge, this implementation outperforms most previous OpenCL-based FPGA CNN accelerators.
The rest of this paper is organized as follows: Section 2 discusses the background of CNNs and the OpenCL framework for FPGAs. Section 3 presents the performance analysis of the baseline code. The implementation details of our optimized design are presented in Section 4. Section 5 provides the experimental results and Section 6 concludes the paper.

Convolutional Neural Network
A CNN is a trainable architecture inspired by research findings in neuroscience. As Fig. 1 shows, a typical CNN structure consists of several feature extraction stages followed by classifier layers. Usually, the convolutional layers generate more than 90% of the computational workload of a CNN model [6]. For a convolutional layer, Q input feature maps X_0...X_{Q-1} are convolved with R*Q convolutional kernels K_{r,q} (r=0,1...R-1, q=0,1...Q-1) to get R output feature maps Y_0...Y_{R-1}. Equations (1) and (2) show the procedure:

Y_r = bias_r + sum_{q=0}^{Q-1} conv<X_q, K_{r,q}>   (1)
N_out = (N_in - ks)/s + 1   (2)

bias is a value added to each pixel of Y_r. conv<X_q, K_{r,q}> denotes the convolution between input feature map X_q and convolutional kernel K_{r,q}. ks is the size of the convolutional kernel and s denotes the stride by which the convolutional window slides each time. Due to the length limit, in this paper we focus only on the convolutional layers.
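The computation of a convolutional layer described above can be sketched as a plain-Python reference implementation (illustrative only, not the accelerator code; the names Q, R, ks, and s follow the text):

```python
def conv_layer(X, K, bias, ks, s):
    """X: Q input maps of size Nin x Nin; K: R x Q kernels of size ks x ks."""
    Q, R = len(X), len(K)
    n_in = len(X[0])
    n_out = (n_in - ks) // s + 1                      # output map size, Eq. (2)
    # initialize every output pixel with the bias, as in Eq. (1)
    Y = [[[bias] * n_out for _ in range(n_out)] for _ in range(R)]
    for r in range(R):
        for q in range(Q):                            # accumulate conv<Xq, K[r][q]>
            for i in range(n_out):
                for j in range(n_out):
                    acc = 0
                    for u in range(ks):
                        for v in range(ks):
                            acc += X[q][i * s + u][j * s + v] * K[r][q][u][v]
                    Y[r][i][j] += acc
    return Y
```

For example, a single 3x3 input map of ones convolved with a single 2x2 kernel of ones (stride 1, zero bias) yields a 2x2 output map in which every pixel equals 4.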

OpenCL Framework for FPGAs
OpenCL is an open standard for cross-platform parallel programming. As a high-level synthesis (HLS) tool, the OpenCL framework for FPGA enables synthesizing designs described in a C-like language, which greatly improves the development productivity of FPGAs. OpenCL designs for CPUs/GPUs can also be ported to FPGAs with little effort. OpenCL for FPGA liberates developers from the burden of complicated peripheral circuit design (e.g. PCIe, DDR, SerDes). The details of the peripheral circuits are transparent to developers, so they can concentrate on designing the kernel logic.
The hardware infrastructure of the OpenCL framework consists of two parts: an FPGA accelerator and a host computer. The OpenCL logic in the FPGA accelerator exists as an SoC. It consists of at least a global memory controller, a link controller to the host computer, and a reconfigurable fabric. Developers use HLS tools to synthesize OpenCL code into kernel logic and program it into the reconfigurable fabric. The host computer communicates with the FPGA accelerator through the host-accelerator link. In a common workflow, the host computer first offloads the data to the global memory of the FPGA accelerator, then starts the kernel logic to process these data, and finally retrieves the result. In this paper, we use an Altera FPGA development kit to build our CNN accelerator. In particular, the global memory controller is a DDR3 controller, the link controller is a PCIe controller, and the host computer is an x86 desktop PC.
However, making OpenCL code execute efficiently on an FPGA is not easy. It requires awareness of many details of the OpenCL framework for FPGA.

Performance analysis of the baseline CNN OpenCL implementation
In this section, we start with a CPU/GPU-oriented OpenCL implementation of CNN provided by AMD Research [7], which we consider as the baseline version.
For convolutional layers, the baseline code first converts the convolutions into a matrix multiplication to utilize the efficient BLAS (Basic Linear Algebra Subprograms) libraries available for CPUs/GPUs. As the computational workload of fully connected layers is also a matrix multiplication, this method simplifies the accelerator design.
Fig. 2 shows the procedure of converting the convolutions into a matrix multiplication. Since a two-dimensional matrix is physically stored as an array, both the convolutional kernels and the output feature maps keep their data structures unchanged; only the input feature maps need to be reorganized into the Map matrix. Thus the computation of a convolutional layer can be divided into two parts: reorganizing the input feature maps into the Map matrix and calculating the matrix multiplication.
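The reorganization step (often called im2col) can be sketched as follows for a single input map. This is an illustrative host-side model, not the paper's kernel: each column of the resulting Map matrix holds the ks*ks pixels of one convolutional window, so the convolution becomes a product of the kernel matrix and the Map matrix.

```python
def build_map_matrix(X, ks, s):
    """Reorganize one input feature map X into the Map matrix."""
    n_in = len(X)
    n_out = (n_in - ks) // s + 1
    cols = []
    for i in range(n_out):
        for j in range(n_out):
            # flatten one sliding window into a column vector of ks*ks pixels
            col = [X[i * s + u][j * s + v] for u in range(ks) for v in range(ks)]
            cols.append(col)
    # transpose: rows = window-pixel index (ks*ks), columns = window positions
    return [list(row) for row in zip(*cols)]
```

For a 3x3 map with ks = 2 and s = 1, the Map matrix has 4 rows (window pixels) and 4 columns (window positions), and its first column is the first window flattened.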
The first step involves no arithmetic operations; the main factor determining its execution time is memory access. As Fig. 3 shows, the baseline version divides the Map matrix into several column vectors, each of which consists of the ks^2 pixels in one convolutional window. Each work item loads these ks^2 pixels from an input feature map and stores them back to the Map matrix as a column vector. The memory access of each work item to the input feature maps thus follows a sliding-window pattern, while the access to the Map matrix is column-wise. For a DDR3 system, the time consumed by memory access is determined by the number of memory transactions and the physical bandwidth of the external memory. Each memory transaction consists of two phases: a prepare phase and a burst phase. The burst phase consists of several memory transmits. Assuming that the kernel issues M transactions and each transaction has a burst phase with N transmits (burst length = N), the memory access time of a kernel is:

T_mem = sum_{j=1}^{M} T_transaction_j   (3)
T_transaction_j = T_prepare + T_burst_j   (4)
T_burst_j = sum_{i=1}^{N} T_transmit_ij   (5)

The transmits in the same burst access memory at continuous physical addresses. Moreover, if the memory access addresses are continuous, multiple transmits can be coalesced into a single wide transmit, and multiple bursts can be coalesced into one longer burst. By coalescing transmits and bursts, we reduce the proportion of the prepare phase in the total memory access time and improve the utilization of the memory bandwidth. Thus, continuous memory access is essential for good performance.
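A small arithmetic sketch of this cost model makes the benefit of coalescing concrete. The timing constants below are hypothetical, chosen only to illustrate the trend that the per-transaction prepare phase dominates when bursts are short:

```python
def access_time(num_transactions, transmits_per_burst, t_prepare, t_transmit):
    """Total memory access time per Eqs. (3)-(5), with uniform transactions."""
    t_burst = transmits_per_burst * t_transmit
    return num_transactions * (t_prepare + t_burst)

# Moving 64 transmits: 64 short transactions vs. one coalesced 64-transmit burst.
uncoalesced = access_time(64, 1, t_prepare=10, t_transmit=1)   # 64 * 11 = 704
coalesced = access_time(1, 64, t_prepare=10, t_transmit=1)     # 10 + 64 = 74
```

With these (assumed) constants, coalescing shrinks the total access time by almost 10x, because the prepare phase is paid once instead of 64 times.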
From Fig. 3, we can see that for a single work item, the longest continuous memory access to the input feature maps has length ks. This length is too short to dominate T_transaction; instead, T_prepare accounts for most of the memory access time. For the output feature maps, there is no address-continuous memory access at all. Thus a single work item cannot make good use of the memory bandwidth. Furthermore, there are overlapping and adjacent data accesses between adjacent work items. On a CPU/GPU platform, such data locality can be exploited by the cache hierarchy, and multiple memory accesses can be merged. However, in a typical FPGA system there is no ready-made cache. The OpenCL compiler can automatically coalesce address-continuous memory accesses from adjacent work items if the addresses are regular; to make things worse, the address computation in the baseline version is so complicated that today's HLS tools cannot recognize the data locality between adjacent work items. For these reasons, the memory access must be optimized.
The second step is a matrix multiplication. Most CPU/GPU implementations of matrix multiplication use a BLAS library to achieve efficient execution. However, no BLAS library exists for FPGAs using OpenCL so far. Therefore, we adopt the efficient FPGA OpenCL implementation from [8] to construct the baseline version. This implementation is well optimized for general-purpose matrix multiplication. However, its data precision is 32-bit floating point, which is redundant for CNNs [6]. In an FPGA-specific design, lower data precision can save DSP and logic resources.

Optimizing the OpenCL design of CNN accelerator on FPGA
In the last section, we discussed the baseline version and pointed out that its memory access pattern causes low utilization of the memory bandwidth. In this section, we analyze the characteristics of the algorithm and optimize the OpenCL implementation. Fig. 4 shows the data locality in reorganizing the input feature maps into the Map matrix and the task organization of the work items. The coloured part of the Map matrix corresponds to ks rows of the input feature maps. Obviously, these rows are stored continuously in external memory. Rather than accessing these elements repeatedly, a better choice is to prefetch them into an on-chip buffer in an address-continuous manner.
An optimized method is also shown in Fig. 4. Each work group first reads a whole row of an input feature map and then writes it out in the format of the Map matrix. For the prefetch part, the longest continuous memory access to the input feature maps within a work group is the width of an input map (N_in), much longer than in the baseline version. For the write-back part, the memory accesses of adjacent work items are address-continuous. The start addresses of the rows and blocks can be computed as values shared among the work items of a work group. We use the local id as the address offset so that the compiler can easily recognize the locality, and the memory accesses of adjacent work items are coalesced automatically.
One pixel in the input feature maps is related to at most ks^2 elements of the Map matrix, so the efficiency of the write-back dominates the memory access performance. We notice that in the optimized method above (OPT1 for short), the maximum length of continuous memory access to the Map matrix is N_out (the width of the output feature maps). Generally speaking, the map sizes of the first several layers are large, so the burst length is long enough to make full use of the memory bandwidth. But in the last several layers the map sizes are small, and the expense of the prepare phase becomes prominent. We can further optimize the kernel (OPT2 for short) by increasing the length of the continuous memory accesses.
Fig. 5 shows this further optimization of the memory access. The Map matrix is transposed so that the pixels in a convolutional window are address-continuous. The convolutional windows at the same location in adjacent channels are also address-continuous. Thus we can merge ks^2 * Mc pixels into one memory transaction, where Mc is the number of adjacent channels that can be merged. The transposed Map matrix is divided into a two-dimensional grid, and Q*(M_out/Mc) work items are used for the reorganizing task. Each work item first loads ks rows from each of its Mc input feature maps into the on-chip buffer. In the example, Mc is 2 and ks is 3. Then each work item writes ks^2 * Mc * N_out pixels into the transposed Map matrix. In Fig. 5, the pixels in one dotted box are processed by one work item WI(y, x) (x = 0, 1, ..., M_out - 1; y = 0, 1, ..., Q/Mc - 1).
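The OPT2 layout can be modelled in plain Python as follows (an illustrative sketch under the assumptions in the text, not the kernel itself): each row of the transposed Map matrix holds one window's ks*ks pixels contiguously, and windows at the same location in adjacent channels are stored back-to-back, which is what allows ks^2 * Mc pixels to be merged into one transaction.

```python
def build_map_matrix_T(X_channels, ks, s):
    """Transposed Map matrix: one row per (window position, channel) pair."""
    n_in = len(X_channels[0])
    n_out = (n_in - ks) // s + 1
    rows = []
    for i in range(n_out):
        for j in range(n_out):
            for X in X_channels:          # adjacent channels stored back-to-back
                rows.append([X[i * s + u][j * s + v]
                             for u in range(ks) for v in range(ks)])
    return rows
```

For two 2x2 channels with ks = 2 and s = 1 there is a single window position, and the result is simply each channel's window flattened into one contiguous row.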
As for the input data, the total data size read by OPT2 is ks times that of OPT1. However, the kernel performance does not degrade heavily, for two reasons. First, the write-back part occupies most of the memory access time, so although we prefetch more data, the overhead remains negligible. Second, the prefetched data are address-continuous, so the memory transactions can be coalesced. Thus the impact of the larger data size is not significant.
For the output data, the maximum length of continuous memory access is Mc*ks^2, while the total data size of OPT2 is the same as that of OPT1. Thus, when Mc*ks^2 is larger than N_out, the kernel performance is improved.
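This criterion is easy to check numerically. The layer widths below are illustrative VGG-16-like values (ks = 3 throughout that model), not measured figures from our experiments:

```python
def prefer_opt2(mc, ks, n_out):
    """OPT2's burst (Mc*ks^2) beats OPT1's burst (N_out) when it is longer."""
    return mc * ks * ks > n_out

early_layer = prefer_opt2(2, 3, 224)   # wide maps: 18 < 224, OPT1 wins
late_layer = prefer_opt2(4, 3, 14)     # narrow maps: 36 > 14, OPT2 wins
```

This matches the observation above: OPT1 suits the first several layers and OPT2 the last several.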
Since the baseline version of the matrix multiplication part is already FPGA-oriented code, only minor modifications are needed to adapt it to our design. The matrix block size of the baseline version is fixed, so we add some branch statements to enable variable matrix block sizes. For the input data, out-of-bounds elements are padded with zeros when written to the on-chip buffers. For the output data, before each work item writes back its corresponding element, it checks whether the location is out of bounds; out-of-bounds elements are not written back.
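The boundary handling can be sketched as follows (a minimal host-side model with hypothetical helper names, not the kernel's branch statements): loads pad out-of-bounds elements with zeros, and stores skip out-of-bounds locations.

```python
def load_block(M, row0, col0, bsize):
    """Read a bsize x bsize block starting at (row0, col0), zero-padding."""
    rows, cols = len(M), len(M[0])
    return [[M[r][c] if r < rows and c < cols else 0   # zero-pad out of bounds
             for c in range(col0, col0 + bsize)]
            for r in range(row0, row0 + bsize)]

def store_block(M, block, row0, col0):
    """Write a block back, skipping out-of-bounds locations."""
    rows, cols = len(M), len(M[0])
    for i, brow in enumerate(block):
        for j, v in enumerate(brow):
            if row0 + i < rows and col0 + j < cols:    # skip out of bounds
                M[row0 + i][col0 + j] = v
```

A 2x2 block loaded from the bottom-right corner of a 3x3 matrix contains one real element and three zeros, and storing the result back touches only the one in-bounds location.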
Since OPT2 generates the transposed Map matrix, we modify the matrix multiplication kernel from C = A * B to C = A * B^T. Each work item could directly read the corresponding element from the transposed matrix block, but then the addresses would not be continuous. We therefore change the order of reads from external memory to keep the addresses continuous.
The baseline version adopts a 32-bit floating-point data format. Research has shown that such high precision is redundant for the forward propagation of CNNs [4,9]. We therefore modify the optimized versions to use 16-bit fixed point to increase the accelerator performance.
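A 16-bit fixed-point scheme can be sketched as follows. The Q8.8 split (8 integer bits, 8 fractional bits) is assumed here purely for illustration; the paper does not state its exact scaling:

```python
FRAC_BITS = 8                               # assumed Q8.8 format

def to_fixed16(x):
    """Quantize a float to 16-bit fixed point, saturating to the int16 range."""
    v = int(round(x * (1 << FRAC_BITS)))
    return max(-32768, min(32767, v))

def fixed_mul(a, b):
    """Multiply two fixed-point values, rescaling the doubled fraction bits."""
    return (a * b) >> FRAC_BITS

def to_float(v):
    return v / (1 << FRAC_BITS)
```

For example, 1.5 * 2.0 computed in this format recovers exactly 3.0, since both operands are representable without rounding error.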
The fully connected layers can be processed by the matrix multiplication kernel independently. Although our implementation can handle matrix blocks of different sizes, using batched images achieves higher performance. In the fully connected layers, the main computation for a single input image is a matrix-vector multiplication. The computation-to-memory-access ratio of a matrix-vector multiplication is low, because every element of the parameter matrix participates in only one multiply-accumulate operation. The accelerator thus needs high external memory bandwidth to read the parameter matrix. To achieve higher effective bandwidth, the fully connected layers of multiple images can be merged by combining a batch of matrix-vector multiplications into one matrix-matrix multiplication. Every element of the parameter matrix is then used with batch-size elements of the input matrix, so the computation-to-memory-access ratio increases with the batch size.
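The batching argument above reduces to simple arithmetic. For an n x m parameter matrix and batch size B, the ratio of multiply-accumulates to parameter reads grows linearly with B (a sketch of the reasoning, with an illustrative 4096x4096 layer size):

```python
def fc_ratio(n, m, batch):
    """Compute-to-parameter-read ratio of a batched fully connected layer."""
    macs = n * m * batch        # each parameter is multiplied with `batch` inputs
    param_reads = n * m         # each parameter is read from memory once
    return macs / param_reads

single = fc_ratio(4096, 4096, 1)     # 1 MAC per parameter read
batched = fc_ratio(4096, 4096, 16)   # 16 MACs per parameter read
```

Doubling the batch size doubles the arithmetic done per byte of parameters fetched, which is why batching relieves the external-bandwidth bottleneck.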

System Evaluation
We build a prototype of our design on a DE5-Net development board, whose main chip is an Altera Stratix V A7 FPGA. The working frequency of our kernel logic is 185 MHz. We implement our design using the Altera OpenCL SDK 14.1. The widely used VGG-16 model is chosen as the benchmark. Fig. 6(a) shows the performance of the reorganizing part. The vertical axis shows the consumed time and the horizontal axis the different layers; bars in three different colours represent the three versions. The results show that the performance of the optimized code improves dramatically: both optimized versions greatly reduce the consumed time. OPT1 performs better in the first several layers and OPT2 performs better in the last several layers, confirming our discussion in Section 4. Fig. 6(b) shows the performance of the convolutional layers. The optimized versions remain much faster than the baseline version after the matrix multiplication part is added. Fig. 7 shows the performance under different data precisions: compared with the 32-bit floating-point data type, the 16-bit fixed-point data type nearly doubles the performance. This is because a 16-bit fixed-point multiply-add operation needs only 1 DSP in Altera's FPGAs, while a 32-bit floating-point multiply-add operation needs 2 DSPs; the 16-bit fixed-point computation engine also needs less on-chip memory. Compared with the baseline version, the optimized 16-bit fixed-point version achieves a speed-up of 4.76x.
We also compare our implementation with other state-of-the-art OpenCL-based FPGA CNN accelerators. Table 1 shows the on-chip resource utilization of our implementation (OPT2). To unify the performance metric across different FPGA devices, we use performance density, measured in Gops/DSP. As shown in Table 2, our implementation achieves the highest performance among recent works using the same device (Stratix V A7).
[9] provides a throughput-optimized OpenCL implementation. To the best of our knowledge, it is the first OpenCL implementation of an entire CNN model on an FPGA. However, there is still a big gap between its real performance and the peak performance of the FPGA used. Compared with [9], we present a better memory access design and achieve a 2.88x speed-up on the same device.
[10] presents a CNN framework using GPU- and FPGA-based accelerators. It puts more effort into compatibility, while we focus on FPGA performance; compared with [10], our design achieves a speed-up of 5.35x. [11] uses SystemVerilog to implement a CNN accelerator and packages it into an OpenCL IP library. That work achieves very high performance and a better performance density than ours. However, it is in fact an RTL design. Compared with HLS designs, an RTL design can exploit more hardware details and thus achieve higher frequency and efficiency more easily, whereas the major benefits of an OpenCL design are better reusability and a shorter development cycle.

Conclusion
In this paper, we have proposed an optimized CNN accelerator design using OpenCL for FPGA. By analyzing an OpenCL implementation designed for CPUs/GPUs, we find that the bottleneck is external memory access, because the memory system of an FPGA differs greatly from that of a CPU/GPU. We then optimize the CNN design, rearranging the data and coalescing the memory accesses for better use of the external memory bandwidth. A prototype system is built. Compared with the baseline version, a performance speed-up of 4.76x is achieved. Our implementation on the Altera Stratix V device achieves a throughput of 137 Gop/s at a 185 MHz working frequency, outperforming most prior OpenCL-on-FPGA work.

Fig. 2. The procedure of converting convolutions to matrix multiplications
Fig. 3.

Table 1 .
Critical resource utilization rate in one chip

Table 2 .
Comparison between our design and existing FPGA-OpenCL based CNN accelerators