Current situation of deep learning heterogeneous computing
With the rapid growth of Internet users and the explosion of data volume, the demand for computing in data centers keeps rising. At the same time, with the rise of compute-intensive fields such as artificial intelligence, high-performance data analysis and financial analysis, the demand for computing power has far exceeded the capability of traditional CPU processors.
Heterogeneous computing is considered the key technology for closing this computing gap at the current stage. "CPU + GPU" and "CPU + FPGA" are the heterogeneous computing platforms the industry pays the most attention to; they offer higher efficiency and lower latency than traditional CPU parallel computing. Facing such a huge market, many technology companies have invested heavily in funding and manpower, the development standards for heterogeneous programming are gradually maturing, and mainstream cloud service providers are actively deploying these platforms.
The industry can see that giant companies such as Microsoft have deployed FPGAs at scale to accelerate AI inference. What advantages do FPGAs have over other devices?
Flexibility: programmability naturally adapts to rapidly evolving ML algorithms
DNN, CNN, LSTM, MLP, reinforcement learning, decision trees, etc.
Dynamic support for arbitrary precision
Model compression and sparse networks, for faster and better networks
Performance: builds real-time AI service capability
Lower prediction latency than GPU / CPU
Better performance per watt than GPU / CPU
Scale
High-speed interconnect IO between boards
Intel CPU + FPGA architecture
At the same time, the shortcomings of FPGA are also very obvious. FPGA development uses HDL hardware description languages, which means long development cycles and a high entry threshold. Taking classic models such as AlexNet and GoogLeNet as examples, customizing acceleration for a single model often takes several months. The business side and the FPGA acceleration team have to weigh algorithm iteration against adapting the model to FPGA hardware acceleration, which is very painful.
On the one hand, we need the FPGA to provide low-latency, high-performance service that is competitive with CPU / GPU; on the other hand, the FPGA development cycle must keep up with the iteration cycle of deep learning algorithms. Based on these two points, we designed and developed a general CNN accelerator. The design covers the operators of mainstream models in a general way, and a compiler generates the instructions that drive model acceleration, so a model can be switched in a short time. For emerging deep learning algorithms, the relevant operators are rapidly developed and iterated on top of this general base version, reducing the model acceleration development time from a few months to one to two weeks.
The overall framework of the FPGA-based general CNN accelerator is as follows. A CNN model trained with Caffe / TensorFlow / MXNet or other frameworks goes through a series of compiler optimizations to generate the instructions corresponding to the model. At the same time, the image data and model weights are preprocessed and compressed according to the optimization rules, and then distributed to the FPGA accelerator over PCIe. The FPGA accelerator works entirely according to the instruction set in the instruction buffer: executing the full instruction buffer once completes the accelerated computation of one image through the deep model. Each functional module is relatively independent and only responsible for its own computing requests. The accelerator is decoupled from the deep learning model; the data dependencies and execution order of each layer are controlled by the instruction set.
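To make this division of labor concrete, here is a minimal host-side sketch of the flow just described. All names (`compile_model`, `FpgaAccelerator`) are hypothetical stand-ins, not the real toolchain or SDK, and the PCIe transfers are only indicated in comments.

```python
# Hypothetical host-side flow for the compiler + FPGA accelerator pipeline
# described above. Class and function names are illustrative, not the real SDK.

import numpy as np


def compile_model(model_path):
    """Parse a trained Caffe/TensorFlow/MXNet model and emit an instruction
    list plus preprocessed (reordered / compressed) weights."""
    instructions = []          # one entry per (possibly fused) layer operation
    packed_weights = b""       # weights laid out for the on-chip memory format
    # ... model parsing and optimization passes would run here ...
    return instructions, packed_weights


class FpgaAccelerator:
    """Thin wrapper around the PCIe transfers to the accelerator card."""

    def load_model(self, instructions, packed_weights):
        # Write the instruction buffer and weight regions over PCIe once;
        # switching models later only repeats this step.
        self.instructions = instructions
        self.weights = packed_weights

    def infer(self, image: np.ndarray) -> np.ndarray:
        # 1) DMA the preprocessed image into the input buffer on the card.
        # 2) Kick off execution; the accelerator walks the instruction buffer
        #    once per image, each functional module serving only the requests
        #    the instructions issue to it.
        # 3) DMA the result back to the host.
        return np.zeros(1000, dtype=np.float32)   # placeholder result


if __name__ == "__main__":
    insts, weights = compile_model("googlenet_v1.caffemodel")
    accel = FpgaAccelerator()
    accel.load_model(insts, weights)
    result = accel.infer(np.zeros((3, 224, 224), dtype=np.uint8))
```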
In short, the compiler's main job is to analyze and optimize the model structure and then generate an instruction set that the FPGA can execute efficiently. The guiding principle of compiler optimization is: higher MAC DSP computing efficiency and lower memory access requirements.
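As an illustration only, a per-layer instruction emitted by such a compiler could carry fields like the following. The actual instruction format is not published, so every field here is an assumption inferred from the stated design goals (drive the MAC array efficiently, keep data on chip, encode inter-layer dependencies).

```python
# Illustrative (not actual) layout of a single accelerator instruction.

from dataclasses import dataclass, field


@dataclass
class Instruction:
    op: str                      # "conv", "pool", "norm", "activate", "copy", ...
    in_addr: int                 # offset in the on-chip input buffer
    out_addr: int                # offset in the on-chip output buffer
    kernel: tuple                # (kh, kw), e.g. (1, 1) or (3, 3)
    stride: int
    pad: int
    in_shape: tuple              # (channels, height, width)
    out_channels: int
    wait_on: list = field(default_factory=list)  # ids of producing instructions


# The compiler would walk the optimized graph and emit one Instruction per
# (possibly fused) layer, recording each layer's data dependencies in
# `wait_on` so the hardware modules can proceed independently.
```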
Next, we take the GoogLeNet V1 model as an example to briefly analyze the accelerator's design optimization ideas. The Inception V1 network stacks 1x1, 3x3 and 5x5 convolutions together with 3x3 pooling, which increases the width of the network and its adaptability to scale. The following figure shows the basic structure of the Inception module in the model.
Data dependency analysis
This part mainly mines the computation in the model that can be pipelined and parallelized. Pipelined design improves the utilization of the computing units in the accelerator, while parallel computing uses as many computing units as possible at the same time.
As for pipelining, the analysis covers overlapping the loading of data from DDR into SRAM on the FPGA chip with the PE computation pipeline. Through this optimization, memory access time is hidden, and the computation control flow of the whole DSP array is arranged to keep DSP utilization high.
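A software analogy for this overlap is classic double buffering: while the PE array works on one on-chip buffer, the next tile is loaded from DDR into the other. The sketch below only illustrates the scheduling idea with plain Python lists, not the actual RTL.

```python
# Software analogy of the load/compute overlap (double buffering).
# While compute() works on buffers[cur], the next tile is "loaded" into
# buffers[1 - cur], so DDR access time hides behind PE computation.

def load_tile(ddr_tiles, index, buffer):
    buffer.clear()
    buffer.extend(ddr_tiles[index])     # stands in for a DDR -> SRAM DMA


def compute(buffer):
    return sum(buffer)                  # stands in for the PE array's work


def run(ddr_tiles):
    buffers = [[], []]
    results = []
    load_tile(ddr_tiles, 0, buffers[0])          # prefetch the first tile
    for i in range(len(ddr_tiles)):
        cur = i % 2
        if i + 1 < len(ddr_tiles):
            # In hardware this load proceeds in parallel with compute() below.
            load_tile(ddr_tiles, i + 1, buffers[1 - cur])
        results.append(compute(buffers[cur]))
    return results


print(run([[1, 2], [3, 4], [5, 6]]))    # [3, 7, 11]
```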
As for parallelism, the focus is on the parallel relationship between the PE computing array and the "post-processing" modules such as activation, pooling and normalization. Determining the data dependencies and preventing conflicts is the key to the design here. In Inception, the network structure shows that the 1x1 convolutions in branches A / B / C and the pooling in branch D can be computed in parallel, with no data dependency between them. Through this optimization, the computation of the 3x3 max pooling layer can be completely overlapped.
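The same dependency reasoning can be sketched as a small graph analysis. The toy script below (branch names are illustrative) groups the operations of one Inception V1 block into waves whose members share no data dependency and can therefore run in parallel.

```python
# Toy dependency analysis of one Inception v1 block. Each op lists the ops
# it depends on; ops whose dependencies are all satisfied form one parallel wave.

inception = {
    "branch_a_1x1":      ["input"],
    "branch_b_1x1":      ["input"],
    "branch_b_3x3":      ["branch_b_1x1"],
    "branch_c_1x1":      ["input"],
    "branch_c_5x5":      ["branch_c_1x1"],
    "branch_d_pool_3x3": ["input"],
    "branch_d_1x1":      ["branch_d_pool_3x3"],
    "concat":            ["branch_a_1x1", "branch_b_3x3",
                          "branch_c_5x5", "branch_d_1x1"],
}


def parallel_waves(graph):
    done, waves = {"input"}, []
    remaining = dict(graph)
    while remaining:
        ready = [op for op, deps in remaining.items()
                 if all(d in done for d in deps)]
        waves.append(ready)
        done.update(ready)
        for op in ready:
            del remaining[op]
    return waves


for wave in parallel_waves(inception):
    print(wave)
# The first wave contains the three 1x1 convolutions and the 3x3 max pooling,
# confirming they share no data dependency and can overlap.
```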
Model optimization
The design mainly considers two aspects: exploiting optimizations in the model structure, and supporting fixed-point computation with dynamic precision adjustment.
FPGA is a device that supports massive parallel computing. Finding higher-dimensional parallelism in the model structure matters a great deal both for computing efficiency and for reducing memory access. In Inception V1, the input data of the 1x1 convolution layers at the head of branch A, branch B and branch C is exactly the same, and these convolution layers share the same stride and padding. Can we merge them by concatenating along the output feature-map (channel) dimension? After this merge, the memory access required for the input data drops to 1/3 of the original; see the sketch below.
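Here is a minimal NumPy sketch of that merge, with made-up channel counts. Because a 1x1 convolution is a matrix multiply over the channel dimension, concatenating the three branches' weights along the output-channel axis gives the same result while the input feature map is read only once.

```python
# Sketch of the branch-merging idea: the 1x1 convolutions at the head of
# branches A, B and C read the same input with the same stride/padding, so
# their weights can be concatenated along the output-channel axis and
# computed in a single pass, cutting the input-reading traffic to 1/3.

import numpy as np

cin, h, w = 192, 28, 28
x = np.random.rand(cin, h, w)

w_a = np.random.rand(64, cin)   # branch A 1x1 weights (made-up sizes)
w_b = np.random.rand(96, cin)   # branch B 1x1 weights
w_c = np.random.rand(16, cin)   # branch C 1x1 weights


def conv1x1(weight, feat):
    # A 1x1 convolution is a matrix multiply over the channel dimension.
    return (weight @ feat.reshape(cin, -1)).reshape(weight.shape[0], h, w)


# Separate passes: the input feature map is read three times.
separate = [conv1x1(w_a, x), conv1x1(w_b, x), conv1x1(w_c, x)]

# Merged pass: one concatenated weight matrix, the input is read once.
merged = conv1x1(np.concatenate([w_a, w_b, w_c], axis=0), x)

assert np.allclose(np.concatenate(separate, axis=0), merged)
```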
On the other hand, to fully exploit the acceleration characteristics of the FPGA hardware, the model's inference process needs to run in fixed point. On FPGA, int8 performance can be twice that of int16. However, so that customers inside the company and on Tencent Cloud can deploy their trained floating-point models transparently, without retraining an int8 model to control the accuracy loss, we adopt a customized int16 scheme that supports dynamic precision adjustment. With this approach, a user's trained model can be deployed directly through the compiler with no loss of accuracy.
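The idea behind dynamic-precision fixed point can be sketched as follows. This is a simplified per-tensor illustration under assumed rules (the fractional bit width is chosen from the tensor's value range), not the production scheme.

```python
# Minimal illustration of dynamic-precision fixed-point: for each tensor the
# number of fractional bits is chosen from its value range, so int16 covers
# the data with minimal loss and no retraining.

import numpy as np


def choose_fraction_bits(tensor, total_bits=16):
    # The largest magnitude decides how many integer bits are needed;
    # the remaining bits become fractional precision (one bit for the sign).
    max_val = np.abs(tensor).max()
    int_bits = max(0, int(np.ceil(np.log2(max_val + 1e-12))) + 1)
    return total_bits - 1 - int_bits


def quantize(tensor, frac_bits):
    scaled = np.round(tensor * (1 << frac_bits))
    return np.clip(scaled, -32768, 32767).astype(np.int16)


def dequantize(q, frac_bits):
    return q.astype(np.float32) / (1 << frac_bits)


weights = np.random.randn(1000).astype(np.float32) * 0.1
frac = choose_fraction_bits(weights)
restored = dequantize(quantize(weights, frac), frac)
print("fraction bits:", frac, "max abs error:", np.abs(weights - restored).max())
```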
Memory architecture design
Bandwidth is always one of the bottlenecks constraining the performance of a computing architecture, and memory access directly affects the power efficiency of the accelerator device.
In order to minimize DDR memory access during model calculation, we designed the following memory architecture:
Input buffer and output buffer are designed as ping-pong buffers to maximize pipelining and parallelism
Support for inner-copy operations within the input buffer and within the output buffer
Cross-copy operations between the input buffer and the output buffer
Through this architecture, for most current mainstream models the accelerator can keep all intermediate data on the FPGA chip; apart from loading the model weights, the intermediate layers require no additional memory access. For models whose intermediate-layer feature maps cannot be held entirely on chip, we introduce the concepts of slice in the channel dimension and part in the feature-map dimension. The compiler reasonably splits a single convolution or pooling / normalization operation and pipelines the DDR memory accesses with the FPGA computation, minimizing DDR memory traffic while giving priority to DSP computing efficiency.
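A back-of-the-envelope version of this tiling decision might look like the following. The buffer size and layer shape are made-up numbers, and the real compiler also weighs DSP efficiency when choosing a split.

```python
# Illustrative tiling check, in the spirit of the "slice" (channel dimension)
# and "part" (feature-map dimension) splits described above.

def int16_bytes(channels, height, width):
    return channels * height * width * 2        # int16 = 2 bytes per value


def plan_splits(c, h, w, buffer_bytes):
    """Return (channel_slices, spatial_parts) so that one tile fits on chip."""
    slices = parts = 1
    while int16_bytes(c // slices, h, (w + parts - 1) // parts) > buffer_bytes:
        # Prefer splitting channels first, then the feature-map width.
        if c // slices > 1:
            slices += 1
        else:
            parts += 1
    return slices, parts


# Example: a 256 x 112 x 112 intermediate feature map vs. a 4 MB output buffer.
print(plan_splits(256, 112, 112, buffer_bytes=4 * 1024 * 1024))   # (2, 1)
```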
Calculation unit design
The core of the FPGA-based general CNN accelerator is its computing unit. The current version of the accelerator is designed on the Xilinx KU115 chip. The PE computing units consist of 4096 MAC DSP cores running at 500 MHz, with a theoretical peak computing capability of about 4 Tops. Its basic organization is shown in the figure below.
The KU115 chip is composed of two stacked dies, and the accelerator accordingly places two groups of processing elements (PEs) in parallel. Each PE consists of four crossbar groups of 32x16 = 512 MAC DSP cores. The key of the design is to increase data reuse, reduce bandwidth, reuse the model weights and each layer's feature maps, and improve computing efficiency.
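The peak-throughput figure follows directly from these numbers; a quick sanity check:

```python
# 2 PEs x 4 crossbar groups x (32 x 16) DSPs = 4096 MACs at 500 MHz.
pes, xbars_per_pe, macs_per_xbar = 2, 4, 32 * 16
clock_hz = 500e6
macs = pes * xbars_per_pe * macs_per_xbar          # 4096 MAC units
ops_per_second = macs * 2 * clock_hz               # each MAC = multiply + add
print(macs, ops_per_second / 1e12, "Tops peak")    # 4096 4.096 Tops peak
```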
Application scenarios and performance comparison
At present, deep learning training is mainly done on GPUs. When deploying online inference, real-time behavior, cost and power consumption all have to be weighed when choosing an acceleration platform. Classified by deployment scenario, advertisement recommendation, speech recognition and real-time screening of image / video content are real-time AI services, while intelligent transportation, smart speakers and driverless vehicles are real-time, low-power terminal scenarios. Compared with GPU, FPGA can provide strong real-time and high-performance support for these businesses.
How does the platform do in terms of acceleration performance, development cycle and ease of use?
Acceleration performance
Taking the real GoogLeNet V1 model as an example, the CPU test environment is two 6-core CPUs (E5-2620 v3) with 64 GB of memory.
With the whole machine's CPUs fully loaded, a single KU115-based accelerator delivers 16x the CPU's throughput, the detection latency for a single image drops from 250 ms to 4 ms, and TCO drops by 90%.
Meanwhile, the FPGA's inference throughput is slightly better than that of an NVIDIA P4 GPU, while its latency is better by an order of magnitude.
Development cycle
The general CNN FPGA acceleration architecture can support the deep learning models that businesses iterate on rapidly and evolve continuously, including classic models such as GoogLeNet / VGG / ResNet / ShuffleNet / MobileNet as well as new model variants.
For classic models and self-developed algorithm variants built from standard layers, the existing acceleration architecture already provides support; the compiler can generate the model's instruction set within one day, enabling online deployment.
For self-developed special models, such as those with asymmetric convolution operators or asymmetric pooling operations, the relevant operators need to be iterated on this platform according to the actual model structure, and the development cycle for supporting them can be shortened to one to two weeks.
Ease of use
The FPGA CNN accelerator encapsulates the underlying acceleration process and provides an easy-to-use SDK to the business sides of the acceleration platform. A business can complete the acceleration by calling a few simple API functions, with almost no change to its business logic. If the online model needs to be changed, it only needs to call the model initialization function to load the corresponding model instruction set onto the FPGA; the online model can be switched in a few seconds.
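The integration could look roughly like the snippet below. The API names are stand-ins, since the article does not give the real SDK interface; only the calling pattern (initialize once, predict per request, re-initialize to switch models) reflects the description above.

```python
# Hypothetical shape of the SDK integration described above; init_model and
# predict below are stand-in stubs, not the real API.

def init_model(instruction_file):
    """Stand-in for the SDK call that loads a model's instruction set onto
    the FPGA and returns a handle reused by prediction calls."""
    return {"model": instruction_file}


def predict(handle, image):
    """Stand-in for the SDK call that runs one image through the accelerator."""
    return [0.0] * 1000            # placeholder class scores


# Business code: initialize once, then call predict per request. Switching
# the online model is just another init_model() with a different instruction
# file, which completes within seconds.
handle = init_model("googlenet_v1.bin")
scores = predict(handle, image=None)
```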
Epilogue
The FPGA-based general CNN acceleration design can greatly shorten the FPGA development cycle and keep up with the rapid iteration of business deep learning algorithms; it provides computing performance comparable to GPU, with an order-of-magnitude latency advantage over GPU. A general RNN / DNN platform is under active development, and the FPGA accelerator builds the strongest real-time AI service capability for the business.
On the cloud side, at the beginning of 2017 we launched the first domestic FPGA public cloud server on Tencent Cloud, and we will gradually bring basic AI acceleration capabilities to the public cloud.
The field of heterogeneous AI acceleration is vast and exciting. Providing the best solutions for the company's internal businesses and for cloud customers is the direction the FPGA team will keep working toward.
About the author:
Derickwang (Wang Yuwei) graduated from Huazhong University of Science and Technology in 2014 with a master's degree. His main research and practice direction is FPGA heterogeneous computing in data centers, and his project
Bao'an District, Shenzhen, China
+86 189 3806 5764