Current Situation and Optimization of FPGA Heterogeneous Computing

The FPGA-based general CNN accelerator design greatly shortens the FPGA development cycle and supports rapid iteration of the business's deep learning algorithms. It provides computing performance comparable to a GPU while holding an order-of-magnitude latency advantage, building strong real-time AI serving capability for the business.

Current Situation of Deep Learning Heterogeneous Computing

With the rapid growth of Internet users and the explosion of data volume, the demand for computing in data centers is rising sharply. At the same time, with the rise of compute-intensive fields such as artificial intelligence, high-performance data analysis, and financial analysis, the demand for computing power has far exceeded the capability of traditional CPUs.

Heterogeneous computing is considered the key technology for closing this computing gap at the current stage. "CPU + GPU" and "CPU + FPGA" are the heterogeneous computing platforms receiving the most industry attention; both offer higher efficiency and lower latency than traditional CPU parallel computing. Facing such a huge market, many technology companies have invested heavily in funding and manpower, heterogeneous programming standards are gradually maturing, and mainstream cloud service providers are actively deploying these platforms.

The industry has seen giants such as Microsoft deploy FPGAs at scale to accelerate AI inference. What advantages do FPGAs have over other devices?

- Flexibility: programmability naturally adapts to rapidly evolving ML algorithms
  - DNN, CNN, LSTM, MLP, reinforcement learning, decision trees, etc.
  - Dynamic support for arbitrary precision
  - Model compression, sparse networks, faster and better networks
- Performance: builds real-time AI serving capability
  - Lower prediction latency than GPU/CPU
  - Better performance per watt than GPU/CPU
- Scale
  - High-speed inter-board interconnect IO
  - Intel CPU-FPGA architecture

At the same time, FPGA's weaknesses are also obvious. FPGAs are developed in HDL (hardware description language), which means long development cycles and a high barrier to entry. Taking classic models such as AlexNet and GoogLeNet as examples, custom acceleration of a single model often takes months. The business side and the FPGA acceleration team must juggle both algorithm iteration and adaptation to the FPGA hardware, which is very painful.

On the one hand, we need the FPGA to provide low-latency, high-performance service that is competitive with CPU/GPU; on the other hand, the FPGA development cycle must keep up with the iteration cycle of deep learning algorithms. Based on these two points, we designed and developed a general CNN accelerator. By designing generically for mainstream model operators and letting a compiler generate instructions to drive model acceleration, it can support model switching on short notice. Meanwhile, for emerging deep learning algorithms, the relevant operators are rapidly developed and iterated on top of this general base version, reducing model acceleration development time from months to one or two weeks.

The overall framework of the FPGA-based general CNN accelerator is as follows. A CNN model trained in Caffe / TensorFlow / MXNet or another framework passes through a series of compiler optimizations to generate the instructions corresponding to that model. At the same time, the image data and model weights are preprocessed and compressed according to the optimization rules, and then delivered to the FPGA accelerator over PCIe. The FPGA accelerator works entirely from the instruction set in its instruction buffer; executing the full instruction buffer once completes the accelerated computation of the deep model on one image. Each functional module is relatively independent and responsible only for its own computation requests. The accelerator is decoupled from the deep learning model: the data dependencies and execution ordering between layers are controlled by the instruction set.
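The compile-then-execute flow above can be sketched in a few lines. This is a minimal illustration, not the actual instruction set: the opcode names, fields, and buffer names are all assumptions made for the sketch.

```python
# Hypothetical sketch: lowering a small model graph into a flat instruction
# buffer that an accelerator could execute sequentially. Opcodes, fields,
# and buffer names are illustrative, not the real instruction set.
from dataclasses import dataclass

@dataclass
class Instr:
    opcode: str   # e.g. "CONV", "POOL", "LOAD"
    src: str      # input buffer or DDR region
    dst: str      # output buffer region

def compile_model(layers):
    """Walk the layers in topological order and emit instructions;
    data dependencies are encoded through shared buffer names."""
    instrs = []
    for layer in layers:
        instrs.append(Instr("LOAD", src=f"ddr:{layer['weights']}", dst="wbuf"))
        instrs.append(Instr(layer["op"], src=layer["in"], dst=layer["out"]))
    return instrs

model = [
    {"op": "CONV", "weights": "conv1.w", "in": "ibuf", "out": "obuf"},
    {"op": "POOL", "weights": "none",    "in": "obuf", "out": "ibuf"},
]
program = compile_model(model)   # 4 instructions for this 2-layer model
```

Executing `program` front to back once would correspond to running the whole model on one input image, which is exactly the decoupling the text describes: the hardware only sees instructions, never the model itself.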

In short, the compiler's main work is to analyze and optimize the model structure and then generate an instruction set the FPGA can execute efficiently. The guiding principle of compiler optimization is: higher MAC DSP computing efficiency and fewer memory accesses.

Next, we take the GoogLeNet v1 model as an example for a brief analysis of the accelerator's design optimizations. The Inception v1 network stacks 1x1, 3x3, and 5x5 convolutions together with 3x3 pooling, which increases the width of the network and its adaptability to scale. The figure below shows the basic Inception structure in the model.

Data dependency analysis

This part mainly analyzes the pipelining and parallelism that can be mined from the model's computation. Pipelining improves the utilization of the compute units in the accelerator, while parallel computing engages as many compute units as possible at the same time.

For pipelining, the analysis covers the loading of data from DDR into on-chip SRAM and the pipelining of PE computation. Through this optimization, memory access time is overlapped with computation, and the control flow of the whole DSP array is arranged to keep DSP utilization high.

For parallelism, we need to analyze the parallel relationship between the PE compute array and "post-processing" modules such as activation, pooling, and normalization. Determining data dependencies and preventing conflicts is the key to this design. In Inception, the network structure shows that the 1x1 convolutions in branches A/B/C and the pooling in branch D can be computed in parallel, with no data dependency between them. With this optimization, the 3x3 max pooling layer's computation can be completely overlapped.
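The dependency check behind this scheduling decision can be sketched as follows. The graph encoding and branch names are assumptions for illustration; two ops are safe to run in parallel when neither consumes the other's output.

```python
# Illustrative data-dependency analysis for parallel scheduling.
# Each op records the tensors it reads ("ins") and the one it writes ("out").
def can_parallelize(a, b):
    """Two ops may run in parallel iff neither reads the other's output."""
    return a["out"] not in b["ins"] and b["out"] not in a["ins"]

# A fragment of an Inception-style module (names are illustrative):
ops = {
    "a_1x1":  {"ins": {"x"},  "out": "a0"},   # branch A, 1x1 conv
    "b_1x1":  {"ins": {"x"},  "out": "b0"},   # branch B, 1x1 conv
    "b_3x3":  {"ins": {"b0"}, "out": "b1"},   # branch B, 3x3 conv after b_1x1
    "d_pool": {"ins": {"x"},  "out": "d0"},   # branch D, 3x3 pooling
}

assert can_parallelize(ops["a_1x1"], ops["d_pool"])      # independent branches
assert not can_parallelize(ops["b_1x1"], ops["b_3x3"])   # b_3x3 waits for b0
```

This is why the branch-D pooling can be fully hidden behind the branch A/B/C 1x1 convolutions: all four read only the shared input `x`.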

Model optimization

The design mainly considers two aspects: finding optimizations in the model structure, and supporting fixed-point computation with dynamic precision adjustment.

FPGA is a device that supports massive parallel computing. Finding higher-dimensional parallelism in the model structure matters both for compute efficiency and for reducing memory accesses. In Inception v1, the input data of the first-layer 1x1 convolutions in branches A, B, and C is identical, and these convolutions share the same stride and padding, so they can be aligned and merged along the output feature-map dimension. After merging, the memory access demand for the input data drops to 1/3 of the original.
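The merge is easy to verify numerically: three 1x1 convolutions over the same input are equivalent to one convolution whose filters are concatenated along the output-channel axis, so the input is read once instead of three times. A minimal NumPy check (shapes are illustrative):

```python
import numpy as np

# Three 1x1 convolutions sharing one input (same stride/padding) fuse into
# a single convolution by concatenating filters on the output-channel axis.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8, 8))                    # input: C x H x W
W_a, W_b, W_c = (rng.standard_normal((4, 16)) for _ in range(3))

def conv1x1(W, X):
    # A 1x1 convolution is a matrix multiply over the channel dimension
    return np.tensordot(W, X, axes=([1], [0]))         # -> O x H x W

separate = np.concatenate([conv1x1(W, X) for W in (W_a, W_b, W_c)], axis=0)
fused = conv1x1(np.concatenate([W_a, W_b, W_c], axis=0), X)
assert np.allclose(separate, fused)   # same result, 1/3 the input reads
```

The fused weight matrix performs exactly the same multiply-accumulates, which is why the transformation is lossless while cutting the input-feature-map traffic to one pass.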

On the other hand, to fully exploit the acceleration characteristics of the FPGA hardware, the model's inference process needs to run in fixed point. On FPGA, int8 performance can be twice that of int16. However, so that customers inside the company and on Tencent Cloud can deploy their trained floating-point models without friction, and without retraining an int8 model to control accuracy loss, we adopted a customized int16 scheme that supports dynamic precision adjustment. With this approach, a user's trained model can be deployed directly through the compiler without loss of accuracy.
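The idea of dynamic-precision fixed point can be sketched like this: each tensor gets its own fractional bit width chosen from its value range, so no retraining is needed. The bit-allocation rule below is an assumption for illustration, not the accelerator's actual scheme.

```python
import numpy as np

# Sketch of dynamic-precision int16 fixed point: pick per-tensor fractional
# bits from the tensor's dynamic range, then quantize. Illustrative only.
def choose_frac_bits(x, word_bits=16):
    # Reserve enough integer bits (plus sign) for the largest magnitude,
    # then spend the remaining bits on the fraction.
    int_bits = max(0, int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-12))) + 1)
    return word_bits - 1 - int_bits

def quantize(x, frac_bits):
    scale = 2.0 ** frac_bits
    return np.clip(np.round(x * scale), -32768, 32767).astype(np.int16)

x = np.array([0.5, -1.25, 3.1416, -2.0])
fb = choose_frac_bits(x)                       # range-dependent, here 12
xq = quantize(x, fb)
x_hat = xq.astype(np.float64) / 2.0 ** fb      # dequantized reconstruction
assert np.max(np.abs(x - x_hat)) < 2.0 ** -fb  # rounding error stays tiny
```

Because the scale adapts per tensor, small-valued layers keep many fractional bits while large-valued layers keep enough headroom, which is what lets an unmodified floating-point model run in int16 with negligible error.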

Memory architecture design

Bandwidth is always one of the bottlenecks restricting the performance of computer architecture. At the same time, memory access directly affects the power consumption efficiency of accelerator devices.

To minimize DDR memory accesses during model computation, we designed the following memory architecture:

- Input buffer and output buffer are ping-pong buffered to maximize pipelining and parallelism
- Inner copy operations are supported within the input buffer and within the output buffer
- Cross copy operations are supported between the input buffer and the output buffer
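The benefit of the ping-pong buffers can be shown with a simple timing model (the numbers and formulas are illustrative, not measurements): while the PE computes on one buffer, the next tile loads into the other, so all but the first load hides behind computation.

```python
# Toy timing model for double (ping-pong) buffering vs. serial operation.
def serial_time(n_tiles, load_t, comp_t):
    # Without overlap, every tile pays its full load + compute cost.
    return n_tiles * (load_t + comp_t)

def pingpong_time(n_tiles, load_t, comp_t):
    # First load is exposed; each later load hides behind a compute step.
    return load_t + (n_tiles - 1) * max(load_t, comp_t) + comp_t

# 8 tiles, 2 cycles to load, 5 cycles to compute:
assert serial_time(8, 2, 5) == 56
assert pingpong_time(8, 2, 5) == 42   # loads fully hidden after the first
```

When `load_t <= comp_t`, as in this example, the pipeline is compute-bound and memory access time vanishes from the total, which is the utilization goal stated earlier.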

With this architecture, for most current mainstream models, the accelerator can keep all intermediate data on the FPGA chip; apart from loading the model weights, the intermediate layers require no additional memory operations. For models whose intermediate-layer feature maps cannot be stored entirely on chip, we introduce slicing in the channel dimension and partitioning in the feature-map dimension. The compiler splits a single convolution or pooling/norm operation appropriately and pipelines the DDR memory accesses with the FPGA-accelerated computation, minimizing DDR memory access demand while giving priority to DSP compute efficiency.
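A minimal sketch of the channel-dimension slicing decision, assuming illustrative buffer and tensor sizes: the compiler computes how many channels fit in the on-chip buffer and splits the operation into that many slices.

```python
# Illustrative slice planner: split a feature map along the channel
# dimension so each slice fits the on-chip buffer. Sizes are examples.
def plan_slices(channels, h, w, bytes_per_elem, buffer_bytes):
    per_channel = h * w * bytes_per_elem
    max_ch = max(1, buffer_bytes // per_channel)   # channels per slice
    return [(c0, min(c0 + max_ch, channels))
            for c0 in range(0, channels, max_ch)]

# 256 channels of 56x56 int16 is ~1.6 MB; a 512 KB buffer holds 83 channels,
# so the layer is split into 4 slices that are processed in a pipeline.
plan = plan_slices(256, 56, 56, 2, 512 * 1024)
```

Each slice's DDR load can then be overlapped with the previous slice's computation, exactly as the text describes for oversized models.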

Compute unit design

The core of the FPGA-based general CNN accelerator is its compute unit. The current version of the accelerator is designed around the Xilinx KU115 chip. The PE compute unit consists of 4,096 MAC DSP cores running at 500 MHz, with a theoretical peak of about 4 TOPS. Its basic organization is shown in the figure below.

The KU115 chip is a stack of two dies, and the accelerator correspondingly places two groups of processing elements (PEs) in parallel. Each PE consists of four Xbar groups of 32x16 = 512 MAC DSP cores. The key to the design is to improve data reuse, reduce bandwidth demand, reuse the model weights and each layer's feature maps, and thereby raise compute efficiency.
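The organization above accounts for the headline numbers; a quick worked check (counting one multiply and one add per MAC per cycle):

```python
# Arithmetic check of the PE organization described in the text:
# 2 PEs x 4 Xbar groups x (32 x 16) MAC DSP cores at 500 MHz.
pes, xbars, macs_per_xbar = 2, 4, 32 * 16
total_macs = pes * xbars * macs_per_xbar
peak_ops = total_macs * 2 * 500e6        # multiply + add per MAC per cycle

assert total_macs == 4096
assert peak_ops == 4.096e12              # ~4 TOPS theoretical peak
```
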

Application scenarios and performance comparison

At present, deep learning training is mainly done on GPUs. When deploying online inference, the acceleration platform must be chosen with real-time performance, low cost, and low power consumption in mind. Grouped by deployment scenario, advertising recommendation, speech recognition, and real-time monitoring of image/video content are real-time AI services, while intelligent transportation, smart speakers, and autonomous driving are real-time low-power terminal scenarios. Compared with GPU, FPGA can provide strong real-time, high-performance support for these businesses.

How does the platform fare on performance, development cycle, and ease of use?

Acceleration performance

Taking the real GoogLeNet v1 model as an example, the CPU test environment was two 6-core CPUs (E5-2620 v3) with 64 GB of memory.

With the machine's CPUs fully loaded, a single KU115-based accelerator delivers 16x the CPU's throughput, single-image detection latency drops from 250 ms to 4 ms, and TCO is reduced by 90%.

Meanwhile, the FPGA's inference throughput is slightly better than that of an NVIDIA P4 GPU, while its latency is better by an order of magnitude.

Development cycle

The general CNN FPGA acceleration architecture can support deep learning models through rapid business iteration and continuous evolution, including classic models such as GoogLeNet / VGG / ResNet / ShuffleNet / MobileNet as well as new model variants.

For classic models and self-developed algorithm variants built from standard layers, the existing acceleration architecture already provides support: the model's instruction set can be produced by the compiler within one day for online deployment.

For special self-developed models, such as asymmetric convolution operators or asymmetric pooling operations, the relevant operators must be iteratively developed on this platform according to the actual model structure; the development cycle to support them can be shortened to one or two weeks.

Ease of use

The FPGA CNN accelerator encapsulates the underlying acceleration flow and provides an easy-to-use SDK to the business parties on the acceleration platform. A business can complete acceleration by calling a few simple API functions, with almost no change to its business logic.
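The usage pattern described above might look like the following sketch. The class and function names here are invented for illustration; the actual SDK's API is not described in this article.

```python
# Hypothetical SDK usage sketch; init_model/forward are invented names.
class FpgaAccelerator:
    """Stand-in for the SDK: hides instruction download and PCIe I/O."""
    def __init__(self):
        self.model = None

    def init_model(self, instruction_set):
        # Download a compiled instruction set to the FPGA; switching models
        # is just re-initializing with a different instruction set.
        self.model = instruction_set

    def forward(self, image):
        # In a real SDK this would be a PCIe round trip; here it's a stub.
        assert self.model is not None, "call init_model first"
        return f"result<{self.model}:{image}>"

acc = FpgaAccelerator()
acc.init_model("googlenet_v1.bin")   # seconds-scale model switch
out = acc.forward("img_001")         # business code only calls the API
```

The point of the pattern is that business logic only touches two calls, which matches the claim that adoption requires almost no code changes.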

Epilogue

The FPGA-based general CNN accelerator design greatly shortens the FPGA development cycle and supports rapid iteration of the business's deep learning algorithms. It provides computing performance comparable to a GPU with an order-of-magnitude latency advantage. A general RNN/DNN platform is under intensive development, and the FPGA accelerator builds strong real-time AI serving capability for the business.

In the cloud, at the beginning of 2017 we launched the first FPGA public cloud servers in China on Tencent Cloud, and we will gradually bring basic AI acceleration capability to the public cloud.

The battlefield of AI heterogeneous acceleration is vast and exciting. Providing the best solutions for the company's internal businesses and cloud customers is the direction of the FPGA team's continued effort.

If the online model needs to change, simply call the model initialization function to load the corresponding model's instruction set onto the FPGA; the acceleration service can be switched within seconds.

About the author:

Derick Wang (Wang Yuwei) graduated from Huazhong University of Science and Technology in 2014 with a master's degree. His main research and practice direction is FPGA heterogeneous computing in the data center, and his project

