Current Situation and Optimization of FPGA Heterogeneous Computing

A general-purpose FPGA-based CNN accelerator design can greatly shorten the FPGA development cycle and keep pace with the rapid iteration of the deep learning algorithms used by business services. It delivers computing performance comparable to a GPU with latency that is an order of magnitude lower, giving the business a strong real-time AI service capability.

Current situation of deep learning heterogeneous computing

With the rapid growth of Internet users and the rapid expansion of data volume, the demand for computing in data centers is also rising rapidly. At the same time, with the rise of computing intensive fields such as artificial intelligence, high-performance data analysis and financial analysis, the demand for computing power has far exceeded the capacity of traditional CPU processors.

Heterogeneous computing is considered the key technology for closing this computing gap at the present stage. "CPU + GPU" and "CPU + FPGA" are currently the heterogeneous computing platforms that attract the most industry attention. They offer higher efficiency and lower latency than traditional CPU parallel computing. Faced with such a large market, many technology companies have invested substantial funds and manpower, the development standards for heterogeneous programming are gradually maturing, and the mainstream cloud service providers are actively positioning themselves.

As the industry can see, giants such as Microsoft have deployed FPGAs at scale to accelerate AI inference. What advantages do FPGAs have over other devices?

Flexibility: programmability naturally adapts to rapidly evolving ML algorithms

DNN, CNN, LSTM, MLP, reinforcement learning, decision trees, etc.

Dynamic support for arbitrary precision

Model compression, sparse networks, faster and better networks

Performance: builds real-time AI service capability

Lower prediction latency than GPU / CPU

Higher performance per watt than GPU / CPU

Scale:

High-speed interconnect I/O between boards

Intel CPU + FPGA architecture

At the same time, the weaknesses of FPGAs are also obvious. FPGAs are developed in a hardware description language (HDL), which means a long development cycle and a high entry threshold. Taking individual classic models such as AlexNet and GoogLeNet as examples, customizing and accelerating a single model often takes months. The business side and the FPGA acceleration team must keep up with algorithm iteration while also adapting the FPGA hardware acceleration, which is very painful.

On the one hand, we need the FPGA to provide low-latency, high-performance services that are competitive with CPU / GPU. On the other hand, we need the FPGA development cycle to keep up with the iteration cycle of deep learning algorithms. Based on these two requirements, we designed and developed a general CNN accelerator. Its design is general across the operators of mainstream models, and a compiler generates the instructions that drive model acceleration, so model switching can be supported in a short time. For emerging deep learning algorithms, the relevant operators are developed iteratively on top of this general base version, which reduces the acceleration development time for a model from several months to one to two weeks.

The overall framework of the general FPGA-based CNN accelerator is as follows. A CNN model trained with Caffe, TensorFlow, MXNet, or another framework passes through a series of compiler optimizations that generate the instructions for that model. At the same time, image data and model weights are preprocessed and compressed according to the optimization rules and then sent to the FPGA accelerator over PCIe. The accelerator works entirely from the instruction set in its instruction buffer: executing the full buffer once completes the accelerated inference of the deep model for one image. Each functional module is relatively independent and only serves the computation requests addressed to it. The accelerator is decoupled from the deep learning model; the data dependencies and execution order of the layers are controlled by the instruction set.
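
The instruction-driven execution described above can be illustrated with a minimal sketch. The instruction fields, module names, and scheduling loop below are assumptions made for illustration; the article does not describe the real instruction format.

```python
from dataclasses import dataclass, field


@dataclass
class Instruction:
    """One hypothetical accelerator instruction (fields are illustrative)."""
    name: str                                       # label used for dependency tracking
    module: str                                     # functional module that serves it
    depends_on: list = field(default_factory=list)  # names that must finish first


def run_instruction_buffer(buffer):
    """Issue every instruction once, honoring the dependencies it records.

    Returns the (module, name) pairs in the order they were issued.
    """
    done = set()
    issued = []
    pending = list(buffer)
    while pending:
        ready = [ins for ins in pending
                 if all(dep in done for dep in ins.depends_on)]
        if not ready:
            raise ValueError("dependency cycle or missing producer")
        for ins in ready:
            issued.append((ins.module, ins.name))
            done.add(ins.name)
        pending = [ins for ins in pending if ins.name not in done]
    return issued


# A made-up four-instruction program for one image.
program = [
    Instruction("load_image", "dma"),
    Instruction("conv1", "conv", ["load_image"]),
    Instruction("relu1", "activation", ["conv1"]),
    Instruction("pool1", "pooling", ["relu1"]),
]
order = run_instruction_buffer(program)
print(order)
```

The point of the sketch is that the execution order comes from the dependencies stored in the instructions, not from any knowledge of the model inside the executor.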

In short, the main job of the compiler is to analyze and optimize the model structure and then generate an instruction set that the FPGA can execute efficiently. The guiding principle of compiler optimization is higher MAC / DSP computing efficiency and lower memory access demand.

Next, we take the GoogLeNet v1 model as an example and briefly analyze the optimization ideas behind the accelerator design. The Inception v1 network stacks 1x1, 3x3, and 5x5 convolutions together with 3x3 pooling, which increases the width of the network and its adaptability to scale. The following figure shows the basic structure of the inception module in the model.

Data dependency analysis

This part mainly analyzes the model to find the computation that can be pipelined and parallelized. Pipelining improves the utilization of the computing units in the accelerator, and parallelization keeps as many computing units as possible busy at the same time.

For pipelining, the analysis covers the pipeline between loading data from DDR into on-chip SRAM and the computation in the PE; this optimization overlaps memory access time with computation. It also covers the control flow of the DSP computation across a whole column, which keeps DSP utilization high.

For parallelism, the focus is on the parallel relationship between the PE computing array and the "post-processing" modules such as activation, pooling, and normalization. Determining the data dependencies and preventing conflicts is the key to the design here. The inception network structure shows that the 1x1 convolutions in branches A / B / C and the pooling in branch D have no data dependency on each other and can be computed in parallel. With this optimization, the computation of the 3x3 max pooling layer can be completely overlapped.
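
The benefit of overlapping DDR loads with PE computation can be estimated with a small timing model. This is a sketch under simplifying assumptions: exactly two on-chip buffers, one load engine, one compute engine, and made-up per-tile times. It is not a model of the actual accelerator.

```python
def serial_time(load_ms, compute_ms):
    """Total time when every tile is loaded and then computed, with no overlap."""
    return sum(load_ms) + sum(compute_ms)


def pipelined_time(load_ms, compute_ms):
    """Total time with two buffers: tile i+1 loads while tile i computes.

    A load may start once the previous load has finished and the buffer it
    targets has been released by the compute of the tile two steps earlier.
    """
    load_end = []
    compute_end = []
    for i, (load, compute) in enumerate(zip(load_ms, compute_ms)):
        prev_load = load_end[i - 1] if i >= 1 else 0
        buffer_free = compute_end[i - 2] if i >= 2 else 0
        load_end.append(max(prev_load, buffer_free) + load)
        prev_compute = compute_end[i - 1] if i >= 1 else 0
        compute_end.append(max(load_end[i], prev_compute) + compute)
    return compute_end[-1] if compute_end else 0


loads = [2, 2, 2, 2]      # hypothetical DDR load time per tile, in ms
computes = [3, 3, 3, 3]   # hypothetical PE compute time per tile, in ms
print(serial_time(loads, computes))     # 20
print(pipelined_time(loads, computes))  # 14
```

In this example only the first load is exposed; every later load hides behind the compute of the previous tile.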
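
One way to find operations without mutual data dependencies is to group the layer graph by dependency depth. The sketch below applies that idea to an inception module; the operation names and the graph encoding are illustrative assumptions, and the article does not say that the compiler uses this particular algorithm.

```python
def parallel_levels(deps):
    """Group operations by dependency depth.

    Operations in the same level cannot depend on each other, so they are
    candidates for parallel execution.
    """
    level = {}

    def depth(op):
        if op not in level:
            level[op] = 1 + max((depth(d) for d in deps[op]), default=-1)
        return level[op]

    for op in deps:
        depth(op)
    grouped = {}
    for op, lvl in level.items():
        grouped.setdefault(lvl, []).append(op)
    return [sorted(grouped[lvl]) for lvl in sorted(grouped)]


# Inception v1 module: each operation maps to the operations it reads from.
inception = {
    "input": [],
    "a_conv1x1": ["input"],
    "b_conv1x1": ["input"],
    "b_conv3x3": ["b_conv1x1"],
    "c_conv1x1": ["input"],
    "c_conv5x5": ["c_conv1x1"],
    "d_maxpool3x3": ["input"],
    "d_conv1x1": ["d_maxpool3x3"],
    "concat": ["a_conv1x1", "b_conv3x3", "c_conv5x5", "d_conv1x1"],
}
levels = parallel_levels(inception)
for ops in levels:
    print(ops)
```

The second level contains the three 1x1 convolutions and the 3x3 max pooling, which matches the observation that pooling in branch D can run while the PE array computes the convolutions.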

Model optimization

The design mainly considers two aspects: finding optimizations in the model structure, and supporting fixed-point computation with dynamically adjustable precision.

An FPGA is a device that supports massive parallel computation. Finding higher-dimensional parallelism in the model structure matters a great deal for computing efficiency and for reducing memory access. In Inception v1, the 1x1 convolution layers at the start of branches A, B, and C take exactly the same input data, and their stride and pad settings are identical. This means the three convolutions can be aligned and stacked along the output feature map dimension. After stacking, the memory access demand for input data drops to 1 / 3 of the original.

On the other hand, to take full advantage of the acceleration characteristics of FPGA hardware, the inference process of the model needs to run in fixed point. On an FPGA, int8 performance can be twice that of int16. However, we want users inside the company and on Tencent Cloud to deploy their trained floating-point models transparently, without retraining an int8 model to control accuracy loss. We therefore adopt a customized int16 scheme that supports dynamic precision adjustment. With this method, a user-trained model can be deployed directly through the compiler without loss of accuracy.
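
The stacking of the three 1x1 convolutions can be checked with a small example. A 1x1 convolution is a per-pixel matrix-vector product, so concatenating the weight rows of the three branches along the output channel dimension gives the same results from a single pass over the input. The tiny integer weights and pixel values below are made up for illustration.

```python
def conv1x1(pixels, weights):
    """1x1 convolution without bias.

    pixels:  list of per-pixel input-channel vectors
    weights: list of rows, one row of input-channel weights per output channel
    """
    return [[sum(w * x for w, x in zip(row, px)) for row in weights]
            for px in pixels]


pixels = [[1, 2, 3], [4, 5, 6]]   # two pixels, three input channels
w_a = [[1, 0, 0], [0, 1, 0]]      # branch A: two output channels
w_b = [[1, 1, 1]]                 # branch B: one output channel
w_c = [[2, 0, -1]]                # branch C: one output channel

# Separate execution reads the input feature map three times.
separate = [conv1x1(pixels, w) for w in (w_a, w_b, w_c)]

# Merged execution reads it once, then splits the output channels per branch.
merged = conv1x1(pixels, w_a + w_b + w_c)
out_a = [px[0:2] for px in merged]
out_b = [px[2:3] for px in merged]
out_c = [px[3:4] for px in merged]

print(merged)
print([out_a, out_b, out_c] == separate)
```

The merge is only valid because the three layers share the same input, stride, and padding, as noted above.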
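
The idea of dynamic precision in a 16-bit fixed-point format can be sketched as choosing the position of the binary point per tensor from its value range. This is a common dynamic fixed-point technique and only an assumption about what the customized int16 scheme might look like; the function names and sample values are hypothetical.

```python
import math


def choose_frac_bits(values, total_bits=16):
    """Pick the number of fractional bits so the largest magnitude still fits."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    # Bits needed to the left of the binary point (never fewer than zero here).
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return total_bits - 1 - int_bits  # one bit is reserved for the sign


def quantize(values, frac_bits, total_bits=16):
    """Convert floats to signed fixed-point codes, saturating at the int range."""
    scale = 2.0 ** frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return [min(hi, max(lo, round(v * scale))) for v in values]


def dequantize(codes, frac_bits):
    """Convert fixed-point codes back to floats."""
    scale = 2.0 ** frac_bits
    return [c / scale for c in codes]


weights = [0.75, -0.5, 0.125]   # small values: all 15 bits go to the fraction
activations = [3.5, -1.25]      # larger values: two bits go to the integer part

w_bits = choose_frac_bits(weights)
a_bits = choose_frac_bits(activations)
print(w_bits, quantize(weights, w_bits))
print(a_bits, quantize(activations, a_bits))
```

Because each tensor gets its own binary point, small-valued weights keep fine resolution while large-valued activations avoid overflow.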

Memory architecture design

Bandwidth is always one of the bottlenecks restricting the performance of computer architecture. At the same time, memory access directly affects the power consumption efficiency of accelerator devices.

In order to minimize DDR memory access during model calculation, we designed the following memory architecture:

Ping-pong design of the input buffer and output buffer to maximize pipelining and parallelism

Support for inner-copy operations within the input buffer and within the output buffer

Support for cross-copy operations between the input buffer and the output buffer

With this architecture, the accelerator can hold all intermediate data on the FPGA chip for most current mainstream models. Apart from loading the model weights, the intermediate computation requires no additional memory operations. For models whose intermediate feature maps cannot be stored entirely on chip, we introduce the concepts of a slice, a split along the channel dimension, and a part, a split along the feature map dimension. The compiler splits a single convolution or pooling / norm operation in a reasonable way and pipelines the DDR memory access with the FPGA computation. With priority given to DSP computing efficiency, the DDR memory access demand is minimized.
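
Splitting along the channel dimension can be sketched as a simple capacity calculation. The buffer size, feature map shape, and function name below are hypothetical, and the real compiler presumably also weighs DSP efficiency when choosing the split.

```python
def plan_channel_slices(channels, height, width, bytes_per_value, buffer_bytes):
    """Split a feature map along the channel dimension so each slice fits on chip.

    Returns a list of (first_channel, channel_count) tuples.
    """
    bytes_per_channel = height * width * bytes_per_value
    if bytes_per_channel > buffer_bytes:
        raise ValueError("one channel does not fit; a spatial split is needed too")
    max_channels = buffer_bytes // bytes_per_channel
    slices = []
    start = 0
    while start < channels:
        count = min(max_channels, channels - start)
        slices.append((start, count))
        start += count
    return slices


# Hypothetical case: 64 channels of 112x112 int16 values, 1 MiB on-chip buffer.
plan = plan_channel_slices(channels=64, height=112, width=112,
                           bytes_per_value=2, buffer_bytes=1 << 20)
print(plan)  # [(0, 41), (41, 23)]
```

Each slice can then be loaded from DDR while the previous one is being computed, which is where the pipelining mentioned above comes in.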

Calculation unit design

The core of the general FPGA-based CNN accelerator is its computing unit. The current version of the accelerator is designed for the Xilinx KU115 chip. The PE computing unit consists of 4096 MAC DSP cores running at 500 MHz, for a theoretical peak computing capacity of 4 TFLOPS. Its basic organization is shown in the figure below.

The KU115 chip is built from two stacked dies, and two processing units (PEs) are placed in parallel in the accelerator. Each PE consists of four Xbar groups, each made up of 32x16 = 512 MAC DSP cores. The key to the design is to increase data reuse and reduce bandwidth, reusing both the model weights and the feature map of each layer to improve computing efficiency.
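
The peak figure follows from the organization just described. The short calculation below reproduces it, assuming each MAC counts as two operations per cycle (one multiply and one accumulate), which is the usual convention but is not stated in the article.

```python
PE_COUNT = 2              # two PEs placed in parallel
XBARS_PER_PE = 4          # four Xbar groups per PE
MACS_PER_XBAR = 32 * 16   # 512 MAC DSP cores per Xbar
CLOCK_HZ = 500_000_000    # 500 MHz
OPS_PER_MAC = 2           # assumption: multiply + accumulate per cycle

total_macs = PE_COUNT * XBARS_PER_PE * MACS_PER_XBAR
peak_ops_per_second = total_macs * OPS_PER_MAC * CLOCK_HZ

print(total_macs)                   # 4096
print(peak_ops_per_second / 1e12)   # 4.096
```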

Application scenarios and performance comparison

At present, deep learning training is mainly done on GPUs. When deploying online inference, the acceleration platform must be chosen by weighing real-time requirements, cost, and power consumption together. Classified by deployment scenario, advertising recommendation, speech recognition, and real-time monitoring of image / video content are real-time AI services, while intelligent transportation, smart speakers, and driverless vehicles are real-time, low-power terminal scenarios. Compared with a GPU, an FPGA can provide strong real-time, high-performance support for these businesses.

How does the platform fare for users in terms of performance, development cycle, and ease of use?

Acceleration performance

Taking the GoogLeNet v1 model as an example, the CPU test environment is two 6-core CPUs (E5-2620 v3) with 64 GB of memory.

With the CPUs of the whole machine fully loaded, a single KU115-based accelerator delivers performance 16 times higher than that of the CPUs, reduces the detection latency for a single image from 250 ms to 4 ms, and cuts TCO by 90%.

Meanwhile, the prediction performance of the FPGA is slightly better than that of an NVIDIA P4 GPU, and its latency is an order of magnitude lower.

Development cycle

The general CNN FPGA acceleration architecture supports deep learning models as they iterate rapidly and evolve with the business, including classic models such as GoogLeNet / VGG / ResNet / ShuffleNet / MobileNet and new model variants.

Classic models, and self-developed algorithm variants built from standard layers, are already supported by the existing acceleration architecture. The compiler can generate the instruction set for such a model within one day so that it can be deployed online.

Self-developed special models, for example those with asymmetric convolution operators or asymmetric pooling operations, require iterative development of the relevant operators on this platform according to the actual model structure. The development cycle for such support can be shortened to one to two weeks.

Ease of use

The FPGA CNN accelerator encapsulates the underlying acceleration process and provides an easy-to-use SDK to the business teams using the acceleration platform. A business team can run accelerated inference by calling simple API functions, with almost no change to its business logic.

Epilogue

A general-purpose FPGA-based CNN accelerator design can greatly shorten the FPGA development cycle and keep pace with the rapid iteration of the deep learning algorithms used by business services. It delivers computing performance comparable to a GPU with latency that is an order of magnitude lower. A general RNN / DNN platform is under intensive development, and the FPGA accelerator gives the business a strong real-time AI service capability.

In the cloud, we launched the first FPGA public cloud server in China on Tencent Cloud at the beginning of 2017. We will gradually bring the basic AI acceleration capability to the public cloud.

The battlefield of AI heterogeneous acceleration is vast and exciting. Providing the best solution for the company's internal and cloud businesses is the goal the FPGA team keeps working toward.

If the online model needs to be changed, simply call the model initialization function to load the corresponding model instruction set onto the FPGA. The acceleration service can be switched within a few seconds.

About the author:

Derickwang (Wang Yuwei) graduated from Huazhong University of Science and Technology in 2014 with a master's degree. His main research and practice focus is FPGA heterogeneous computing in the data center, and his project
