Dong Hwa University Library (東華大學圖書館)

Hardware Acceleration of Deep Convolutional Neural Networks on FPGA.

Record type:	Bibliographic - Electronic resource : Monograph/item
Title/Author:	Hardware Acceleration of Deep Convolutional Neural Networks on FPGA.
Author:	Ma, Yufei.
Publisher:	Ann Arbor : ProQuest Dissertations & Theses, 2018
Extent:	170 p.
Notes:	Source: Dissertations Abstracts International, Volume: 80-06, Section: B.
Contained By:	Dissertations Abstracts International 80-06B.
Subject:	Computer Engineering.
Electronic resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10979649
ISBN:	9780438713604

Thesis (Ph.D.)--Arizona State University, 2018.

This item must not be sold to any third party vendors.

Rapid improvements in computation capability have made deep convolutional neural networks (CNNs) highly successful in recent years, with significantly improved accuracy on many computer vision tasks. During inference, many applications demand low-latency processing of a single image under strict power constraints. This reduces the efficiency of GPUs and other general-purpose platforms and creates opportunities for dedicated acceleration hardware, e.g. FPGAs, whose digital circuits can be customized for deep learning inference. However, deploying CNNs on portable and embedded systems remains challenging because of large data volumes, intensive computation, varying algorithm structures, and frequent memory accesses. This dissertation proposes a complete design methodology and framework to accelerate the inference of various CNN algorithms on FPGA hardware with high performance, efficiency, and flexibility.

Because convolution accounts for most of the operations in CNNs, the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply-and-accumulate (MAC) operations within four levels of loops. Unless the convolution loop optimization is studied thoroughly before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator with respect to multiple design variables. An efficient dataflow and hardware architecture for CNN acceleration are proposed to minimize data communication while maximizing resource utilization to achieve high performance.

Although customizing the FPGA hardware for each CNN model can achieve high performance and efficiency, it requires significant effort and expertise, leading to long development times that make it difficult to keep up with the rapid development of CNN algorithms. This work presents an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, enabling fast high-level prototyping of CNNs from software to FPGA while keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model the different operations at each layer. The integration and dataflow of the physical modules are predefined in a top-level system template and reconfigured during compilation for a given CNN algorithm. The runtime control of layer-by-layer sequential computation is managed by the proposed execution schedule, so that even highly irregular and complex network topologies, e.g. GoogLeNet and ResNet, can be compiled. The proposed methodology is demonstrated with various CNN algorithms, e.g. NiN, VGG, GoogLeNet and ResNet, on two different standalone FPGAs, achieving state-of-the-art performance.

Even with the optimized acceleration strategy, many design options remain, e.g. the degree and dimension of computation parallelism, the size of on-chip buffers, and the external memory bandwidth. These options affect the utilization of computation resources and the efficiency of data communication, and ultimately the performance and energy consumption of the accelerator. The large design space makes it impractical to search for the optimal design choice during the actual implementation phase. Therefore, a performance model is proposed to quantitatively estimate the accelerator's performance and resource utilization. With this model, the performance bottleneck and design bounds can be identified, and the optimal design option can be explored early in the design phase.
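
Illustrative note (not part of the catalog record or the dissertation): the abstract describes convolution as MAC operations inside four levels of loops without naming the levels. The sketch below shows one plausible reading in plain Python. The split into kernel window, input feature maps, spatial positions, and output feature maps is an assumption, as are the function name `conv_layer` and all variable names. It is a software reference for the loop nest only, not the accelerator's dataflow.

```python
def conv_layer(inputs, weights, stride=1):
    """Direct convolution written as a loop nest of MAC operations.

    inputs:  nested lists shaped [n_if][in_y][in_x]
    weights: nested lists shaped [n_of][n_if][k_y][k_x]
    Returns (outputs shaped [n_of][out_y][out_x], number of MACs performed).
    No padding is applied; all names here are hypothetical.
    """
    n_of, n_if = len(weights), len(inputs)
    k_y, k_x = len(weights[0][0]), len(weights[0][0][0])
    out_y = (len(inputs[0]) - k_y) // stride + 1
    out_x = (len(inputs[0][0]) - k_x) // stride + 1
    outputs = [[[0] * out_x for _ in range(out_y)] for _ in range(n_of)]
    macs = 0
    for o_f in range(n_of):                  # assumed level 4: output feature maps
        for oy in range(out_y):              # assumed level 3: positions in one map
            for ox in range(out_x):
                for i_f in range(n_if):      # assumed level 2: input feature maps
                    for ky in range(k_y):    # assumed level 1: kernel window
                        for kx in range(k_x):
                            pixel = inputs[i_f][oy * stride + ky][ox * stride + kx]
                            outputs[o_f][oy][ox] += pixel * weights[o_f][i_f][ky][kx]
                            macs += 1
    return outputs, macs


# One 3x3 input map convolved with one 2x2 kernel of ones.
example_out, example_macs = conv_layer(
    [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]],
    [[[[1, 1], [1, 1]]]],
)
```

The MAC count is the product of all loop bounds, which is why the order, tiling, and unrolling of these loops determine how much data can be reused on chip.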

ISBN: 9780438713604

Subjects--Topical Terms:
1567821 Computer Engineering.

Hardware Acceleration of Deep Convolutional Neural Networks on FPGA.
LDR:04887nmm a2200337 4500
001 2207840
005 20190923114240.5
008 201008s2018 ||||||||||||||||| ||eng d
020 $a 9780438713604
035 $a (MiAaPQ)AAI10979649
035 $a (MiAaPQ)asu:18365
035 $a AAI10979649
040 $a MiAaPQ $c MiAaPQ
100 1 $a Ma, Yufei. $3 3434841
245 1 0 $a Hardware Acceleration of Deep Convolutional Neural Networks on FPGA.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2018
300 $a 170 p.
500 $a Source: Dissertations Abstracts International, Volume: 80-06, Section: B.
500 $a Publisher info.: Dissertation/Thesis.
500 $a Advisor: Vrudhula, Sarma;Seo, Jae-sun.
502 $a Thesis (Ph.D.)--Arizona State University, 2018.
506 $a This item must not be sold to any third party vendors.
590 $a School code: 0010.
650 4 $a Computer Engineering. $3 1567821
650 4 $a Electrical engineering. $3 649834
650 4 $a Artificial intelligence. $3 516317
690 $a 0464
690 $a 0544
690 $a 0800
710 2 $a Arizona State University. $b Electrical Engineering. $3 1671741
773 0 $t Dissertations Abstracts International $g 80-06B.
790 $a 0010
791 $a Ph.D.
792 $a 2018
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10979649