Novel Edge AI with Power-Efficient Re-configurable MAC Processing Elements / Mahmoud Kamel Ismail Hasabrabou

Material type: Text
Language: English
Summary language: English, Arabic
Publication details: 2024
Description: 80 p. : ill. ; 21 cm
DDC classification: 621

Holdings:
Item type: Thesis
Current library: Main library
Call number: 621/M.H.N/ 2024
Status: Not for loan

Supervisor: Prof. Dr. Ahmed H. Madian

Thesis (M.A.)—Nile University, Egypt, 2024.

"Includes bibliographical references"

Contents:
Acknowledgments
List of Figures
List of Tables
List of Abbreviations
Abstract
Chapters:
1. Introduction
   1.1 Introduction
   1.2 Thesis Objectives
   1.3 Research Organization
2. Literature Survey
   2.1 Background and Related Work Survey
      2.1.1 Common Strategies to Decrease Power in AI Edge Devices
      2.1.2 Neural Network Architecture Optimization Techniques
      2.1.3 Lightweight Networks
      2.1.4 Neural Networks with Reduced Precision (int8)
      2.1.5 State-of-the-Art MAC Architectures and Vector Multiplications
      2.1.6 Research Examples of MAC Architectures and Vector Multiplications
3. LP-MAC Algorithm & Implementation
   3.1 Algorithm General Idea Introduction
      3.1.1 Problem Description
      3.1.2 Objective
      3.1.3 Binary Multiplication Example
      3.1.4 Algorithm for Binary Multiplication with Bitsnum Bits
      3.1.5 Binary Multiplication of Vectors Example
      3.1.6 Binary Multiplication of One Vector with Two Vectors Example
      3.1.7 Step-by-Step Multiplication
      3.1.8 Algorithm for Binary Vector Multiplication with N Elements
      3.1.9 Optimization Opportunities in Binary Vector Multiplication
   3.2 Low Power MAC Novel Algorithm
      3.2.1 LP-MAC Algorithm Overview
      3.2.2 LP-MAC Algorithm Example
      3.2.3 Detailed Example for the LP-MAC Algorithm
   3.3 Applications in Neural Networks
      3.3.1 Problem Description
      3.3.2 Steps to Optimize Multiplications Using Partial Products
      3.3.3 Example
      3.3.4 General Algorithm in Pseudocode for High Vector Length
   3.4 LP-MAC Vector Multiplication Generic Implementation
      3.4.1 Special Case if Vector Inputs Are Signed
   3.5 Algorithm Working Conditions & Limitations
      3.5.1 Power Cost of Vector Multiplications in Normal MAC and LP-MAC
      3.5.2 Summary
4. Architecture for Common Neural Networks Using LP-MAC
   4.1 Low Power MAC Deployment in Fully Connected Layers
   4.2 Math Operations Model for the CONV Layer
   4.3 Convolutional Layer with LP-MAC
   4.4 Low Power MAC Deployment in Attention Networks
5. Low Power MAC Co-processor for Embedded Devices with AHB Interface
   5.1 DNN AHB Co-processors for Embedded Systems
   5.2 LP-MAC Co-processor with Tunable Activation Function
      5.2.1 LP-MAC Array with AHB Interface
      5.2.2 Tunable Activation Function
      5.2.3 Tunable ReLU Function
      5.2.4 Tunable ReLU Implementation
      5.2.5 Novel Fixed-Point Implementation for Power Function
6. Power Comparison & Verification Results
   6.1 Example Implementation and Simulation of an Eyeriss CNN, Both with the Inclusion of Low Power MAC and Without It
   6.2 Simulations and Verification Results
7. Future Work and Conclusions
   7.1 Working Conditions and Limitations
   7.2 Future Work
   7.3 Conclusion
References
Appendices:
A. RTL Verilog Code for LP-MAC
B. Synthesis Scripts

Abstract:
Deep learning has gained importance in multiple fields, including robotics, speech recognition, and image processing. Nevertheless, deploying deep learning models on edge and embedded devices with limited power and area resources can be difficult due to their high computational demands. This thesis introduces a new power-optimized approach for deploying deep learning networks on edge devices, named LP-MAC (Low Power Multiply-Accumulate). LP-MAC is a low-power technique that targets fixed-point numbers and is optimized for reusing vectors of inputs across Multiply-Accumulate (MAC) operations. Unlike conventional MAC units, which rely on multipliers, LP-MAC uses only adders, shifters, and multiplexers, which consume less power; this yields a power reduction of over 30%. LP-MAC also offers efficient dynamic precision control of MAC operations, which enables low latency and real-time performance. These characteristics make LP-MAC a highly suitable option for deploying deep learning architectures on edge devices with limited power resources, especially for models such as Convolutional Neural Networks (CNNs), fully connected networks, and networks that use Transformer-based attention mechanisms. The dissertation outlines the hardware realizations for these types of networks and proposes a systematic process for transitioning existing networks from traditional MAC units to LP-MAC. Furthermore, the thesis details the design of an AHB-slave co-processor equipped with an LP-MAC array, specifically tailored for embedded system applications.
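This record does not reproduce the LP-MAC algorithm itself (it is developed in Chapter 3 and given as RTL Verilog in Appendix A of the thesis), but the principle the abstract names, replacing the hardware multiplier with adders, shifters, and multiplexers, can be illustrated with a generic shift-and-add sketch. The C code below is a minimal sketch under that assumption, not the thesis's LP-MAC: the function name shift_add_mac, the 8-bit operand width, and the prec parameter that drops low-order partial products are all illustrative choices.

#include <stdint.h>
#include <stdio.h>

/* Generic shift-and-add multiply-accumulate (illustrative only;
 * NOT the thesis's LP-MAC algorithm). Each set bit of the weight w
 * selects, multiplexer-style, a shifted copy of x into the running
 * sum, so only adders and shifters are needed -- no multiplier.
 * prec (1..8) keeps only the top prec bits of w, a toy stand-in
 * for the dynamic precision control the abstract mentions. */
static int32_t shift_add_mac(int32_t acc, uint8_t w, uint8_t x, int prec)
{
    for (int bit = 7; bit >= 8 - prec; bit--) {
        if (w & (1u << bit))              /* mux: include this partial product? */
            acc += (int32_t)x << bit;     /* shifter feeding an adder */
    }
    return acc;
}

int main(void)
{
    /* Small dot product, the core operation of a MAC array. */
    uint8_t w[4] = {200, 150, 100, 50};
    uint8_t x[4] = {10, 20, 30, 40};
    int32_t exact = 0, approx = 0;
    for (int i = 0; i < 4; i++) {
        exact  = shift_add_mac(exact,  w[i], x[i], 8);  /* all 8 weight bits */
        approx = shift_add_mac(approx, w[i], x[i], 6);  /* top 6 weight bits */
    }
    printf("exact = %d, reduced precision = %d\n", exact, approx);  /* 10000 vs 9880 */
    return 0;
}

With prec = 8 the accumulation is exact; lowering prec skips the low-order partial products, trading a small numerical error for fewer add operations, one generic way a fixed-point design can exchange precision for power.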
Keywords— Artificial intelligence, Hardware accelerators, Machine learning, Deep learning, Neural networks, High performance computing, FPGA, ASIC, GPU, Edge computing, Cloud computing, Energy efficiency, Performance optimization, System architecture, Parallel processing, Training and inference, Model compression, Quantization, Sparsity, Pruning, Multiply Accumulate

Text in English, abstracts in English and Arabic
