| Tag | Ind. | Subfield data |
|---|---|---|
| 000 |  | 08157nam a22002657a 4500 |
| 008 |  | 201210s2024 a\|\|\|f bm\|\| 00\| 0 eng d |
| 024 | 7 | _a0009-0001-2561-9884 _2ORCID |
| 040 |  | _aEG-CaNU _cEG-CaNU |
| 041 | 0 | _aeng _beng _bara |
| 082 |  | _a621 |
| 100 | 0 | _aMahmoud Kamel Ismail Hasabrabou _93564 |
| 245 | 1 | _aNovel Edge AI with Power-Efficient Re-configurable MAC Processing Elements _c/Mahmoud Kamel Ismail Hasabrabou |
| 260 |  | _c2024 |
| 300 |  | _a80p. _bill. _c21 cm. |
| 500 |  | _3Supervisor: Prof. Dr. Ahmed H. Madian |
| 502 |  | _aThesis (M.A.)--Nile University, Egypt, 2024. |
| 504 |  | _a"Includes bibliographical references" |
| 505 | 0 | _aContents: Acknowledgments -- List of Figures -- List of Tables -- List of Abbreviations -- Abstract -- 1. Introduction (1.1 Introduction; 1.2 Thesis Objectives; 1.3 Research Organization) -- 2. Literature Survey (2.1 Background and Related Work Survey; 2.1.1 Common Strategies to Decrease Power in AI Edge devices; 2.1.2 Neural Networks architecture optimizing techniques; 2.1.3 Lightweight Networks; 2.1.4 Neural network with reduced precision (int8); 2.1.5 State-of-the-Art MAC Architectures and Vector Multiplications; 2.1.6 Research examples of MAC architectures and Vector Multiplications) -- 3. LP-MAC Algorithm & Implementation (3.1 Algorithm General Idea Introduction; 3.1.1 Problem Description; 3.1.2 Objective; 3.1.3 Binary Multiplication Example; 3.1.4 Algorithm for Binary Multiplication with Bitsnum Bits; 3.1.5 Binary Multiplication of Vectors Example; 3.1.6 Binary Multiplication of One Vector with Two Vectors Example; 3.1.7 Step-by-Step Multiplication; 3.1.8 Algorithm for Binary Vector Multiplication with N Elements; 3.1.9 Optimization opportunities in Binary Vector Multiplication; 3.2 Low Power MAC Novel Algorithm; 3.2.1 LP-MAC Algorithm Overview; 3.2.2 LP-MAC algorithm example; 3.2.3 Detailed example for LP-MAC algorithm; 3.3 Applications in Neural Networks; 3.3.1 Problem Description; 3.3.2 Steps to Optimize Multiplications Using Partial Products; 3.3.3 Example; 3.3.4 General Algorithm in Pseudocode for high vector length; 3.4 LP-MAC Vector Multiplication Generic Implementation; 3.4.1 Special Case if Vector Inputs are signed; 3.5 Algorithm Working Conditions & Limitations; 3.5.1 Power Cost of Vector Multiplications in Normal MAC and LP-MAC; 3.5.2 Summary) -- 4. Architecture for common Neural networks using LP-MAC (4.1 Low Power MAC Deployment in Fully Connected Layers; 4.2 Math operations model for CONV Layer; 4.3 Convolutional Layer with LP-MAC; 4.4 Low Power MAC Deployment in Attention Networks) -- 5. Low Power MAC Co-processor for Embedded Devices with AHB interface (5.1 DNN AHB Co-processors for Embedded Systems; 5.2 LP-MAC Co-processor with Tunable Activation Function; 5.2.1 LP-MAC Array with AHB Interface; 5.2.2 Tunable Activation Function; 5.2.3 Tunable ReLU Function; 5.2.4 Tunable ReLU Implementation; 5.2.5 Novel Fixed-Point Implementation for Power Function) -- 6. Power Comparison & Verification Results (6.1 Example Implementation and Simulation of an Eyeriss CNN, Both with the Inclusion of Low Power MAC and Without It; 6.2 Simulations and Verification Results) -- 7. Future Work and Conclusions (7.1 Working Conditions and Limitations; 7.2 Future Work; 7.3 Conclusion) -- References -- Appendices: A. RTL Verilog Code for LP-MAC; B. Synthesis scripts |
| 520 | 3 | _aAbstract: Deep learning has gained importance in multiple fields, including robotics, speech recognition, and image processing. Nevertheless, deploying deep learning models on edge and embedded devices with limited power and area resources can be difficult due to their high computational demands. This thesis introduces a new power-optimized approach for deploying deep learning networks on edge devices, named LP-MAC (Low Power Multiply Accumulate). LP-MAC is a low-power technique that targets fixed-point numbers and is optimized for reusing input vectors across Multiply-Accumulate (MAC) operations. Unlike conventional MAC units that use multipliers, LP-MAC uses only adders, shifters, and multiplexers, which consume less power; this results in over 30% power reduction. LP-MAC also offers efficient dynamic precision control of MAC operations, which yields low latency and real-time performance. These characteristics make LP-MAC a highly suitable option for deploying deep learning architectures on edge devices with limited power resources, especially for models such as Convolutional Neural Networks (CNNs), fully connected networks, and networks that use Transformer-based attention mechanisms. The dissertation outlines the hardware realizations for these types of networks and proposes a systematic process for transitioning existing networks from traditional MAC units to LP-MAC. Furthermore, the thesis details the design of an AHB-slave co-processor equipped with an LP-MAC array, specifically tailored for embedded system applications. Keywords: Artificial intelligence, Hardware accelerators, Machine learning, Deep learning, Neural networks, High performance computing, FPGA, ASIC, GPU, Edge computing, Cloud computing, Energy efficiency, Performance optimization, System architecture, Parallel processing, Training and inference, Model compression, Quantization, Sparsity, Pruning, Multiply Accumulate |
| 546 |  | _aText in English, abstracts in English and Arabic |
| 650 | 4 | _aMSD _9317 |
| 655 | 7 | _2NULIB _aDissertation, Academic _9187 |
| 690 |  | _aMSD _9317 |
| 942 |  | _2ddc _cTH |
| 999 |  | _c10892 _d10892 |
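
The abstract in field 520 describes LP-MAC as a multiplier-free multiply-accumulate unit built only from adders, shifters, and multiplexers for fixed-point operands. The record does not reproduce the algorithm itself, so the following is only a minimal Python sketch of the general shift-and-add idea behind multiplier-free MAC units, not the thesis's LP-MAC implementation; the names `shift_add_multiply` and `shift_add_mac` and the 8-bit unsigned word width are illustrative assumptions.

```python
# Illustrative sketch only: a multiplier-free multiply-accumulate (MAC) built from
# shifts, adds, and a per-bit select, in the spirit of what the abstract attributes
# to LP-MAC. This is NOT the thesis's LP-MAC algorithm; the function names and the
# 8-bit unsigned word width are assumptions made for this example.

BITS = 8  # assumed fixed-point word width (e.g., int8-style operands)


def shift_add_multiply(a: int, b: int, bits: int = BITS) -> int:
    """Multiply two unsigned fixed-point integers using only shifts and adds."""
    product = 0
    for i in range(bits):
        if (b >> i) & 1:          # like a multiplexer: take this partial product or skip it
            product += a << i     # shifter + adder: accumulate the shifted operand
    return product


def shift_add_mac(weights, activations, bits: int = BITS) -> int:
    """Accumulate the dot product of two vectors without a hardware multiplier."""
    acc = 0
    for w, x in zip(weights, activations):
        acc += shift_add_multiply(w, x, bits)
    return acc


if __name__ == "__main__":
    # Small example with assumed 8-bit unsigned operands:
    # 3*10 + 5*20 + 7*30 = 340
    print(shift_add_mac([3, 5, 7], [10, 20, 30]))
```

In hardware terms, each iteration of the inner loop corresponds to a multiplexer selecting either zero or the shifted operand and an adder accumulating it, which matches the building-block set (adders, shifters, multiplexers) that the abstract attributes to LP-MAC.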