| Tag | Ind. | Subfield data |
|---|---|---|
| 000 |  | 08157nam a22002657a 4500 |
| 008 |  | 201210s2024 a\|\|\|f bm\|\| 00\| 0 eng d |
| 024 | 7 | _a0009-0001-2561-9884 _2ORCID |
| 040 |  | _aEG-CaNU _cEG-CaNU |
| 041 | 0 | _aeng _beng _bara |
| 082 |  | _a621 |
| 100 | 0 | _aMahmoud Kamel Ismail Hasabrabou _93564 |
| 245 | 1 | _aNovel Edge AI with Power-Efficient Re-configurable MAC Processing Elements _c/Mahmoud Kamel Ismail Hasabrabou |
| 260 |  | _c2024 |
| 300 |  | _a80p. _bill. _c21 cm. |
| 500 |  | _3Supervisor: Prof. Dr. Ahmed H. Madian |
| 502 |  | _aThesis (M.A.)--Nile University, Egypt, 2024. |
| 504 |  | _a"Includes bibliographical references" |
| 505 | 0 | _aContents: Acknowledgments -- List of Figures -- List of Tables -- List of Abbreviations -- Abstract -- 1. Introduction (1.1 Introduction; 1.2 Thesis Objectives; 1.3 Research Organization) -- 2. Literature Survey (2.1 Background and Related Work Survey; 2.1.1 Common Strategies to Decrease Power in AI Edge devices; 2.1.2 Neural Networks architecture optimizing techniques; 2.1.3 Lightweight Networks; 2.1.4 Neural network with reduced precision (int8); 2.1.5 State-of-the-Art MAC Architectures and Vector Multiplications; 2.1.6 Research examples of MAC architectures and Vector Multiplications) -- 3. LP-MAC Algorithm & Implementation (3.1 Algorithm General Idea Introduction; 3.1.1 Problem Description; 3.1.2 Objective; 3.1.3 Binary Multiplication Example; 3.1.4 Algorithm for Binary Multiplication with Bitsnum Bits; 3.1.5 Binary Multiplication of Vectors Example; 3.1.6 Binary Multiplication of One Vector with Two Vectors Example; 3.1.7 Step-by-Step Multiplication; 3.1.8 Algorithm for Binary Vector Multiplication with N Elements; 3.1.9 Optimization opportunities in Binary Vector Multiplication; 3.2 Low Power MAC Novel Algorithm; 3.2.1 LP-MAC Algorithm Overview; 3.2.2 LP-MAC algorithm example; 3.2.3 Detailed example for LP-MAC algorithm; 3.3 Applications in Neural Networks; 3.3.1 Problem Description; 3.3.2 Steps to Optimize Multiplications Using Partial Products; 3.3.3 Example; 3.3.4 General Algorithm in Pseudocode for high vector length; 3.4 LP-MAC Vector Multiplication Generic Implementation; 3.4.1 Special Case if Vector Inputs are signed; 3.5 Algorithm Working Conditions & Limitations; 3.5.1 Power Cost of Vector Multiplications in Normal MAC and LP-MAC; 3.5.2 Summary) -- 4. Architecture for common Neural networks using LP-MAC (4.1 Low Power MAC Deployment in Fully Connected Layers; 4.2 Math operations model for CONV Layer; 4.3 Convolutional Layer with LP-MAC; 4.4 Low Power MAC Deployment in Attention Networks) -- 5. Low Power MAC Co-processor for Embedded Devices with AHB interface (5.1 DNN AHB Co-processors for Embedded Systems; 5.2 LP-MAC Co-processor with Tunable Activation Function; 5.2.1 LP-MAC Array with AHB Interface; 5.2.2 Tunable Activation Function; 5.2.3 Tunable ReLU Function; 5.2.4 Tunable ReLU Implementation; 5.2.5 Novel Fixed-Point Implementation for Power Function) -- 6. Power Comparison & Verification Results (6.1 Example Implementation and Simulation of an Eyeriss CNN, Both with the Inclusion of Low Power MAC and Without It; 6.2 Simulations and Verification Results) -- 7. Future Work and Conclusions (7.1 Working Conditions and Limitations; 7.2 Future Work; 7.3 Conclusion) -- References -- Appendices: A. RTL Verilog Code for LP-MAC; B. Synthesis scripts |
| 520 | 3 | _aAbstract: Deep learning has gained importance in multiple fields, including robotics, speech recognition, and image processing. Nevertheless, deploying deep learning models on edge and embedded devices with limited power and area resources can be difficult due to their high computational demands. This thesis introduces a new power-optimized approach for deploying deep learning networks on edge devices, named LP-MAC (Low Power Multiply Accumulate). LP-MAC is a low-power technique that targets fixed-point numbers and is optimized for reusing input vectors across Multiply-Accumulate (MAC) operations. Unlike conventional MAC units that use multipliers, LP-MAC uses only adders, shifters, and multiplexers, which consume less power; this results in over 30% power reduction. LP-MAC also offers efficient dynamic precision control of MAC operations, which yields low latency and real-time performance. These characteristics make LP-MAC a highly suitable option for deploying deep learning architectures on edge devices with limited power resources, especially for models such as Convolutional Neural Networks (CNNs), fully connected networks, and networks that use Transformer-based attention mechanisms. The dissertation outlines the hardware realizations for these types of networks and proposes a systematic process for transitioning existing networks from traditional MAC units to LP-MAC. Furthermore, the thesis details the design of an AHB-slave co-processor equipped with an LP-MAC array, specifically tailored for embedded system applications. Keywords: Artificial intelligence, Hardware accelerators, Machine learning, Deep learning, Neural networks, High performance computing, FPGA, ASIC, GPU, Edge computing, Cloud computing, Energy efficiency, Performance optimization, System architecture, Parallel processing, Training and inference, Model compression, Quantization, Sparsity, Pruning, Multiply Accumulate |
| 546 |  | _aText in English, abstracts in English and Arabic |
| 650 | 4 | _aMSD _9317 |
| 655 | 7 | _2NULIB _aDissertation, Academic _9187 |
| 690 |  | _aMSD _9317 |
| 942 |  | _2ddc _cTH |
| 999 |  | _c10892 _d10892 |
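
The abstract in field 520 describes LP-MAC as a multiplier-free multiply-accumulate unit built only from adders, shifters, and multiplexers for fixed-point operands. The record does not reproduce the algorithm itself, so the following is only a minimal Python sketch of the general shift-and-add idea behind multiplier-free MAC units, not the thesis's LP-MAC implementation; the names `shift_add_multiply` and `shift_add_mac` and the 8-bit unsigned word width are illustrative assumptions.

```python
# Illustrative sketch only: a multiplier-free multiply-accumulate (MAC) built from
# shifts, adds, and a per-bit select, in the spirit of what the abstract attributes
# to LP-MAC. This is NOT the thesis's LP-MAC algorithm; the function names and the
# 8-bit unsigned word width are assumptions made for this example.

BITS = 8  # assumed fixed-point word width (e.g., int8-style operands)


def shift_add_multiply(a: int, b: int, bits: int = BITS) -> int:
    """Multiply two unsigned fixed-point integers using only shifts and adds."""
    product = 0
    for i in range(bits):
        if (b >> i) & 1:          # like a multiplexer: take this partial product or skip it
            product += a << i     # shifter + adder: accumulate the shifted operand
    return product


def shift_add_mac(weights, activations, bits: int = BITS) -> int:
    """Accumulate the dot product of two vectors without a hardware multiplier."""
    acc = 0
    for w, x in zip(weights, activations):
        acc += shift_add_multiply(w, x, bits)
    return acc


if __name__ == "__main__":
    # Small example with assumed 8-bit unsigned operands:
    # 3*10 + 5*20 + 7*30 = 340
    print(shift_add_mac([3, 5, 7], [10, 20, 30]))
```

In hardware terms, each iteration of the inner loop corresponds to a multiplexer selecting either zero or the shifted operand and an adder accumulating it, which matches the building-block set (adders, shifters, multiplexers) that the abstract attributes to LP-MAC.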