UniTextFusion: A Unified Early Fusion Framework for Arabic Multimodal Sentiment Analysis with LLMs / Salma Khaled Ali Mohamed Ali
Material type: Text
Language: English
Summary language: English, Arabic
Publication details: 2025
Description: 86 p. : ill. ; 21 cm
DDC classification: 610
| Item type | Current library | Call number | Status | Date due | Barcode |
|---|---|---|---|---|---|
| Thesis | Main library | 610/S.A.U/2025 | Not for loan | | |
Supervisor:
Dr. Walaa Medhat
Dr. Ensaf Hussein Mohamed
Thesis (M.A.)—Nile University, Egypt, 2025.
"Includes bibliographical references"
Contents:
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XI
List of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XII
Chapters:
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Challenges in Arabic Multimodal Sentiment Analysis . . . . . . . . . . . . . . . . . 2
1.3 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Background and Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Background and Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Multi-modal Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Multi-modal Data Fusion Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Feature Level (early-stage) Fusion Technique . . . . . . . . . . . . . . . . . 6
2.2.2 Model Level (mid-stage) Fusion Technique . . . . . . . . . . . . . . . . . . . 6
2.2.3 Decision Level (late-stage) Fusion Technique . . . . . . . . . . . . . . . . . . 7
2.2.4 Comparison of Fusion Techniques . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 MuSA Techniques and Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Traditional Machine Learning Approaches . . . . . . . . . . . . . . . . . . . 9
2.3.2 Deep Learning Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Large Language Model (LLM)-based Generative Approaches . . . . . . . . . 10
3. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Existing Multimodal Datasets for Sentiment Analysis . . . . . . . . . . . . . . . . . 12
3.1.1 Comparison of English and Arabic Datasets . . . . . . . . . . . . . . . . . . 12
3.1.2 Gaps in Existing Multimodal Datasets . . . . . . . . . . . . . . . . . . . . . 14
3.2 Approaches to Multimodal Sentiment Analysis . . . . . . . . . . . . . . . . . . . . 15
3.3 Research Gaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4. Arabic Multimodal Sentiment Analysis Methodology . . . . . . . . . . . . . . . . . . . . 20
4.1 Ar-MUSA: Arabic Multimodal Sentiment Analysis Dataset . . . . . . . . . . . . . . 20
4.1.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.2 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.3 Data Labeling Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.4 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2 Multimodal Sentiment Analysis Fusion and Models . . . . . . . . . . . . . . . . . . 25
4.2.1 Pre-trained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.2 Generative LLM Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.3 UniText Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5. Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.1 Weighted Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.2 Weighted Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.1.3 Weighted F1-Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.1.4 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Pre-trained Models: Setup and Results . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 Text Based Transformer: MarBERT Model . . . . . . . . . . . . . . . . . . 47
5.2.2 Audio Based Transformer: Egyptian HuBERT . . . . . . . . . . . . . . . . . 48
5.2.3 Image Based Transformer: MobileNet V2 . . . . . . . . . . . . . . . . . . . 49
5.2.4 Multi-Modal Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 Generative LLM Models: Setup and Results . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Text Based LLM: Qwen2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.2 Audio Based LLM: Qwen2-Audio . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.3 Image Based LLM: Qwen2-VL . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.4 Multi-modal Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4 UniText Fusion Approach: Setup and Results . . . . . . . . . . . . . . . . . . . . . 54
5.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5 Comparative Analysis of UniText Fusion and State-of-the-Art Techniques . . . . . 57
6. Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Appendices:
A. Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Abstract:
Multimodal Sentiment Analysis (MuSA) combines text, audio, and visual inputs to detect and classify
emotions. Despite its growing relevance, Arabic MuSA research is limited by the lack of high-quality
annotated datasets and the complexity of Arabic language processing. This work presents Ar-MuSA,
an open-source Arabic MuSA dataset containing aligned text, audio, and visual data. Unlike existing
unimodal resources, Ar-MuSA supports sentiment analysis across multiple modalities. The dataset is
evaluated using MarBERT (text), HuBERT (audio), MobileNet (vision), Qwen2 (multimodal), and
ensemble methods. Results indicate improved performance through modality fusion: MarBERT achieved
a 71% F1-score for text-only classification, while the audio and image modalities performed lower
individually; fusing them with text improved performance from 39% to 67%, an absolute gain of
28 percentage points. To further improve results, the UniTextFusion framework is proposed. It performs
early fusion by converting audio and visual signals into text descriptions, which are combined with
transcripts and used as input to large language models (LLMs). Fine-tuning Arabic-compatible
LLMs, LLaMA 3.1-8B Instruct and SILMA AI 9B, using LoRA (Low-Rank Adaptation) yielded
F1-scores of 68% and 71%, surpassing their unimodal baselines of 34% and 41% by 34 and 30
percentage points, respectively.
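The early-fusion step the abstract describes, converting the non-text modalities into text descriptions and concatenating them with the transcript into a single LLM input, can be sketched as follows. This is a minimal illustration only: the function name, prompt template, and field labels are assumptions, not the thesis's actual format.

```python
def build_unitext_prompt(transcript: str,
                         audio_description: str,
                         visual_description: str) -> str:
    """Assemble one early-fusion text input from per-modality text.

    All labels and wording here are illustrative; the thesis's real
    prompt template is not reproduced in this catalog record.
    """
    return (
        "Classify the sentiment of the following utterance as "
        "positive, negative, or neutral.\n"
        f"Transcript: {transcript}\n"
        f"Audio description: {audio_description}\n"
        f"Visual description: {visual_description}\n"
        "Sentiment:"
    )

# Hypothetical example: the fused prompt would then be passed to a
# LoRA-fine-tuned LLM (e.g. LLaMA 3.1-8B Instruct) for classification.
prompt = build_unitext_prompt(
    "الفيلم كان رائع",                    # "The movie was great"
    "enthusiastic tone, raised pitch",
    "smiling speaker, bright lighting",
)
```

Because all three modalities are reduced to text before the model sees them, this is feature-level (early) fusion: a single text-only LLM handles the combined signal, avoiding separate per-modality encoders at inference time.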
Keywords:
Arabic Multimodal Sentiment Analysis, LoRA, Fine-tuning, Arabic MuSA Dataset, Multimodal
Generative LLMs, Fusion
Text in English, abstracts in English and Arabic