Speech emotion recognition system/

Mai Mohamed Magdy Abd ElSalam El Seknedy

Speech emotion recognition system/ Mai Mohamed Magdy Abd ElSalam El Seknedy - 2022 - 135 p. ill. 21 cm.

Supervisor:
Sahar Ali Fawzi

Thesis (M.A.)—Nile University, Egypt, 2022.

"Includes bibliographical references"

Contents:
TABLE OF CONTENTS
Dedication
Acknowledgments
List of Tables
List of Figures
List of Abbreviations
Abstract
Chapter 1: Introduction
1.1 Motivation
1.2 Problem Definition
1.3 Research Objective
1.4 Research Structure
Chapter 2: Background and Literature review
2.1 Chapter overview
2.2 Emotions
2.2.1 Discrete emotional model
2.2.2 Continuous emotional model
2.3 Features
2.3.1 Acoustic feature types
2.3.2 Feature selection techniques
2.3.3 Feature Normalization
2.4 Datasets
2.5 Model classification
2.6 Literature Review
Chapter 3: Materials and Methods
3.1 SER System Architecture
3.2 Datasets
3.2.1 Datasets overview
3.2.2 Arabic Survey
3.3 Feature sets
3.3.1 Feature Scaling (Data preprocessing)
3.3.2 Feature Importance
3.4 Classifiers
3.4.1 Model’s hyper-parameters
3.4.2 Evaluation metrics
3.5 Experimentation Tools
Chapter 4: Results and Discussions
4.1 Single corpus SER
4.1.1 Arabic corpus SER
4.1.1.1 Arabic corpus survey
4.1.1.2 Arabic corpus SER Results
4.1.2 Urdu corpus SER
4.1.3 English corpus SER
4.1.4 German corpus SER
4.1.5 French corpus SER
4.1.6 Baseline single corpus SER
4.2 Cross corpus SER
4.2.1 Latin based cross-corpus SER
4.2.2 Arabic vs Urdu cross-corpus SER
4.2.3 Cross-corpus SER including five languages
4.3 SER computational performance evaluation
4.3.1 Classifiers computational performance
4.3.2 Feature sets computational performance
Chapter 5: Conclusion and Future work
References
Appendix A: Publications
Appendix B: Datasets samples
Appendix C: Experimental results – Extra Diagrams

Abstract:
Speech Emotion Recognition (SER) is now considered one of the most important applications of human-computer interaction. It creates a new means of communication between humans and machines by interpreting the speech signal and extracting its emotional content. SER systems have proved to play a crucial part in everyday applications such as call centers, e-learning, medical therapies such as physiological diseases analysis, and autonomous driver emotion detection. Despite the great evolution of technology and the wide scope of research in this area, there is still a gap between SER research applications and those needed in everyday life. Most research focuses on new methodologies such as deep learning models or new feature extraction techniques, without giving enough attention to the system's computational performance on real-time data, which is what makes a system suitable for commercial applications. Furthermore, a question frequently raised in the research remains open: “What is the best speech feature set for achieving the best SER performance?” So far, no precise generic feature set has been identified. In addition, most existing research focuses on SER performance in a single-corpus domain, where the model is trained and tested on the same language. Cross-corpus SER, on the other hand, is still an ongoing challenge, as few studies have addressed it. The main motivation of this work is to investigate the best feature set, one with high performance and low computational cost compared to the benchmark “Interspeech 2009 – 2010” feature sets. Two new feature sets were developed from a combination of spectral and prosodic features and tested in a cross-corpus domain, where they outperformed other feature sets evaluated on the same datasets.
The proposed SER system was successfully evaluated on five datasets in five different languages (English, German, French, Arabic and Urdu): RAVDESS, EMO-DB, CaFE, EYASE and the Urdu dataset, respectively.
Furthermore, this research introduced the Arabic language into SER systems, given its scarcity in the research domain. Its performance in the cross-corpus domain with Latin-based and Urdu languages was very promising. This research studied the performance of the SER system using different models: Multi-Layer Perceptron, Support Vector Machine, Random Forest, Logistic Regression and Ensemble Learning using majority voting. Results were analyzed and the most suitable classifier for each language was identified. Compared to previous work, performance improvements of 16% in Urdu, 6.25% in English, 9.36% in German and 13.42% in French SER systems were achieved. Furthermore, featureset-2 showed very promising results compared to the benchmark Interspeech feature sets in both recognition rate and computational time. Cross-corpus SER showed results close to the baseline single-corpus SER; in Arabic, an enhancement of 2.73% was achieved. Improvements over previous work were also achieved for Urdu and Arabic in the cross-corpus domain.


Text in English, abstracts in English.


Informatics-IFM


Dissertation, Academic

610