A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism

Eva Lieskovská; Maroš Jakubec; Roman Jarina; Michal Chmulík

doi:10.3390/electronics10101163

A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism

Electronics ◽

10.3390/electronics10101163 ◽

2021 ◽

Vol 10 (10) ◽

pp. 1163

Author(s):

Eva Lieskovská ◽

Maroš Jakubec ◽

Roman Jarina ◽

Michal Chmulík

Keyword(s):

Emotion Recognition ◽

User Satisfaction ◽

Deep Neural Networks ◽

Attention Mechanism ◽

Speech Emotion Recognition ◽

Multimedia Content ◽

Human Interactions ◽

Benchmark Database ◽

The Impact ◽

Significant Factors

Emotions are an integral part of human interactions and are significant factors in determining user satisfaction or customer opinion. speech emotion recognition (SER) modules also play an important role in the development of human–computer interaction (HCI) applications. A tremendous number of SER systems have been developed over the last decades. Attention-based deep neural networks (DNNs) have been shown as suitable tools for mining information that is unevenly time distributed in multimedia content. The attention mechanism has been recently incorporated in DNN architectures to emphasise also emotional salient information. This paper provides a review of the recent development in SER and also examines the impact of various attention mechanisms on SER performance. Overall comparison of the system accuracies is performed on a widely used IEMOCAP benchmark database.

Download Full-text

The Impact of Attention Mechanisms on Speech Emotion Recognition

Sensors ◽

10.3390/s21227530 ◽

2021 ◽

Vol 21 (22) ◽

pp. 7530

Author(s):

Shouyan Chen ◽

Mingyan Zhang ◽

Xiaofen Yang ◽

Zhijia Zhao ◽

Tao Zou ◽

...

Keyword(s):

Emotion Recognition ◽

Attention Mechanism ◽

Speech Emotion Recognition ◽

Human Machine Interaction ◽

Attention Model ◽

Result Show ◽

Real Time Applications ◽

The Difference ◽

The Impact ◽

Machine Interaction

Speech emotion recognition (SER) plays an important role in real-time applications of human-machine interaction. The Attention Mechanism is widely used to improve the performance of SER. However, the applicable rules of attention mechanism are not deeply discussed. This paper discussed the difference between Global-Attention and Self-Attention and explored their applicable rules to SER classification construction. The experimental results show that the Global-Attention can improve the accuracy of the sequential model, while the Self-Attention can improve the accuracy of the parallel model when conducting the model with the CNN and the LSTM. With this knowledge, a classifier (CNN-LSTM×2+Global-Attention model) for SER is proposed. The experiments result show that it could achieve an accuracy of 85.427% on the EMO-DB dataset.

Download Full-text

Speech Emotion Recognition Using Deep Neural Networks on Multilingual Databases

Advances in Robotics, Automation and Data Analytics - Advances in Intelligent Systems and Computing ◽

10.1007/978-3-030-70917-4_3 ◽

2021 ◽

pp. 21-30

Author(s):

Syed Asif Ahmad Qadri ◽

Teddy Surya Gunawan ◽

Taiba Majid Wani ◽

Eliathamby Ambikairajah ◽

Mira Kartiwi ◽

...

Keyword(s):

Neural Networks ◽

Emotion Recognition ◽

Deep Neural Networks ◽

Speech Emotion Recognition

Download Full-text

Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment

Sensors ◽

10.3390/s20082297 ◽

2020 ◽

Vol 20 (8) ◽

pp. 2297

Author(s):

Zhen-Tao Liu ◽

Bao-Han Wu ◽

Dan-Yun Li ◽

Peng Xiao ◽

Jun-Wei Mao

Keyword(s):

Emotion Recognition ◽

Feature Selection Method ◽

Sampling Technique ◽

Small Sample ◽

Speech Emotion Recognition ◽

Gradient Boosting ◽

Data Imbalance ◽

The Arts ◽

The Impact ◽

Sample Environment

Speech emotion recognition often encounters the problems of data imbalance and redundant features in different application scenarios. Researchers usually design different recognition models for different sample conditions. In this study, a speech emotion recognition model for a small sample environment is proposed. A data imbalance processing method based on selective interpolation synthetic minority over-sampling technique (SISMOTE) is proposed to reduce the impact of sample imbalance on emotion recognition results. In addition, feature selection method based on variance analysis and gradient boosting decision tree (GBDT) is introduced, which can exclude the redundant features that possess poor emotional representation. Results of experiments of speech emotion recognition on three databases (i.e., CASIA, Emo-DB, SAVEE) show that our method obtains average recognition accuracy of 90.28% (CASIA), 75.00% (SAVEE) and 85.82% (Emo-DB) for speaker-dependent speech emotion recognition which is superior to some state-of-the-arts works.

Download Full-text

Combining Gated Convolutional Networks and Self-Attention Mechanism for Speech Emotion Recognition

2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) ◽

10.1109/aciiw.2019.8925283 ◽

2019 ◽

Author(s):

Chao Li ◽

Jinlong Jiao ◽

Yiqin Zhao ◽

Ziping Zhao

Keyword(s):

Emotion Recognition ◽

Attention Mechanism ◽

Speech Emotion Recognition ◽

Convolutional Networks

Download Full-text

CNN-based Speech Emotion Recognition Model Applying Transfer Learning and Attention Mechanism

Journal of KIISE ◽

10.5626/jok.2020.47.7.665 ◽

2020 ◽

Vol 47 (7) ◽

pp. 665-673

Author(s):

Jung Hyun Lee ◽

Ui Nyoung Yoon ◽

Geun-Sik Jo

Keyword(s):

Emotion Recognition ◽

Transfer Learning ◽

Attention Mechanism ◽

Speech Emotion Recognition ◽

Recognition Model

Download Full-text

Convolutional Recurrent Neural Networks Based Speech Emotion Recognition

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9321 ◽

2020 ◽

Vol 17 (8) ◽

pp. 3786-3789

Author(s):

P. Gayathri ◽

P. Gowri Priya ◽

L. Sravani ◽

Sandra Johnson ◽

Visanth Sampath

Keyword(s):

Neural Networks ◽

Emotion Recognition ◽

Recurrent Neural Networks ◽

Machine Learning Techniques ◽

Speech Emotion Recognition ◽

Emotional Information ◽

Feature Representations ◽

Emotional Factors ◽

Learning Techniques ◽

The Impact

Recognition of emotions is the aspect of speech recognition that is gaining more attention and the need for it is growing enormously. Although there are methods to identify emotion using machine learning techniques, we assume in this paper that calculating deltas and delta-deltas for customized features not only preserves effective emotional information, but also that the impact of irrelevant emotional factors, leading to a reduction in misclassification. Furthermore, Speech Emotion Recognition (SER) often suffers from the silent frames and irrelevant emotional frames. Meanwhile, the process of attention has demonstrated exceptional performance in learning related feature representations for specific tasks. Inspired by this, propose a Convolutionary Recurrent Neural Networks (ACRNN) based on Attention to learn discriminative features for SER, where the Mel-spectrogram with deltas and delta-deltas is used as input. Finally, experimental results show the feasibility of the proposed method and attain state-of-the-art performance in terms of unweighted average recall.

Download Full-text

Towards real-time Speech Emotion Recognition using deep neural networks

2015 9th International Conference on Signal Processing and Communication Systems (ICSPCS) ◽

10.1109/icspcs.2015.7391796 ◽

2015 ◽

Cited By ~ 18

Author(s):

H.M. Fayek ◽

M. Lech ◽

L. Cavedon

Keyword(s):

Neural Networks ◽

Emotion Recognition ◽

Real Time ◽

Deep Neural Networks ◽

Speech Emotion Recognition

Download Full-text

Speech emotion recognition on mobile devices based on modulation spectral feature pooling and deep neural networks

2017 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT) ◽

10.1109/isspit.2017.8388669 ◽

2017 ◽

Cited By ~ 2

Author(s):

Anderson R. Avila ◽

Joao Monteiro ◽

Douglas O'Shaughneussy ◽

Tiago H. Falk

Keyword(s):

Neural Networks ◽

Emotion Recognition ◽

Mobile Devices ◽

Deep Neural Networks ◽

Spectral Feature ◽

Speech Emotion Recognition ◽

Feature Pooling

Download Full-text

An Analysis of the Impact of Spectral Contrast Feature in Speech Emotion Recognition

International Journal of Recent Contributions from Engineering Science & IT (iJES) ◽

10.3991/ijes.v9i2.22983 ◽

2021 ◽

Vol 9 (2) ◽

pp. 87

Author(s):

Shreya Kumar ◽

Swarnalaxmi Thiruvenkadam

Keyword(s):

Feature Extraction ◽

Emotion Recognition ◽

Prediction Accuracy ◽

Speech Emotion Recognition ◽

German Language ◽

Spectral Contrast ◽

Recognition Systems ◽

The Impact ◽

Contrast Feature

Feature extraction is an integral part in speech emotion recognition. Some emotions become indistinguishable from others due to high resemblance in their features, which results in low prediction accuracy. This paper analyses the impact of spectral contrast feature in increasing the accuracy for such emotions. The RAVDESS dataset has been chosen for this study. The SAVEE dataset, CREMA-D dataset and JL corpus dataset were also used to test its performance over different English accents. In addition to that, EmoDB dataset has been used to study its performance in the German language. The use of spectral contrast feature has increased the prediction accuracy in speech emotion recognition systems to a good degree as it performs well in distinguishing emotions with significant differences in arousal levels, and it has been discussed in detail.<div> </div>

Download Full-text

Sparse Autoencoder with Attention Mechanism for Speech Emotion Recognition

2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) ◽

10.1109/aicas.2019.8771593 ◽

2019 ◽

Author(s):

Ting-Wei Sun ◽

An-Yeu Andy Wu

Keyword(s):

Emotion Recognition ◽

Attention Mechanism ◽

Speech Emotion Recognition ◽

Sparse Autoencoder

Download Full-text