Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning

Author(s):  
Bagus Tris Atmaja ◽  
Masato Akagi

The majority of research in speech emotion recognition (SER) is conducted to recognize emotion categories. Recognizing dimensional emotion attributes is also important, however, and it has several advantages over categorical emotion. In this research, we investigate dimensional SER using both speech features and word embeddings. A concatenation network joins the acoustic and text networks built from these bimodal features. We demonstrate that these bimodal features, both extracted from speech, improve the performance of dimensional SER over unimodal SER using either acoustic features or word embeddings alone. The addition of word embeddings to the SER system contributes a significant improvement on the valence dimension, while the arousal and dominance dimensions are also improved. We propose a multitask learning (MTL) approach for the prediction of all emotional attributes. This MTL maximizes the concordance correlation between predicted emotion degrees and true emotion labels simultaneously. The findings suggest that MTL with two parameters represents the interrelation of emotional attributes better than the other evaluated methods. In unimodal results, speech features attain higher performance on arousal and dominance, while word embeddings are better for predicting valence. The overall evaluation uses the concordance correlation coefficient (CCC) score of the three emotional attributes. We also discuss some differences between categorical and dimensional emotion results from psychological and engineering perspectives.
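A minimal PyTorch sketch of the CCC-based multitask objective the abstract describes. The `alpha` and `beta` weights standing in for the paper's "two parameters", the attribute ordering, and all tensor names are illustrative assumptions, not the authors' implementation.

```python
import torch

def ccc(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """Concordance correlation coefficient between two 1-D tensors."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    pred_var, gold_var = pred.var(unbiased=False), gold.var(unbiased=False)
    covar = ((pred - pred_mean) * (gold - gold_mean)).mean()
    return 2 * covar / (pred_var + gold_var + (pred_mean - gold_mean) ** 2)

def mtl_ccc_loss(pred, gold, alpha=0.5, beta=0.5):
    """Multitask loss: 1 - CCC per attribute, with two tunable weights
    (alpha, beta) balancing valence and dominance against arousal.
    pred/gold: (batch, 3) tensors ordered [arousal, valence, dominance]."""
    l_aro = 1 - ccc(pred[:, 0], gold[:, 0])
    l_val = 1 - ccc(pred[:, 1], gold[:, 1])
    l_dom = 1 - ccc(pred[:, 2], gold[:, 2])
    return l_aro + alpha * l_val + beta * l_dom
```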

Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1579 ◽  
Author(s):  
Kyoung Ju Noh ◽  
Chi Yoon Jeong ◽  
Jiyoun Lim ◽  
Seungeun Chung ◽  
Gague Kim ◽  
...  

Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To deploy SER models in real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of SER models to unseen target domains. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a feature extractor transferred from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously from multiple losses according to the association of emotion labels in the discrete and dimensional models. To evaluate the MPGLN SER on multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), comprising KESDy18 and KESDy19, is constructed, and the English-language Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed improvements of 3.7% and 3.5%, respectively, in the F1 score when comparing the MPGLN SER against a baseline SER model that uses only a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
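A rough sketch of the two-branch idea (a BiLSTM temporal feature generator plus transferred VGGish-style clip embeddings, fused and trained with joint discrete and dimensional losses). Layer sizes, the 128-dimensional embedding, and the head structure are assumptions for illustration, not the MPGLN specification.

```python
import torch
import torch.nn as nn

class MultiPathSER(nn.Module):
    """Two feature paths fused for joint discrete/dimensional prediction."""
    def __init__(self, n_lld=40, n_vggish=128, n_classes=4, n_dims=3):
        super().__init__()
        # Path 1: temporal feature generator over frame-level descriptors.
        self.bilstm = nn.LSTM(n_lld, 64, batch_first=True, bidirectional=True)
        # Path 2: projection of transferred (VGGish-like) clip embeddings.
        self.transfer = nn.Sequential(nn.Linear(n_vggish, 128), nn.ReLU())
        fused = 2 * 64 + 128
        self.cls_head = nn.Linear(fused, n_classes)  # discrete emotions
        self.dim_head = nn.Linear(fused, n_dims)     # arousal/valence/dominance

    def forward(self, lld_seq, clip_emb):
        _, (h, _) = self.bilstm(lld_seq)             # h: (2, batch, 64)
        temporal = torch.cat([h[0], h[1]], dim=-1)   # (batch, 128)
        fused = torch.cat([temporal, self.transfer(clip_emb)], dim=-1)
        return self.cls_head(fused), self.dim_head(fused)

# Joint training would combine the two heads' losses, e.g.:
# loss = cross_entropy(logits, labels) + lam * mse(dims, targets)
```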


Author(s):  
Hasrul Mohd Nazid ◽  
Hariharan Muthusamy ◽  
Vikneswaran Vijean ◽  
Sazali Yaacob

In recent years, researchers have focused on improving the accuracy of speech emotion recognition. Generally, high emotion recognition accuracies have been obtained for two-class emotion recognition, but multi-class emotion recognition is still a challenging task. The main aim of this work is to propose a two-stage feature reduction using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) to improve the accuracy of the speech emotion recognition (ER) system. Short-term speech features were extracted from the emotional speech signals. Experiments were carried out using four different supervised classifiers with two different emotional speech databases. From the experimental results, it can be inferred that the proposed method provides better accuracies: 87.48% for the speaker-dependent (SD) and gender-dependent (GD) ER experiment, 85.15% for the speaker-independent (SI) ER experiment, and 87.09% for the gender-independent (GI) experiment.
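A minimal scikit-learn sketch of the two-stage PCA-then-LDA reduction described above. The component count, the SVM classifier, and the placeholder feature matrix are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

# X: short-term speech features (n_utterances x n_features); y: emotion labels.
X = np.random.randn(200, 120)          # placeholder feature matrix
y = np.random.randint(0, 6, size=200)  # placeholder labels for 6 emotions

# Stage 1 (PCA) removes correlated/noisy dimensions; stage 2 (LDA) then
# projects onto at most (n_classes - 1) maximally discriminative axes.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=40)),
    ("lda", LinearDiscriminantAnalysis()),
    ("clf", SVC(kernel="rbf")),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```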


2020 ◽  
Author(s):  
Bagus Tris Atmaja

◆ A speech emotion recognition system based on recurrent neural networks is developed using long short-term memory (LSTM) networks.
◆ Two acoustic feature sets are evaluated: a 31-feature set (3 time-domain features, 5 frequency-domain features, 13 MFCCs, 5 F0s, and 5 harmonics) and the eGeMAPS feature set (23 features).
◆ To evaluate performance, several metrics are used: mean squared error (MSE), mean absolute percentage error (MAPE), mean absolute error (MAE), and the concordance correlation coefficient (CCC). Among these, CCC is the main focus as it is the metric used by other researchers (its definition is given below).
◆ The developed system uses multitask learning to maximize arousal, valence, and dominance at the same time using a CCC loss (1 - CCC). The results show that LSTM networks improve the CCC score compared to the baseline dense network. The best CCC score is obtained on arousal, followed by dominance and valence.
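For reference, this is Lin's concordance correlation coefficient, which the loss above turns into 1 - CCC; here ρ is the Pearson correlation between predictions x and labels y, μ the means, and σ² the variances:

```latex
\mathrm{CCC} = \frac{2 \rho \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}
```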


2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Guihua Wen ◽  
Huihui Li ◽  
Jubing Huang ◽  
Danyang Li ◽  
Eryang Xun

Human emotions can now be recognized from speech signals using machine learning methods; however, these methods are challenged by low recognition accuracy in real applications due to their lack of rich representation ability. Deep belief networks (DBN) can automatically discover multiple levels of representation in speech signals. To make full use of this advantage, this paper presents an ensemble of random deep belief networks (RDBN) for speech emotion recognition. It first extracts the low-level features of the input speech signal and then uses them to construct many random subspaces. Each random subspace is provided to a DBN to yield higher-level features, which serve as the input to a classifier that outputs an emotion label. All output emotion labels are then fused through majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN achieves better accuracy than the compared methods for speech emotion recognition.
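A hedged sketch of the random-subspace ensemble idea using scikit-learn. Since scikit-learn provides no DBN, an MLPClassifier stands in for each subspace's deep network; the subspace count, feature fraction, and placeholder data are illustrative assumptions, not the RDBN configuration.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

# X: low-level features per utterance; y: emotion labels (placeholders).
X = np.random.randn(300, 100)
y = np.random.randint(0, 6, size=300)

# BaggingClassifier with max_features < 1.0 draws a random feature
# subspace per base learner; voting over the per-subspace predictions
# mirrors the majority-vote fusion in the RDBN ensemble scheme.
ensemble = BaggingClassifier(
    estimator=MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300),
    n_estimators=20,
    max_features=0.3,          # each learner sees 30% of the features
    bootstrap=False,           # keep all samples; randomize features only
    bootstrap_features=False,  # subspaces drawn without replacement
    random_state=0,
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```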

