An Ensemble Model for Multi-Level Speech Emotion Recognition

Chunjun Zheng; Chunli Wang; Ning Jia

doi:10.3390/app10010205

Applied Sciences ◽

10.3390/app10010205 ◽

2019 ◽

Vol 10 (1) ◽

pp. 205 ◽

Cited By ~ 5

Author(s):

Chunjun Zheng ◽

Chunli Wang ◽

Ning Jia

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Ensemble Learning ◽

Short Term Memory ◽

Learning Model ◽

Local Features ◽

Speech Emotion Recognition ◽

Model Design ◽

Local Data ◽

Global Features

Speech emotion recognition is a challenging and widely examined research topic in the field of speech processing. Existing models achieve limited accuracy on speech emotion recognition tasks and generalize poorly. Because the feature set and the model design directly affect recognition accuracy, research on both features and models is important. Because emotional expression is often correlated with the global features and local features of speech, and recognition also depends on the model design, it is difficult to find a universal solution for effective speech emotion recognition. Based on this, the main purpose of this paper is to generate general emotion features from speech signals from different angles and to use an ensemble learning model to perform emotion recognition tasks. The work is divided into the following aspects: (1) Three expert roles for speech emotion recognition are designed. Expert 1 focuses on three-dimensional feature extraction from local signals; expert 2 focuses on extracting comprehensive information from local data; and expert 3 emphasizes global features: acoustic feature descriptors (low-level descriptors (LLDs)), high-level statistics functionals (HSFs), and local features and their timing relationships. A single-level or multi-level deep learning model that matches each expert's characteristics is designed for that expert, including a convolutional neural network (CNN), bi-directional long short-term memory (BLSTM), and a gated recurrent unit (GRU). A convolutional recurrent neural network (CRNN) combined with an attention mechanism is used for the internal training of the experts. (2) An ensemble learning model is designed so that each expert can play to its own strengths and evaluate speech emotions from a different focus. (3) Through experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, the performance of the individual experts and of the ensemble learning models in emotion recognition is compared, and the validity of the proposed model is verified.
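
As a rough illustration of how several experts can be combined, the sketch below uses weighted soft voting over per-expert class probabilities. The abstract does not state which ensemble rule the paper uses, so soft voting is an assumption here; the emotion labels, weights, and probability values are likewise hypothetical.

```python
# Hypothetical sketch: weighted soft voting over three "expert" classifiers.
# The label set, weights, and probabilities are illustrative assumptions,
# not values or methods taken from the paper.

EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_vote(expert_probs, weights=None):
    """Return the weighted average of the experts' class distributions."""
    if weights is None:
        weights = [1.0] * len(expert_probs)
    if len(weights) != len(expert_probs):
        raise ValueError("one weight per expert is required")
    total = sum(weights)
    n_classes = len(expert_probs[0])
    combined = [0.0] * n_classes
    for probs, w in zip(expert_probs, weights):
        if len(probs) != n_classes:
            raise ValueError("all experts must score the same classes")
        for i, p in enumerate(probs):
            combined[i] += (w / total) * p
    return combined


def predict(expert_probs, weights=None):
    """Pick the emotion with the highest combined probability."""
    combined = soft_vote(expert_probs, weights)
    best = max(range(len(combined)), key=combined.__getitem__)
    return EMOTIONS[best], combined


if __name__ == "__main__":
    expert_1 = [0.60, 0.10, 0.20, 0.10]  # e.g. a local-feature expert
    expert_2 = [0.30, 0.40, 0.20, 0.10]  # e.g. a local-context expert
    expert_3 = [0.50, 0.20, 0.20, 0.10]  # e.g. a global-feature expert
    label, dist = predict([expert_1, expert_2, expert_3])
    print(label, [round(p, 3) for p in dist])
```

Passing unequal `weights` lets a stronger expert dominate the vote; with equal weights the result is a plain average of the three distributions.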

Audio-Textual Emotion Recognition Based on Improved Neural Networks

Mathematical Problems in Engineering ◽

10.1155/2019/2593036 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 4

Author(s):

Linqin Cai ◽

Yaxin Hu ◽

Jiangong Dong ◽

Sitong Zhou

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Short Term Memory ◽

Recognition Accuracy ◽

Recognition System ◽

Speech Emotion Recognition ◽

Short Term ◽

Term Memory ◽

Emotional Recognition ◽

Long Short Term Memory

With the rapid development of social media, single-modal emotion recognition can hardly satisfy the demands of current emotion recognition systems. To improve the performance of such systems, this paper proposes a multimodal emotion recognition model that uses speech and text. Considering the complementarity between the modalities, a CNN (convolutional neural network) and an LSTM (long short-term memory) network were combined in a two-channel form to learn acoustic emotion features; meanwhile, a Bi-LSTM (bidirectional long short-term memory) network was used to capture the textual features. Furthermore, we applied a deep neural network to learn and classify the fused features. The final emotional state was determined from the outputs of both the speech and the text emotion analysis. Finally, multimodal fusion experiments were carried out on the IEMOCAP database to validate the proposed model. Compared with the single-modal baselines, the overall recognition accuracy increased by 6.70% for text and by 13.85% for speech. Experimental results show that the recognition accuracy of our multimodal model is higher than that of the single-modal models and that it outperforms other published multimodal models on the test datasets.
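
The fusion step described above can be sketched as concatenating the speech and text feature vectors and passing the result through a classifier. The example below is a minimal sketch under stated assumptions: a single linear layer with softmax stands in for the paper's deep neural network, and the function names, feature sizes, and weights are hypothetical.

```python
import math

# Hypothetical sketch of feature-level fusion: concatenate speech and text
# features, then classify with one linear layer plus softmax. A single layer
# is a stand-in for the deep network mentioned in the abstract.


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def fuse_and_classify(speech_feat, text_feat, weights, bias):
    """Concatenate both modalities and return class probabilities.

    weights: one row per class, each row as long as the fused vector.
    bias:    one value per class.
    """
    fused = list(speech_feat) + list(text_feat)
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused feature size")
    if len(bias) != len(weights):
        raise ValueError("one bias per class is required")
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


if __name__ == "__main__":
    speech = [1.0, 0.0]          # made-up acoustic features
    text = [0.0, 1.0]            # made-up textual features
    W = [[1.0, 0.0, 0.0, 1.0],   # class 0 responds to both modalities
         [0.0, 0.0, 0.0, 0.0]]   # class 1 ignores the input
    b = [0.0, 0.0]
    print(fuse_and_classify(speech, text, W, b))
```

In a trained system the weights would be learned jointly with the acoustic and textual encoders rather than set by hand.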

Ensemble Learning With Attention-Integrated Convolutional Recurrent Neural Network for Imbalanced Speech Emotion Recognition

IEEE Access ◽

10.1109/access.2020.3035910 ◽

2020 ◽

Vol 8 ◽

pp. 199909-199919

Author(s):

Xusheng Ai ◽

Victor S. Sheng ◽

Wei Fang ◽

Charles X. Ling ◽

Chunhua Li

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Ensemble Learning ◽

Recurrent Neural Network ◽

Speech Emotion Recognition

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Frontiers in Physiology ◽

10.3389/fphys.2021.643202 ◽

2021 ◽

Vol 12 ◽

Author(s):

Hua Zhang ◽

Ruoyun Gou ◽

Jili Shang ◽

Fangyao Shen ◽

Yifan Wu ◽

...

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Short Term Memory ◽

Convolution Neural Network ◽

Classification Model ◽

Speech Emotion Recognition ◽

Deep Convolution Neural Network ◽

Long Short Term Memory ◽

High Level ◽

Better Than

Speech emotion recognition (SER) is a difficult and challenging task because of the affective variances between different speakers. The performance of SER relies heavily on the features extracted from speech signals. Establishing an effective feature extraction and classification model is still a challenging task. In this paper, we propose a new method for SER based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and dataset balancing. Secondly, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as the DCNN input. Then the DCNN model pre-trained on the ImageNet dataset is applied to generate the segment-level features. We stack these features of a sentence into utterance-level features. Next, we adopt BLSTM to learn the high-level emotional features for temporal summarization, followed by an attention layer that can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed into the Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases obtain an unweighted average recall (UAR) of 87.86% and 68.50%, respectively, which are better than most popular SER methods and demonstrate the effectiveness of our proposed method.
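
The three-channel input mentioned above (static, delta, and delta-delta) can be sketched as follows. This is a minimal sketch, not the authors' preprocessing: it uses the common regression formula for delta coefficients with edge frames repeated at the boundaries, and the window `width` and the toy input are assumptions.

```python
# Hypothetical sketch: build static / delta / delta-delta channels from a
# log Mel-spectrogram stored as a list of frames (each frame is a list of
# Mel-band values). The regression window width is an assumed parameter.


def delta(frames, width=2):
    """Delta coefficients: sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2).

    Frames beyond either end are replaced by the nearest edge frame.
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, n_frames - 1)][k]
                prv = frames[max(t - n, 0)][k]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def three_channel(log_mel, width=2):
    """Return [static, delta, delta-delta], each shaped like the input."""
    d1 = delta(log_mel, width)
    d2 = delta(d1, width)
    return [log_mel, d1, d2]


if __name__ == "__main__":
    # Toy "spectrogram": 7 frames, 1 Mel band, rising linearly over time.
    ramp = [[float(t)] for t in range(7)]
    static, d1, d2 = three_channel(ramp)
    print(d1[3], d2[3])  # slope of the ramp, then its rate of change
```

Stacking the three channels gives an image-like tensor, which is what allows a network pre-trained on three-channel images to be reused on spectrogram input.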

Speech Emotion Recognition using Time Distributed CNN and LSTM

ITM Web of Conferences ◽

10.1051/itmconf/20214003006 ◽

2021 ◽

Vol 40 ◽

pp. 03006

Author(s):

Beenaa Salian ◽

Omkar Narvade ◽

Rujuta Tambewagh ◽

Smita Bharne

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Short Term Memory ◽

Recognition System ◽

Speech Emotion Recognition ◽

Audio Analysis ◽

The Neural Network ◽

Testing Accuracy ◽

Characteristic Features ◽

Four Blocks

Speech has several distinguishing characteristic features, which have remained a state-of-the-art tool for extracting valuable information from audio samples. Our aim is to develop an emotion recognition system using these speech features that can accurately and efficiently recognize emotions through audio analysis. In this article, we employ a hybrid neural network comprising four blocks of time-distributed convolutional layers followed by a Long Short-Term Memory layer to achieve this. The audio samples for the speech dataset are assembled from the RAVDESS, TESS, and SAVEE audio datasets and are further augmented by injecting noise. Mel spectrograms are computed from the audio samples and are used to train the neural network. We have been able to achieve a testing accuracy of about 89.26%.
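
Noise injection for augmentation can be sketched as adding Gaussian noise scaled to a chosen signal-to-noise ratio. The abstract does not say what kind of noise or what level was used, so the Gaussian noise, the `snr_db` parameter, and the toy waveform below are assumptions for illustration only.

```python
import math
import random

# Hypothetical sketch: augment a waveform by adding white Gaussian noise
# at a target signal-to-noise ratio (SNR). Noise type and SNR are assumed.


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def add_noise(signal, snr_db, seed=0):
    """Return signal plus Gaussian noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    # Rescale the drawn noise so its measured power hits the target.
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from a clean signal and its noisy version."""
    noise = [y - x for x, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(noise))


if __name__ == "__main__":
    # Toy waveform: one second of a 220 Hz tone sampled at 8 kHz.
    clean = [math.sin(2 * math.pi * 220 * i / 8000) for i in range(8000)]
    noisy = add_noise(clean, snr_db=20.0)
    print(round(measured_snr_db(clean, noisy), 3))
```

Varying `seed` and `snr_db` per sample yields several noisy copies of each recording, which is one simple way to enlarge a small emotional speech dataset.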

Speech emotion recognition using convolutional long short-term memory neural network and support vector machines

2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) ◽

10.1109/apsipa.2017.8282315 ◽

2017 ◽

Cited By ~ 1

Author(s):

Nattapong Kurpukdee ◽

Tomoki Koriyama ◽

Takao Kobayashi ◽

Sawit Kasuriya ◽

Chai Wutiwiwatchai ◽

...

Keyword(s):

Neural Network ◽

Support Vector Machines ◽

Emotion Recognition ◽

Short Term Memory ◽

Speech Emotion Recognition ◽

Support Vector ◽

Short Term ◽

Term Memory ◽

Vector Machines ◽

Long Short Term Memory

Modeling Perceivers Neural-Responses Using Lobe-Dependent Convolutional Neural Network to Improve Speech Emotion Recognition

10.21437/interspeech.2017-562 ◽

2017 ◽

Cited By ~ 3

Author(s):

Ya-Tse Wu ◽

Hsuan-Yu Chen ◽

Yu-Hsien Liao ◽

Li-Wei Kuo ◽

Chi-Chun Lee

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Emotion Recognition ◽

Speech Emotion Recognition ◽

Neural Responses

Improving Sentiment Analysis using Hybrid Deep Learning Model

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190328200012 ◽

2020 ◽

Vol 13 (4) ◽

pp. 627-640 ◽

Cited By ~ 1

Author(s):

Avinash Chandra Pandey ◽

Dharmveer Singh Rajpoot

Keyword(s):

Neural Network ◽

Deep Learning ◽

Sentiment Analysis ◽

Classification Accuracy ◽

Short Term Memory ◽

Computational Cost ◽

Extraction Process ◽

Learning Model ◽

Sentiment Classification ◽

Deep Learning Model

Background: Sentiment analysis is the contextual mining of text to determine the viewpoint of users with respect to sentimental topics commonly discussed on social networking websites. Twitter is one of the social sites where people express their opinions about any topic in the form of tweets. These tweets can be examined using various sentiment classification methods to find the opinion of users. Traditional sentiment analysis methods use manually extracted features for opinion classification. The manual feature extraction process is a complicated task since it requires predefined sentiment lexicons. On the other hand, deep learning methods automatically extract relevant features from data; hence, they provide better performance and richer representation competency than the traditional methods. Objective: The main aim of this paper is to enhance the sentiment classification accuracy and to reduce the computational cost. Method: To achieve the objective, a hybrid deep learning model based on a convolution neural network and a bi-directional long short-term memory neural network has been introduced. Results: The proposed sentiment classification method achieves the highest accuracy for most of the datasets. Further, the efficacy of the proposed method has been validated through statistical analysis. Conclusion: Sentiment classification accuracy can be improved by creating veracious hybrid models. Moreover, performance can also be enhanced by tuning the hyperparameters of deep learning models.

Robust Speech Emotion Recognition for Sindhi Language based on Deep Convolutional Neural Network

2021 International Conference on Communications, Information System and Computer Engineering (CISCE) ◽

10.1109/cisce52179.2021.9445883 ◽

2021 ◽

Author(s):

Muddasar Laghari ◽

Muhammad Junaid Tahir ◽

Abdullah Azeem ◽

Waqar Riaz ◽

Yi Zhou

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Emotion Recognition ◽

Deep Convolutional Neural Network ◽

Speech Emotion Recognition

Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets

Sensors ◽

10.3390/s21051579 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1579 ◽

Cited By ~ 1

Author(s):

Kyoung Ju Noh ◽

Chi Yoon Jeong ◽

Jiyoun Lim ◽

Seungeun Chung ◽

Gague Kim ◽

...

Keyword(s):

Emotion Recognition ◽

Short Term Memory ◽

Domain Adaptation ◽

Classification Model ◽

Speech Emotion Recognition ◽

Target Domain ◽

Model Generalization ◽

Speech Database ◽

Emotion Labels ◽

Temporal Feature

Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model for an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a transferred feature extractor from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of the MPGLN SER as applied to multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-speaking Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed 3.7% and 3.5% improvements, respectively, of the F1 score when comparing the performance of MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
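
Learning from multiple losses over discrete and dimensional emotion labels can be illustrated with a weighted sum of a classification loss and a regression loss. This is only a generic sketch: the paper's group loss is not specified in the abstract, so the cross-entropy and mean-squared-error terms, the `alpha` and `beta` weights, and the valence/arousal targets are all assumptions.

```python
import math

# Hypothetical sketch: combine a discrete-label loss (cross-entropy) with a
# dimensional-label loss (MSE on valence/arousal). This is a generic
# multi-task objective, not the group loss defined in the paper.


def cross_entropy(probs, target_index, eps=1e-12):
    """Negative log-likelihood of the correct class."""
    return -math.log(max(probs[target_index], eps))


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("prediction and target sizes differ")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(class_probs, class_target, dim_pred, dim_target,
                  alpha=1.0, beta=1.0):
    """Weighted sum of the discrete and dimensional losses."""
    discrete = cross_entropy(class_probs, class_target)
    dimensional = mse(dim_pred, dim_target)
    return alpha * discrete + beta * dimensional


if __name__ == "__main__":
    probs = [0.7, 0.2, 0.1]        # predicted class probabilities
    valence_arousal = [0.4, 0.6]   # predicted dimensional values
    print(combined_loss(probs, 0, valence_arousal, [0.5, 0.5]))
```

Tuning `alpha` and `beta` trades off how strongly each label type shapes the shared representation.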

Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients

Multimedia Tools and Applications ◽

10.1007/s11042-020-10329-2 ◽

2021 ◽

Author(s):

Manju D. Pawar ◽

Rajendra D. Kokate

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Convolution Neural Network ◽

Speech Emotion Recognition