High-level feature representation using recurrent neural network for speech emotion recognition

Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network

IEEE Access ◽

10.1109/access.2020.2984368 ◽

2020 ◽

Vol 8 ◽

pp. 61672-61686 ◽

Cited By ~ 6

Author(s):

Ngoc-Huynh Ho ◽

Hyung-Jeong Yang ◽

Soo-Hyung Kim ◽

Gueesang Lee

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Recurrent Neural Network ◽

Speech Emotion Recognition ◽

Multimodal Approach ◽

Multi Level

Dependency Exploitation: A Unified CNN-RNN Approach for Visual Emotion Recognition

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/503 ◽

2017 ◽

Cited By ~ 21

Author(s):

Xinge Zhu ◽

Liang Li ◽

Weigang Zhang ◽

Tianrong Rao ◽

Min Xu ◽

...

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Feature Fusion ◽

Feature Representation ◽

Low Level ◽

Learning Framework ◽

Independent Entity ◽

Internet Images ◽

High Level ◽

Different Levels

Visual emotion recognition aims to associate images with appropriate emotions. The visual stimuli that affect human emotion range from low-level to high-level, such as color, texture, part, and object. However, most existing methods treat different levels of features as independent entities, without an effective method for feature fusion. In this paper, we propose a unified CNN-RNN model to predict the emotion based on the fused features from different levels by exploiting the dependency among them. Our proposed architecture leverages a convolutional neural network (CNN) with multiple layers to extract different levels of features within a multi-task learning framework, in which two related loss functions are introduced to learn the feature representation. Considering the dependencies within the low-level and high-level features, a new bidirectional recurrent neural network (RNN) is proposed to integrate the learned features from different layers in the CNN model. Extensive experiments on both Internet images and art photo datasets demonstrate that our method outperforms the state-of-the-art methods with at least 7% performance improvement.
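The fusion idea in this abstract, a bidirectional RNN reading features ordered from low level to high level, can be illustrated with a toy sketch. This is not the authors' model: the element-wise tanh recurrence, the scalar weights `w_in` and `w_rec`, and the assumption that every level has already been projected to the same vector size are all hypothetical simplifications.

```python
import math

def rnn_pass(seq, w_in, w_rec):
    """Run an element-wise tanh RNN over a list of equal-length vectors."""
    h = [0.0] * len(seq[0])  # initial hidden state
    states = []
    for x in seq:
        # Each hidden unit mixes the current input with its previous state.
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h)]
        states.append(h)
    return states

def bidirectional_fuse(level_features, w_in=0.5, w_rec=0.5):
    """Fuse per-level features by concatenating forward and backward states.

    level_features is ordered from low level to high level (hypothetical
    layout); the weights are fixed here, whereas a real model learns them.
    """
    fwd = rnn_pass(level_features, w_in, w_rec)
    # Backward pass: read high-to-low, then restore the original order.
    bwd = rnn_pass(level_features[::-1], w_in, w_rec)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

# Three made-up feature levels, each a 2-dimensional vector.
levels = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
fused = bidirectional_fuse(levels)
```

Each fused vector has twice the input size, because the state at every level carries context from both the lower levels (forward pass) and the higher levels (backward pass).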

A Hybrid Technique using CNN+LSTM for Speech Emotion Recognition

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e1027.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 1126-1130

Keyword(s):

Feature Extraction ◽

Human Computer Interaction ◽

Emotion Recognition ◽

Speech Emotion Recognition ◽

Hybrid Technique ◽

Proposed Model ◽

High Level ◽

High Level Feature ◽

Convolutional Lstm

Automatic speech emotion recognition is essential for effective human-computer interaction. This paper uses spectrograms as inputs to a hybrid deep convolutional LSTM for speech emotion recognition. In this study, we trained our proposed model using four convolutional layers for high-level feature extraction from input spectrograms, an LSTM layer for accumulating long-term dependencies, and finally two dense layers. Experimental results on the SAVEE database show promising performance: the proposed model obtained an accuracy of 94.26%.
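To make the layer stack in this abstract concrete, the sketch below traces how a spectrogram's size might shrink through four convolution stages before reaching the LSTM. The abstract gives no hyperparameters, so the kernel size, padding, 2x2 pooling after each convolution, and the 128x256 input size are assumptions chosen only for illustration; the output-size formula itself is the standard one for convolutions.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Standard convolution output length: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def trace_shapes(freq_bins, frames, n_conv=4, kernel=3, padding=1, pool=2):
    """Return (freq, time) sizes after each conv + pooling stage.

    All hyperparameters are hypothetical; only the count of four
    convolutional layers comes from the abstract.
    """
    shapes = [(freq_bins, frames)]
    f, t = freq_bins, frames
    for _ in range(n_conv):
        f = conv_out(f, kernel, padding=padding)
        t = conv_out(t, kernel, padding=padding)
        f, t = f // pool, t // pool  # non-overlapping max pooling
        shapes.append((f, t))
    return shapes

shapes = trace_shapes(128, 256)  # assumed 128 mel bins, 256 frames
final_freq, final_time = shapes[-1]
```

Under these assumptions the LSTM would read `final_time` steps, each step being the flattened frequency-by-channel slice at that time index, and its last hidden state would feed the two dense layers.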

Conversational Speech Emotion Recognition From Indonesian Spoken Language Using Recurrent Neural Network-Based Model

10.1109/icaicta53211.2021.9640273 ◽

2021 ◽

Author(s):

Aisyah Nurul Izzah Adma ◽

Dessi Puji Lestari

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Recurrent Neural Network ◽

Spoken Language ◽

Speech Emotion Recognition ◽

Conversational Speech

Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition

2017 IEEE International Conference on Multimedia and Expo (ICME) ◽

10.1109/icme.2017.8019296 ◽

2017 ◽

Cited By ~ 31

Author(s):

Che-Wei Huang ◽

Shrikanth Shri Narayanan

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Recurrent Neural Network ◽

Attention Mechanism ◽

Speech Emotion Recognition

Discretized Continuous Speech Emotion Recognition with Multi-Task Deep Recurrent Neural Network

10.21437/interspeech.2017-94 ◽

2017 ◽

Cited By ~ 12

Author(s):

Duc Le ◽

Zakaria Aldeneh ◽

Emily Mower Provost

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Recurrent Neural Network ◽

Speech Emotion Recognition ◽

Continuous Speech ◽

Deep Recurrent Neural Network

Parallelized Convolutional Recurrent Neural Network With Spectral Features for Speech Emotion Recognition

IEEE Access ◽

10.1109/access.2019.2927384 ◽

2019 ◽

Vol 7 ◽

pp. 90368-90377 ◽

Cited By ~ 11

Author(s):

Pengxu Jiang ◽

Hongliang Fu ◽

Huawei Tao ◽

Peizhi Lei ◽

Li Zhao

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Recurrent Neural Network ◽

Speech Emotion Recognition ◽

Spectral Features

Ensemble Learning With Attention-Integrated Convolutional Recurrent Neural Network for Imbalanced Speech Emotion Recognition

IEEE Access ◽

10.1109/access.2020.3035910 ◽

2020 ◽

Vol 8 ◽

pp. 199909-199919

Author(s):

Xusheng Ai ◽

Victor S. Sheng ◽

Wei Fang ◽

Charles X. Ling ◽

Chunhua Li

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Ensemble Learning ◽

Recurrent Neural Network ◽

Speech Emotion Recognition

A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms

Applied Sciences ◽

10.3390/app11041890 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1890

Author(s):

Sung-Woo Byun ◽

Seok-Pil Lee

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Network Model ◽

Recurrent Neural Network ◽

Neural Network Model ◽

Recognition Performance ◽

Recognition System ◽

Speech Emotion Recognition ◽

Acoustic Features ◽

Speech Database

The goal of the human interface is to recognize the user’s emotional state precisely. In speech emotion recognition research, the most important issue is combining the extraction of proper speech features with an appropriate classification engine. Well-defined speech databases are also needed to accurately recognize and analyze emotions from speech signals. In this work, we constructed a Korean emotional speech database for speech emotion analysis and proposed a feature combination that can improve emotion recognition performance using a recurrent neural network model. To investigate the acoustic features, which can reflect distinct momentary changes in emotional expression, we extracted F0, Mel-frequency cepstrum coefficients, spectral features, harmonic features, and others. Statistical analysis was performed to select an optimal combination of acoustic features that affect the emotion from speech. We used a recurrent neural network model to classify emotions from speech. The results show that the proposed system is more accurate than those of previous studies.

Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition

Frontiers in Physiology ◽

10.3389/fphys.2021.643202 ◽

2021 ◽

Vol 12 ◽

Author(s):

Hua Zhang ◽

Ruoyun Gou ◽

Jili Shang ◽

Fangyao Shen ◽

Yifan Wu ◽

...

Keyword(s):

Neural Network ◽

Emotion Recognition ◽

Short Term Memory ◽

Convolution Neural Network ◽

Classification Model ◽

Speech Emotion Recognition ◽

Deep Convolution Neural Network ◽

Long Short Term Memory ◽

High Level ◽

Better Than

Speech emotion recognition (SER) is a challenging task because of the affective variances between different speakers. The performance of SER relies heavily on the features extracted from speech signals. Establishing an effective feature extraction and classification model remains a challenging task. In this paper, we propose a new method for SER based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model (DCNN-BLSTMwA). We first preprocess the speech samples by data enhancement and dataset balancing. Secondly, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. Then the DCNN model pre-trained on the ImageNet dataset is applied to generate the segment-level features. We stack these features of a sentence into utterance-level features. Next, we adopt BLSTM to learn the high-level emotional features for temporal summarization, followed by an attention layer which can focus on emotionally relevant features. Finally, the learned high-level emotional features are fed into the Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases obtain an unweighted average recall (UAR) of 87.86% and 68.50%, respectively, which is better than most popular SER methods and demonstrates the effectiveness of our proposed method.
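The abstract reports unweighted average recall (UAR), which is the mean of the per-class recalls, so every emotion class counts equally regardless of how many samples it has. The sketch below computes it on made-up labels to show how it can differ from plain accuracy on imbalanced data; the function name and the example labels are illustrative and do not come from the paper.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)    # correctly predicted samples per class
    totals = defaultdict(int)  # true samples per class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Made-up, imbalanced example: four "angry" samples and one "sad" sample,
# with a classifier that always answers "angry".
y_true = ["angry", "angry", "angry", "angry", "sad"]
y_pred = ["angry", "angry", "angry", "angry", "angry"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here accuracy is 0.8 while UAR is 0.5, since recall is 1.0 for "angry" and 0.0 for "sad". This is why UAR is a common choice for emotion corpora with uneven class sizes.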
