Cascaded Convolutional Neural Network Architecture for Speech Emotion Recognition in Noisy Conditions

Youngja Nam; Chankyu Lee

doi:10.3390/s21134399

Cascaded Convolutional Neural Network Architecture for Speech Emotion Recognition in Noisy Conditions

Sensors ◽

10.3390/s21134399 ◽

2021 ◽

Vol 21 (13) ◽

pp. 4399

Author(s):

Youngja Nam ◽

Chankyu Lee

Keyword(s):

Emotion Recognition ◽

Network Architecture ◽

Speech Emotion Recognition ◽

Emotional Speech ◽

Adverse Conditions ◽

Residual Learning ◽

Noisy Conditions ◽

Speech Denoising ◽

Two Stages ◽

Language Universal

Convolutional neural networks (CNNs) are a state-of-the-art technique for speech emotion recognition. However, CNNs have mostly been applied to noise-free emotional speech data, and limited evidence is available for their applicability in emotional speech denoising. In this study, a cascaded denoising CNN (DnCNN)–CNN architecture is proposed to classify emotions from Korean and German speech in noisy conditions. The proposed architecture consists of two stages. In the first stage, the DnCNN exploits the concept of residual learning to perform denoising; in the second stage, the CNN performs the classification. The classification results for real datasets show that the DnCNN–CNN outperforms the baseline CNN in overall accuracy for both languages. For Korean speech, the DnCNN–CNN achieves an accuracy of 95.8%, whereas the accuracy of the CNN is marginally lower (93.6%). For German speech, the DnCNN–CNN has an overall accuracy of 59.3–76.6%, whereas the CNN has an overall accuracy of 39.4–58.1%. These results demonstrate the feasibility of applying the DnCNN with residual learning to speech denoising and the effectiveness of the CNN-based approach in speech emotion recognition. Our findings provide new insights into speech emotion recognition in adverse conditions and have implications for language-universal speech emotion recognition.
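The two-stage idea in this abstract can be illustrated with a minimal sketch. In residual learning the denoiser predicts the noise, and the clean estimate is the noisy input minus that prediction. Everything named below is an assumption for illustration only: `predict_residual` is a hypothetical moving-average stand-in for a trained DnCNN, and `classifier` is any caller-supplied function. It is not the authors' network.

```python
from typing import Callable, List, Sequence


def predict_residual(noisy: Sequence[float], width: int = 5) -> List[float]:
    """Hypothetical stand-in for a trained DnCNN: returns a noise estimate.

    The noise is approximated as whatever a centered moving average removes.
    """
    half = width // 2
    residual = []
    for i, value in enumerate(noisy):
        window = noisy[max(0, i - half): i + half + 1]
        residual.append(value - sum(window) / len(window))
    return residual


def denoise(noisy: Sequence[float]) -> List[float]:
    """Residual learning: clean estimate = noisy input - predicted noise."""
    noise = predict_residual(noisy)
    return [y - v for y, v in zip(noisy, noise)]


def classify_noisy(noisy: Sequence[float],
                   classifier: Callable[[List[float]], str]) -> str:
    """Two-stage cascade: denoise first, then classify the cleaned signal."""
    return classifier(denoise(noisy))
```

In a real system both stages would be trained networks operating on spectrograms rather than raw sample lists; the sketch only shows how the stages connect.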


Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition

Sensors ◽

10.3390/s20195559 ◽

2020 ◽

Vol 20 (19) ◽

pp. 5559

Author(s):

Minji Seo ◽

Myungho Kim

Keyword(s):

Visual Attention ◽

Emotion Recognition ◽

Expressed Emotion ◽

Local Features ◽

Speech Emotion Recognition ◽

Bag Of Visual Words ◽

Emotional Speech ◽

Visual Words ◽

Performance Reduction ◽

Global And Local

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested on different datasets, their performance drops. Cross-corpus SER research identifies speech emotion using different corpora and languages, and recent work in this area has aimed to improve generalization. To improve cross-corpus SER performance, we pretrained our visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules, on log-mel spectrograms of the source dataset. To train on the target dataset, we extracted a feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps the VACNN to learn global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method shows an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34%, respectively, compared to existing state-of-the-art cross-corpus SER approaches.
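The "frequency histogram of visual words" mentioned in this abstract can be sketched as follows. Each local descriptor is assigned to its nearest visual word and the assignments are counted. The codebook here is a hypothetical input (in bag-of-visual-words pipelines it is commonly obtained by clustering local descriptors, but the abstract does not say how the authors built theirs), and the function names are illustrative.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bovw_histogram(descriptors: Sequence[Sequence[float]],
                   codebook: Sequence[Sequence[float]]) -> List[float]:
    """Normalized frequency histogram of visual words.

    descriptors: local feature vectors extracted from one spectrogram image.
    codebook: visual words (assumed given, e.g. cluster centers).
    """
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        # Assign the descriptor to the closest visual word.
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(descriptor, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]
```

The resulting vector has one entry per visual word, so spectrograms of any size map to a feature vector of fixed length.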


Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network

Sensors ◽

10.3390/s20216008 ◽

2020 ◽

Vol 20 (21) ◽

pp. 6008 ◽

Cited By ~ 1

Author(s):

Misbah Farooq ◽

Fawad Hussain ◽

Naveed Khan Baloch ◽

Fawad Riasat Raja ◽

Heejung Yu ◽

...

Keyword(s):

Neural Network ◽

Feature Selection ◽

Convolutional Neural Network ◽

Emotion Recognition ◽

Deep Convolutional Neural Network ◽

Speech Emotion Recognition ◽

Support Vector ◽

Emotional Speech ◽

Human Machine Interaction ◽

Speaker Independent

Speech emotion recognition (SER) plays a significant role in human–machine interaction. Recognizing emotion from speech and classifying it precisely is a challenging task because a machine is unable to understand its context. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotion classification from speech signals; however, they are not efficient enough to accurately depict the emotional states of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art emotional speech datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). Our proposed method achieves an accuracy of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS in speaker-dependent SER experiments. Moreover, our method yields the best results for speaker-independent SER compared with existing handcrafted feature-based SER approaches.
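As a rough illustration of selecting features by correlation, the sketch below ranks each feature by the absolute Pearson correlation between its values and the class labels, then keeps the top k. This is a simplified stand-in, not the technique used in the paper: the abstract does not specify the exact criterion, and correlation-based selection methods often also penalize redundancy among the selected features, which this sketch omits. The function names and the numeric label encoding are assumptions.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(rows: Sequence[Sequence[float]],
                    labels: Sequence[float],
                    k: int) -> List[int]:
    """Indices of the k features most correlated (in absolute value) with the labels.

    rows: one feature vector per utterance; labels: numeric class labels.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append(abs(pearson(column, labels)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

The selected column indices would then be used to slice the extracted feature matrix before training any of the classifiers listed in the abstract.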


Speech Emotion Recognition System

International Journal of Advanced Research in Science, Communication and Technology ◽

10.48175/ijarsct-v4-i3-024 ◽

2021 ◽

pp. 156-159

Author(s):

Sourabh Suke ◽

Ganesh Regulwar ◽

Nikesh Aote ◽

Pratik Chaudhari ◽

Rajat Ghatode ◽

...

Keyword(s):

Emotion Recognition ◽

Automobile Industry ◽

Emotional State ◽

Recognition System ◽

Classification Model ◽

General Idea ◽

Speech Emotion Recognition ◽

Support Vector ◽

Emotional Speech ◽

Acoustic Features

This project describes "VoiEmo- A Speech Emotion Recognizer", a system for recognizing the emotional state of an individual from his/her speech. For example, one's speech becomes loud and fast, with a higher and wider range in pitch, when in a state of fear, anger, or joy, whereas the human voice is generally slow and low-pitched in sadness and tiredness. We have developed classification models for speech emotion detection based on convolutional neural networks (CNNs), support vector machine (SVM), and multilayer perceptron (MLP) classifiers, which make predictions from acoustic features of the speech signal such as Mel-frequency cepstral coefficients (MFCCs). Our models have been trained to recognize eight common emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprise). For training and testing the models, we have used relevant data from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS). The system is advantageous because it can provide a general idea of the emotional state of the individual based on the acoustic features of the speech, irrespective of the language the speaker uses; it also saves time and effort. Speech emotion recognition systems have applications in various fields, such as call centers and BPOs, criminal investigation, psychiatric therapy, and the automobile industry.


IMPROVED SPEAKER-INDEPENDENT EMOTION RECOGNITION FROM SPEECH USING TWO-STAGE FEATURE REDUCTION

Journal of Information and Communication Technology ◽

10.32890/jict2015.14.0.8156 ◽

2015 ◽

Author(s):

Hasrul Mohd Nazid ◽

Hariharan Muthusamy ◽

Vikneswaran Vijean ◽

Sazali Yaacob

Keyword(s):

Emotion Recognition ◽

Principal Component ◽

Feature Reduction ◽

Speech Emotion Recognition ◽

Emotional Speech ◽

Two Stage ◽

Linear Discriminant ◽

Speaker Independent ◽

Speech Features ◽

And Gender

In recent years, researchers have focused on improving the accuracy of speech emotion recognition. Generally, high emotion recognition accuracies have been obtained for two-class emotion recognition, but multi-class emotion recognition is still a challenging task. The main aim of this work is to propose a two-stage feature reduction using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for improving the accuracy of the speech emotion recognition (ER) system. Short-term speech features were extracted from the emotional speech signals. Experiments were carried out using four different supervised classifiers with two different emotional speech databases. From the experimental results, it can be inferred that the proposed method provides better accuracies of 87.48% for the speaker-dependent (SD) and gender-dependent (GD) ER experiment, 85.15% for the speaker-independent (SI) ER experiment, and 87.09% for the gender-independent (GI) experiment.


Databases, Features and Classification Techniques for Speech Emotion Recognition

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.f3487.049620 ◽

2020 ◽

Vol 9 (6) ◽

pp. 185-190

Keyword(s):

Emotion Recognition ◽

Research Area ◽

Research Field ◽

Speech Emotion Recognition ◽

Emotional Speech ◽

Human Machine Interaction ◽

Classification Techniques ◽

Classification Feature ◽

Physical Gestures ◽

Machine Interaction

Emotion recognition is a rapidly growing research field. Emotions can be effectively expressed through speech and can provide insight into a speaker’s intentions. Although humans can easily interpret emotions through speech, physical gestures, and eye movement, training a machine to do the same with similar precision is quite a challenging task. SER systems can improve human-machine interaction when used with automatic speech recognition, as emotions tend to change the semantics of a sentence. Many researchers have contributed impressive work in this research area, leading to the development of numerous classification, feature selection, and feature extraction techniques and emotional speech databases. This paper reviews recent accomplishments in the area of speech emotion recognition. It also presents a detailed review of various types of emotional speech databases and of different classification techniques, which can be used individually or in combination, along with a brief description of various speech features for emotion recognition.


Creation of speech corpus for emotion analysis in Gujarati language and its evaluation by various speech parameters

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i5.pp4752-4758 ◽

2020 ◽

Vol 10 (5) ◽

pp. 4752

Author(s):

Vishal P. Tank ◽

S. K. Hadia

Keyword(s):

Artificial Intelligence ◽

Facial Expression ◽

Emotion Recognition ◽

Speech Emotion Recognition ◽

Emotional States ◽

Emotional Speech ◽

Speech Corpus ◽

Speech Database ◽

Machine Communication ◽

Gujarati Language

In the last couple of years, emotion recognition has proven its significance in the area of artificial intelligence and man-machine communication. Emotion recognition can be done using speech and images (facial expressions); this paper deals with speech emotion recognition (SER) only. For emotion recognition, an emotional speech database is essential. In this paper, we propose an emotional database developed in the Gujarati language, one of the official languages of India. The proposed speech corpus covers six emotional states: sadness, surprise, anger, disgust, fear, and happiness. To observe the effect of different emotions, the proposed Gujarati speech database is analyzed using efficient speech parameters such as pitch, energy, and MFCC in MATLAB.
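Of the speech parameters named in this abstract, energy is the simplest to illustrate. A common way to compute it is frame by frame, as the sum of squared samples in each frame. The sketch below is written in Python rather than the MATLAB used in the paper, and the function name, frame length, and hop size are assumptions for illustration; the abstract does not state the authors' analysis settings.

```python
from typing import List, Sequence


def short_time_energy(signal: Sequence[float],
                      frame_length: int,
                      hop: int) -> List[float]:
    """Sum of squared samples in each frame.

    Trailing samples that do not fill a complete frame are dropped.
    """
    if frame_length <= 0 or hop <= 0:
        raise ValueError("frame_length and hop must be positive")
    energies = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        frame = signal[start:start + frame_length]
        energies.append(sum(sample * sample for sample in frame))
    return energies
```

Comparing such energy contours across recordings of the same sentence spoken with different emotions is one simple way to inspect how loudness varies with emotional state.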


Multi-Conditioning and Data Augmentation Using Generative Noise Model for Speech Emotion Recognition in Noisy Conditions

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp40776.2020.9053581 ◽

2020 ◽

Author(s):

Upasana Tiwari ◽

Meet Soni ◽

Rupayan Chakraborty ◽

Ashish Panda ◽

Sunil Kumar Kopparapu

Keyword(s):

Emotion Recognition ◽

Data Augmentation ◽

Noise Model ◽

Speech Emotion Recognition ◽

Noisy Conditions


Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture

2004 IEEE International Conference on Acoustics, Speech, and Signal Processing ◽

10.1109/icassp.2004.1326051 ◽

2004 ◽

Cited By ~ 132

Author(s):

B. Schuller ◽

G. Rigoll ◽

M. Lang

Keyword(s):

Support Vector Machine ◽

Emotion Recognition ◽

Network Architecture ◽

Speech Emotion Recognition ◽

Support Vector ◽

Acoustic Features ◽

Linguistic Information ◽

Belief Network


Separation of Emotional and Reconstruction Embeddings on Ladder Network to Improve Speech Emotion Recognition Robustness in Noisy Conditions

10.21437/interspeech.2021-1438 ◽

2021 ◽

Author(s):

Seong-Gyun Leem ◽

Daniel Fulford ◽

Jukka-Pekka Onnela ◽

David Gard ◽

Carlos Busso

Keyword(s):

Emotion Recognition ◽

Speech Emotion Recognition ◽

Ladder Network ◽

Noisy Conditions


Speech Emotion Recognition: Humans vs Machines

Discourse ◽

10.32603/2412-8562-2019-5-5-136-152 ◽

2019 ◽

Vol 5 (5) ◽

pp. 136-152

Author(s):

S. Werner ◽

G. N. Petrenko

Keyword(s):

Emotion Recognition ◽

Native Speakers ◽

Musical Training ◽

Speech Emotion Recognition ◽

Emotional Speech ◽

Emotion Classification ◽

Language Technology ◽

Language Studies ◽

And Gender ◽

Big Six

Introduction. The study focuses on emotional speech perception and speech emotion recognition using prosodic clues alone. Theoretical problems of defining prosody, intonation and emotion along with the challenges of emotion classification are discussed. An overview of acoustic and perceptional correlates of emotions found in speech is provided. Technical approaches to speech emotion recognition are also considered in the light of the latest emotional speech automatic classification experiments. Methodology and sources. The typical “big six” classification commonly used in technical applications is chosen and modified to include such emotions as disgust and shame. A database of emotional speech in Russian is created under sound laboratory conditions. A perception experiment is run using Praat software’s experimental environment. Results and discussion. Cross-cultural emotion recognition possibilities are revealed, as the Finnish and international participants recognised about half of the samples correctly. Nonetheless, native speakers of Russian appear to distinguish a larger proportion of emotions correctly. The effects of foreign language knowledge, musical training and gender on performance in the experiment were insufficiently prominent. The most commonly confused pairs of emotions, such as shame and sadness, surprise and fear, anger and disgust, as well as confusions with neutral emotion, were also given due attention. Conclusion. The work can contribute to psychological studies, clarifying emotion classification and the gender aspect of emotionality; to linguistic research, providing new evidence for prosodic and comparative language studies; and to language technology, deepening the understanding of possible challenges for SER systems.
