Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition

Sensors ◽

10.3390/s20195559 ◽

2020 ◽

Vol 20 (19) ◽

pp. 5559

Author(s):

Minji Seo ◽

Myungho Kim

Keyword(s):

Visual Attention ◽

Emotion Recognition ◽

Expressed Emotion ◽

Local Features ◽

Speech Emotion Recognition ◽

Bag Of Visual Words ◽

Emotional Speech ◽

Visual Words ◽

Performance Reduction ◽

Global And Local

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. SER methods lose accuracy when they are trained and tested on different datasets. Cross-corpus SER research identifies speech emotion using different corpora and languages, and recent work in this area has aimed to improve generalization. To improve cross-corpus SER performance, we pretrained our visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules, on the log-mel spectrograms of the source dataset. To train on the target dataset, we extracted a feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps the VACNN learn global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method shows an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% over existing state-of-the-art cross-corpus SER approaches.
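The frequency histogram of visual words mentioned in the abstract can be illustrated with a minimal sketch. It assumes a vocabulary of visual words is already available and that local descriptors are plain numeric tuples; the names `nearest_word` and `bovw_histogram` and the toy data are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to the descriptor (Euclidean)."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))


def bovw_histogram(descriptors, vocabulary):
    """Count how often each visual word is the nearest one, normalized to sum to 1."""
    if not descriptors:
        return [0.0] * len(vocabulary)
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    total = len(descriptors)
    return [counts.get(i, 0) / total for i in range(len(vocabulary))]


# Hypothetical toy data: three 2-D visual words and four local descriptors.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]
histogram = bovw_histogram(descriptors, vocabulary)
print(histogram)
```

The resulting fixed-length vector is the kind of feature that could be passed to a classifier alongside CNN features; how the paper actually combines the two is not specified here.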


Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network

Sensors ◽

10.3390/s20216008 ◽

2020 ◽

Vol 20 (21) ◽

pp. 6008 ◽

Cited By ~ 1

Author(s):

Misbah Farooq ◽

Fawad Hussain ◽

Naveed Khan Baloch ◽

Fawad Riasat Raja ◽

Heejung Yu ◽

...

Keyword(s):

Neural Network ◽

Feature Selection ◽

Convolutional Neural Network ◽

Emotion Recognition ◽

Deep Convolutional Neural Network ◽

Speech Emotion Recognition ◽

Support Vector ◽

Emotional Speech ◽

Human Machine Interaction ◽

Speaker Independent

Speech emotion recognition (SER) plays a significant role in human–machine interaction. Recognizing and precisely classifying emotion from speech is a challenging task because a machine is unable to understand the context of the speech. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotion classification from speech signals; however, they are not efficient enough to accurately depict the emotional states of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art emotional speech datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). Our proposed method achieves an accuracy of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS in speaker-dependent SER experiments. Moreover, our method yields the best results for speaker-independent SER when compared with existing SER approaches based on handcrafted features.
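As a rough illustration of selecting features by correlation, the sketch below ranks feature columns by the absolute Pearson correlation with the class label and keeps the top k. This is a deliberate simplification: the abstract does not give the exact selection criterion, and a full correlation-based method may also penalize redundancy between features. The function names and toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def select_top_k(features, labels, k):
    """Return the sorted indices of the k columns most correlated with the labels."""
    n_cols = len(features[0])
    scores = [abs(pearson([row[j] for row in features], labels))
              for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: four samples, four features, binary labels.
# Column 0 rises with the label, column 1 is constant,
# column 2 is uncorrelated, column 3 falls with the label.
features = [
    [0.1, 5.0, 1.0, 3.0],
    [0.2, 5.0, -1.0, 3.1],
    [0.9, 5.0, 1.0, 1.0],
    [1.0, 5.0, -1.0, 0.9],
]
labels = [0, 0, 1, 1]
selected = select_top_k(features, labels, k=2)
print(selected)
```

Taking the absolute value keeps strongly negatively correlated features, which are as informative for a classifier as positively correlated ones.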


Speech Emotion Recognition System

International Journal of Advanced Research in Science, Communication and Technology ◽

10.48175/ijarsct-v4-i3-024 ◽

2021 ◽

pp. 156-159

Author(s):

Sourabh Suke ◽

Ganesh Regulwar ◽

Nikesh Aote ◽

Pratik Chaudhari ◽

Rajat Ghatode ◽

...

Keyword(s):

Emotion Recognition ◽

Automobile Industry ◽

Emotional State ◽

Recognition System ◽

Classification Model ◽

General Idea ◽

Speech Emotion Recognition ◽

Support Vector ◽

Emotional Speech ◽

Acoustic Features

This project describes "VoiEmo- A Speech Emotion Recognizer", a system for recognizing the emotional state of an individual from his/her speech. For example, speech becomes loud and fast, with a higher and wider pitch range, in a state of fear, anger, or joy, whereas the human voice is generally slow and low-pitched in sadness and tiredness. We have developed classification models for speech emotion detection based on convolutional neural networks (CNNs), support vector machines (SVMs), and multilayer perceptron (MLP) classifiers, which make predictions from acoustic features of the speech signal such as Mel-frequency cepstral coefficients (MFCCs). Our models have been trained to recognize eight common emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprise). For training and testing the models, we have used relevant data from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS). The system is advantageous because it can provide a general idea of the emotional state of the individual based on the acoustic features of the speech, irrespective of the language spoken; it also saves time and effort. Speech emotion recognition systems have applications in various fields, such as call centers and BPOs, criminal investigation, psychiatric therapy, and the automobile industry.


IMPROVED SPEAKER-INDEPENDENT EMOTION RECOGNITION FROM SPEECH USING TWO-STAGE FEATURE REDUCTION

Journal of Information and Communication Technology ◽

10.32890/jict2015.14.0.8156 ◽

2015 ◽

Author(s):

Hasrul Mohd Nazid ◽

Hariharan Muthusamy ◽

Vikneswaran Vijean ◽

Sazali Yaacob

Keyword(s):

Emotion Recognition ◽

Principal Component ◽

Feature Reduction ◽

Speech Emotion Recognition ◽

Emotional Speech ◽

Two Stage ◽

Linear Discriminant ◽

Speaker Independent ◽

Speech Features ◽

And Gender

In recent years, researchers have focused on improving the accuracy of speech emotion recognition. Generally, high accuracies have been obtained for two-class emotion recognition, but multi-class emotion recognition is still a challenging task. The main aim of this work is to propose a two-stage feature reduction using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for improving the accuracy of the speech emotion recognition (ER) system. Short-term speech features were extracted from the emotional speech signals. Experiments were carried out using four different supervised classifiers with two different emotional speech databases. The experimental results show that the proposed method achieves accuracies of 87.48% for the speaker-dependent (SD) and gender-dependent (GD) ER experiment, 85.15% for the speaker-independent (SI) ER experiment, and 87.09% for the gender-independent (GI) experiment.


Databases, Features and Classification Techniques for Speech Emotion Recognition

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.f3487.049620 ◽

2020 ◽

Vol 9 (6) ◽

pp. 185-190

Keyword(s):

Emotion Recognition ◽

Research Area ◽

Research Field ◽

Speech Emotion Recognition ◽

Emotional Speech ◽

Human Machine Interaction ◽

Classification Techniques ◽

Classification Feature ◽

Physical Gestures ◽

Machine Interaction

Emotion recognition is a rapidly growing research field. Emotions can be effectively expressed through speech and can provide insight into a speaker’s intentions. Although humans can easily interpret emotions through speech, physical gestures, and eye movement, training a machine to do the same with similar precision is quite a challenging task. SER systems can improve human-machine interaction when used with automatic speech recognition, as emotions tend to change the semantics of a sentence. Many researchers have contributed impressive work in this research area, leading to the development of numerous classification, feature selection, and feature extraction techniques and emotional speech databases. This paper reviews recent accomplishments in the area of speech emotion recognition. It also presents a detailed review of various types of emotional speech databases and of different classification techniques, which can be used individually or in combination, together with a brief description of various speech features for emotion recognition.


Emotion Recognition from Speech Using the Bag-of-Visual Words on Audio Segment Spectrograms

Technologies ◽

10.3390/technologies7010020 ◽

2019 ◽

Vol 7 (1) ◽

pp. 20 ◽

Cited By ~ 3

Author(s):

Evaggelos Spyrou ◽

Rozalia Nikopoulou ◽

Ioannis Vernikos ◽

Phivos Mylonas

Keyword(s):

Computer Vision ◽

Emotion Recognition ◽

Affective State ◽

Real Life ◽

Support Vector ◽

Bag Of Visual Words ◽

Educational Training ◽

Visual Words ◽

Speeded Up Robust Features ◽

Digital World

Monitoring and understanding a human’s emotional state plays a key role in current and forthcoming computational technologies. At the same time, this monitoring and analysis should be as unobtrusive as possible, since the digital world has been smoothly adopted in everyday life activities. In this framework, and within the domain of assessing humans’ affective state during their educational training, the most popular approach is to use sensory equipment that allows observing them without any kind of direct contact. Thus, in this work, we focus on human emotion recognition from audio stimuli (i.e., human speech) using a novel approach based on a computer vision inspired methodology, namely the bag-of-visual words method, applied on several audio segment spectrograms. Each spectrogram is considered to be the visual representation of the corresponding audio segment and may be analyzed by exploiting well-known traditional computer vision techniques, such as construction of a visual vocabulary, extraction of speeded-up robust features (SURF), quantization into a set of visual words, and image histogram construction. As a last step, support vector machine (SVM) classifiers are trained on the aforementioned information. Finally, to further generalize the proposed approach, we utilize publicly available datasets from several human languages to perform cross-language experiments, on both actor-created and real-life data.


Content-Based Image Retrieval using Local Features Descriptors and Bag-of-Visual Words

International Journal of Advanced Computer Science and Applications ◽

10.14569/ijacsa.2015.060929 ◽

2015 ◽

Vol 6 (9) ◽

Cited By ~ 9

Author(s):

Mohammed Alkhawlani ◽

Mohammed Elmogy ◽

Hazem Elbakry

Keyword(s):

Image Retrieval ◽

Local Features ◽

Content Based Image Retrieval ◽

Bag Of Visual Words ◽

Visual Words


Creation of speech corpus for emotion analysis in Gujarati language and its evaluation by various speech parameters

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i5.pp4752-4758 ◽

2020 ◽

Vol 10 (5) ◽

pp. 4752

Author(s):

Vishal P. Tank ◽

S. K. Hadia

Keyword(s):

Artificial Intelligence ◽

Facial Expression ◽

Emotion Recognition ◽

Speech Emotion Recognition ◽

Emotional States ◽

Emotional Speech ◽

Speech Corpus ◽

Speech Database ◽

Machine Communication ◽

Gujarati Language

In the last couple of years, emotion recognition has proven its significance in the areas of artificial intelligence and man-machine communication. Emotion recognition can be done using speech or images (facial expression); this paper deals with speech emotion recognition (SER) only. An emotional speech database is essential for emotion recognition. In this paper, we propose an emotional database developed in the Gujarati language, one of the official languages of India. The proposed speech corpus covers six emotional states: sadness, surprise, anger, disgust, fear, and happiness. To observe the effect of different emotions, the proposed Gujarati speech database is analyzed in MATLAB using speech parameters such as pitch, energy, and MFCC.


Group Emotion Recognition Based on Global and Local Features

IEEE Access ◽

10.1109/access.2019.2932797 ◽

2019 ◽

Vol 7 ◽

pp. 111617-111624 ◽

Cited By ~ 1

Author(s):

Dai Yu ◽

Liu Xingyu ◽

Dong Shuzhan ◽

Yang Lei

Keyword(s):

Emotion Recognition ◽

Local Features ◽

Global And Local ◽

Group Emotion


3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms

Entropy ◽

10.3390/e21050479 ◽

2019 ◽

Vol 21 (5) ◽

pp. 479 ◽

Cited By ~ 21

Author(s):

Noushin Hajarolasvadi ◽

Hasan Demirel

Keyword(s):

Emotion Recognition ◽

Speech Signal ◽

Expressed Emotion ◽

Audio Signal ◽

Research Direction ◽

Recognition System ◽

Speech Emotion Recognition ◽

Mel Frequency Cepstral Coefficients ◽

Audio Features ◽

3D Cnn

Detecting human intentions and emotions helps improve human–robot interactions. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on analysis of speech signals. First, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features, including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity, for each frame. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, by applying k-means clustering to the extracted features of all frames of each audio signal, we select the k most discriminant frames, namely keyframes, to summarize the speech signal. Then, the sequence of the corresponding keyframe spectrograms is encapsulated in a 3D tensor. These tensors are used to train and test a 3D convolutional neural network (CNN) using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE’05 databases. The results are superior to the state-of-the-art methods reported in the literature.
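The framing and keyframe steps described above can be sketched as follows. The sketch assumes cluster centroids have already been computed, and it picks the frame whose feature vector lies nearest each centroid; that selection rule, the one-dimensional stand-in feature (the frame mean), and all names are assumptions for illustration, since the abstract does not state how keyframes are chosen from the clusters.

```python
import math


def split_into_frames(signal, frame_length, hop_length):
    """Split a signal into equal-length frames; frames overlap when hop < length."""
    return [signal[start:start + frame_length]
            for start in range(0, len(signal) - frame_length + 1, hop_length)]


def select_keyframes(frame_features, centroids):
    """Return sorted indices of the frames nearest to each centroid (no duplicates)."""
    chosen = []
    for centroid in centroids:
        index = min(range(len(frame_features)),
                    key=lambda i: math.dist(frame_features[i], centroid))
        if index not in chosen:
            chosen.append(index)
    return sorted(chosen)


# Hypothetical toy signal: ten samples, frames of 4 samples with a hop of 2.
signal = [float(v) for v in range(10)]
frames = split_into_frames(signal, frame_length=4, hop_length=2)

# Stand-in feature: the frame mean as a 1-D vector (the paper uses 88 dimensions).
frame_features = [(sum(frame) / len(frame),) for frame in frames]

# Assumed centroids, e.g. from a k-means run with k = 2.
centroids = [(2.0,), (7.0,)]
keyframes = select_keyframes(frame_features, centroids)
print(len(frames), keyframes)
```

Stacking the spectrograms of the selected frames in index order would then give the 3D tensor that the abstract describes as the network input.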


Multi-scale image semantic recognition with hierarchical visual vocabulary

Computer Science and Information Systems ◽

10.2298/csis100423035j ◽

2011 ◽

Vol 8 (3) ◽

pp. 931-951 ◽

Cited By ~ 1

Author(s):

Xinghao Jiang ◽

Tanfeng Sun ◽

Fu Guanglei

Keyword(s):

Semantic Analysis ◽

Level Structure ◽

Local Features ◽

Bag Of Visual Words ◽

Semantic Model ◽

Visual Words ◽

Multi Scale ◽

Visual Vocabulary ◽

Video Semantic Analysis ◽

Relationship Of

Local features have proved effective in image/video semantic analysis. The BOVW (bag of visual words) scheme clusters local features to form a visual vocabulary of words, where each word is the center of one feature cluster. The vocabulary is used to recognize image semantics. In this paper, a new scheme for constructing a semantic-binding hierarchical visual vocabulary is proposed. Some attributes and relationships of the semantic nodes in the model are discussed. The hierarchical semantic model is used to organize the multi-scale semantics into a level-by-level structure. Experiments are performed on the LabelMe dataset, and the performance of our scheme is evaluated and compared with the traditional BOVW scheme. Experimental results demonstrate the efficiency and flexibility of our scheme.
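The flat vocabulary of the traditional BOVW scheme, in which each word is a cluster center, can be sketched with a small k-means loop (Lloyd's algorithm). This illustrates only the baseline idea, not the hierarchical scheme the paper proposes; the deterministic initialization from the first k points and the toy descriptors are assumptions made to keep the example reproducible.

```python
import math


def build_vocabulary(descriptors, k, iterations=20):
    """Cluster descriptors with k-means and return the k centers as visual words."""
    # Assumption: initialize centers with the first k descriptors.
    centers = [tuple(d) for d in descriptors[:k]]
    for _ in range(iterations):
        # Assignment step: attach each descriptor to its nearest center.
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            nearest = min(range(k), key=lambda i: math.dist(d, centers[i]))
            clusters[nearest].append(d)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(m[a] for m in members) / len(members) for a in range(dim)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


# Hypothetical toy descriptors forming two well-separated groups.
descriptors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
               (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
vocabulary = build_vocabulary(descriptors, k=2)
print(vocabulary)
```

In practice a library implementation with randomized or k-means++ initialization would usually be preferred; the hand-written loop is shown only to make the "word equals cluster center" idea concrete.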
