CLSTM: Deep Feature-Based Speech Emotion Recognition Using the Hierarchical ConvLSTM Network

Mathematics ◽  
2020 ◽  
Vol 8 (12) ◽  
pp. 2133
Author(s):  
Mustaqeem ◽  
Soonil Kwon

Artificial intelligence, deep learning, and machine learning are the dominant approaches for making systems smarter. Nowadays, a smart speech emotion recognition (SER) system is a basic necessity and an emerging research area of digital audio signal processing, and SER plays an important role in many applications related to human–computer interaction (HCI). Existing state-of-the-art SER systems have quite low prediction performance, which needs improvement to make them feasible for real-time commercial applications. The key reasons for the low accuracy and poor prediction rate are data scarcity and model configuration, which make building a robust machine learning technique the most challenging task. In this paper, we address the limitations of existing SER systems and propose a unique artificial intelligence (AI) based architecture for SER that utilizes hierarchical blocks of convolutional long short-term memory (ConvLSTM) with sequence learning. We designed four ConvLSTM blocks, called local features learning blocks (LFLBs), to extract local emotional features in a hierarchical correlation. The ConvLSTM layers are adopted for input-to-state and state-to-state transitions, extracting spatial cues through convolution operations. We placed four LFLBs to extract spatiotemporal cues from speech signals in hierarchical correlational form using a residual learning strategy. Furthermore, we utilized a novel sequence learning strategy to extract global information and adaptively adjust the relevant global feature weights according to the correlation of the input features. Finally, we used the center loss function together with the softmax loss to produce class probabilities.
The center loss improves the final classification results, ensures accurate prediction, and plays a conspicuous role in the whole proposed SER scheme. We tested the proposed system on two standard speech corpora, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and obtained recognition rates of 75% and 80%, respectively.
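The combined objective described in the abstract, softmax cross-entropy plus a weighted center loss that pulls each feature vector toward its class center, can be sketched in pure Python. This is an illustrative toy, not the authors' implementation; the function and variable names are assumptions:

```python
import math

def softmax_center_loss(features, labels, weights, centers, lam=0.5):
    """Joint softmax + center loss on a toy batch (illustrative sketch).

    features: list of feature vectors, labels: class indices,
    weights:  per-class weight vectors producing the softmax logits,
    centers:  per-class feature centers, lam: center-loss weight.
    """
    total = 0.0
    for x, y in zip(features, labels):
        # softmax cross-entropy over class logits w_c . x (log-sum-exp form)
        logits = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        ce = log_z - logits[y]
        # center loss: 1/2 * ||x - c_y||^2 pulls x toward its class center
        center = 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
        total += ce + lam * center
    return total / len(features)
```

When a feature already coincides with its class center, the center term vanishes and only the softmax term remains, which is why the center loss sharpens class clusters without disturbing well-placed samples.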

2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Chenchen Huang ◽  
Wei Gong ◽  
Wenlong Fu ◽  
Dongyu Feng

Feature extraction is a very important part of speech emotion recognition. Addressing this problem, this paper proposes a new feature extraction method that uses deep belief networks (DBNs), a form of deep neural network, to extract emotional features from the speech signal automatically. A five-layer DBN is trained to extract speech emotion features, and multiple consecutive frames are incorporated to form a high-dimensional feature. The features produced by the trained DBN are fed to a nonlinear SVM classifier, yielding a multiple-classifier speech emotion recognition system. The recognition rate of the system reached 86.5%, which was 7% higher than the original method.
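The step of incorporating multiple consecutive frames into one high-dimensional feature can be sketched as simple context stacking; the function name, context width, and edge-clamping policy below are illustrative assumptions, not details from the paper:

```python
def stack_frames(frames, context=2):
    """Concatenate each frame with its `context` neighbours on both sides,
    repeating edge frames at the boundaries, to form high-dimensional inputs."""
    stacked = []
    n = len(frames)
    for i in range(n):
        window = []
        for j in range(i - context, i + context + 1):
            j = min(max(j, 0), n - 1)  # clamp indices at the sequence edges
            window.extend(frames[j])
        stacked.append(window)
    return stacked
```

With `context=2`, a d-dimensional frame becomes a 5d-dimensional vector, which is the kind of high-dimensional input a DBN-plus-SVM pipeline would consume.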


2014 ◽  
Vol 571-572 ◽  
pp. 665-671 ◽  
Author(s):  
Sen Xu ◽  
Xu Zhao ◽  
Cheng Hua Duan ◽  
Xiao Lin Cao ◽  
Hui Yan Li ◽  
...  

Unlike other languages, Chinese tone changes are mainly decided by the vowels, so the vowel variation of Chinese tones is important in speech recognition research. Conventional tone recognition methods are based on the fundamental frequency of the signal, which cannot preserve the integrity of the tone signal. We propose mathematical morphological processing of spectrograms for the tones of Chinese vowels. First, the recorded tone signals are preprocessed with Cool Edit Pro software and converted into spectrograms; second, the spectrograms are smoothed and normalized by mathematical morphological processing; finally, whole-direction angle statistics of the tone signal are obtained by skeletonization. Neural network simulation shows that the speech emotion recognition rate can reach 92.50%.
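The morphological smoothing step can be illustrated on a binarized spectrogram with the classic opening operation (erosion followed by dilation), which removes isolated speckle while preserving larger tone structures. The 3x3 structuring element and helper names below are assumptions for illustration, not the paper's exact configuration:

```python
def _neigh(img, r, c):
    """3x3 neighbourhood values of pixel (r, c), clipped at the borders."""
    h, w = len(img), len(img[0])
    return [img[r + dr][c + dc]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if 0 <= r + dr < h and 0 <= c + dc < w]

def erode(img):
    """Binary erosion with a 3x3 square structuring element."""
    return [[1 if all(v == 1 for v in _neigh(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]

def dilate(img):
    """Binary dilation with a 3x3 square structuring element."""
    return [[1 if any(v == 1 for v in _neigh(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]

def opening(img):
    """Opening (erosion then dilation) removes isolated speckle pixels."""
    return dilate(erode(img))
```

An isolated 1-pixel is erased by the opening, while a solid 3x3 region survives unchanged, which is the smoothing behaviour wanted before skeletonization.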


Author(s):  
Jian Zhou ◽  
Guoyin Wang ◽  
Yong Yang

Speech emotion recognition is becoming more and more important in computer application fields such as health care and children's education. To improve prediction performance or provide a faster and more cost-effective recognition system, attribute selection is often carried out beforehand to select the important attributes from the input attribute set. However, the traditional feature selection methods used in speech emotion recognition are time-consuming when determining an optimal or suboptimal feature subset. Rough set theory offers an alternative formal methodology that can be employed to reduce the dimensionality of the data. The purpose of this study is to investigate the effectiveness of rough set theory in identifying important features in a speech emotion recognition system. Experiments on the CLDC emotional speech database clearly show that this approach can reduce the computational cost while retaining a suitably high recognition rate.
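Rough-set attribute reduction can be sketched with the standard dependency degree and a greedy QuickReduct-style search: keep adding the conditional attribute that most increases how many rows are unambiguously classified by the selected attributes. This is a generic textbook sketch on a toy decision table, not the paper's algorithm; all names are illustrative:

```python
def partition(table, attrs):
    """Group row indices by their values on the given attribute columns."""
    blocks = {}
    for i, row in enumerate(table):
        blocks.setdefault(tuple(row[a] for a in attrs), []).append(i)
    return list(blocks.values())

def dependency(table, attrs, decision):
    """gamma(B, D): fraction of rows whose B-block is pure in the decision."""
    pos = 0
    for block in partition(table, attrs):
        if len({table[i][decision] for i in block}) == 1:
            pos += len(block)
    return pos / len(table)

def quick_reduct(table, cond_attrs, decision):
    """Greedy reduct search: add the attribute that raises gamma the most."""
    full = dependency(table, cond_attrs, decision)
    red = []
    while dependency(table, red, decision) < full:
        best = max((a for a in cond_attrs if a not in red),
                   key=lambda a: dependency(table, red + [a], decision))
        red.append(best)
    return red
```

On a table where the decision depends on a single column, the search stops after selecting just that column, which is exactly the dimensionality reduction the abstract describes.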


2016 ◽  
Vol 7 (1) ◽  
pp. 58-68 ◽  
Author(s):  
Imen Trabelsi ◽  
Med Salim Bouhlel

Automatic Speech Emotion Recognition (SER) is a current research topic in the field of Human-Computer Interaction (HCI) with a wide range of applications. The purpose of a speech emotion recognition system is to automatically classify a speaker's utterances into different emotional states such as disgust, boredom, sadness, neutral, and happiness. The speech samples in this paper are from the Berlin emotional database. Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction (PLP), and relative spectral perceptual linear prediction (RASTA-PLP) features are used to characterize the emotional utterances, using a combination of Gaussian mixture models (GMM) and support vector machines (SVM) based on the Kullback-Leibler divergence kernel. In this study, the effects of feature type and dimension are comparatively investigated. The best results are obtained with 12-coefficient MFCC features: a recognition rate of 84% has been achieved, which is close to human performance on this database.
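A Kullback-Leibler divergence kernel between Gaussian utterance models can be sketched for the diagonal-covariance case: compute the closed-form KL divergence in each direction, symmetrize, and exponentiate. This is a generic sketch of the kernel family (the scale parameter `a` and function names are assumptions), not the paper's exact GMM formulation:

```python
import math

def kl_diag_gauss(mu1, var1, mu2, var2):
    """Closed-form KL(N1 || N2) for diagonal-covariance Gaussians."""
    kl = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        kl += 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    return kl

def kl_kernel(g1, g2, a=1.0):
    """Symmetrised-KL kernel K = exp(-a * (KL(p||q) + KL(q||p))).

    g1, g2 are (mean-vector, variance-vector) pairs for one utterance each.
    """
    d = kl_diag_gauss(*g1, *g2) + kl_diag_gauss(*g2, *g1)
    return math.exp(-a * d)
```

The kernel equals 1 for identical models and decays toward 0 as the two utterance distributions diverge, giving the SVM a similarity measure over GMM-style utterance models.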


2020 ◽  
Vol 8 (5) ◽  
pp. 2266-2276 ◽  

In earlier days, speech was simply a means of communication, the way a listener is addressed by voice or expression. But machine learning concepts and various methods are necessary for speech recognition in the context of interaction with machines. With voice gaining use and significance as a biometric, speech has become an important part of speech-technology development. In this article, we explain a variety of speech and emotion recognition techniques and compare several methods based on existing algorithms, mostly speech-based ones. We list and distinguish speech technologies with a focus on the specifications, databases, classification, feature extraction, enhancement, segmentation, and overall process of speech emotion recognition.


2016 ◽  
Vol 10 (1) ◽  
pp. 35-41 ◽  
Author(s):  
Tatjana Liogienė ◽  
Gintautas Tamulevičius

Abstract: Intensive research on speech emotion recognition has introduced a huge collection of speech emotion features, and large feature sets complicate the speech emotion recognition task. Among the various feature selection and transformation techniques for one-stage classification, multiple-classifier systems have been proposed. The main idea of multiple classifiers is to arrange the emotion classification process in stages. Besides parallel and serial arrangements, the hierarchical arrangement of multi-stage classification is the most widely used for speech emotion recognition. In this paper, we present a sequential-forward-feature-selection-based multi-stage classification scheme. The Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS) techniques were employed at every stage of the multi-stage classification scheme. Experimental testing of the proposed scheme was performed using the German and Lithuanian emotional speech datasets. Sequential-feature-selection-based multi-stage classification outperformed the single-stage scheme by 12-42% for different emotion sets. The multi-stage scheme also showed higher robustness to growth of the emotion set: its decrease in recognition rate as the emotion set grew was 10-20% lower than in the single-stage case. Differences between SFS and SFFS for feature selection were negligible.
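The SFS step used at each classification stage can be sketched generically: starting from an empty subset, greedily add the feature whose inclusion maximizes a scoring function (SFFS additionally allows conditional removals after each addition, omitted here for brevity). The scorer and feature names below are illustrative assumptions, not the paper's setup:

```python
def sfs(features, score, k):
    """Sequential Forward Selection: greedily add the feature that
    improves the subset score the most, up to k features (toy sketch)."""
    selected = []
    while len(selected) < k:
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected
```

In a real multi-stage scheme, `score` would be the cross-validated recognition rate of that stage's classifier on the candidate feature subset.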


Author(s):  
Vaibhav K. P.

Abstract: Speech emotion recognition is a trending research topic these days, with the main motive of improving human-machine interaction. At present, most work in this area extracts discriminatory features for the purpose of classifying emotions into various categories, and much of it relies on uttered words, which are used for lexical analysis in emotion recognition. In our project, a technique is utilized for classifying emotions into the 'Angry', 'Calm', 'Fearful', 'Happy', and 'Sad' categories.

