A Study on the Search of the Most Discriminative Speech Features in the Speaker Dependent Speech Emotion Recognition

Speech Emotion Recognition Framework based on User Self-referential Speech Features

2018 IEEE 7th Global Conference on Consumer Electronics (GCCE) ◽

10.1109/gcce.2018.8574676 ◽

2018 ◽

Cited By ~ 1

Author(s):

Kyoungju Noh ◽

Seungeun Chung ◽

Jiyoun Lim ◽

Gague Kim ◽

Hyuntae Jeong

Keyword(s):

Emotion Recognition ◽

Speech Emotion Recognition ◽

Speech Features

Download Full-text

IMPROVED SPEAKER-INDEPENDENT EMOTION RECOGNITION FROM SPEECH USING TWO-STAGE FEATURE REDUCTION

Journal of Information and Communication Technology ◽

10.32890/jict2015.14.0.8156 ◽

2015 ◽

Author(s):

Hasrul Mohd Nazid ◽

Hariharan Muthusamy ◽

Vikneswaran Vijean ◽

Sazali Yaacob

Keyword(s):

Emotion Recognition ◽

Principal Component ◽

Feature Reduction ◽

Speech Emotion Recognition ◽

Emotional Speech ◽

Two Stage ◽

Linear Discriminant ◽

Speaker Independent ◽

Speech Features ◽

And Gender

In the recent years, researchers are focusing to improve the accuracy of speech emotion recognition. Generally, high emotion recognition accuracies were obtained for two-class emotion recognition, but multi-class emotion recognition is still a challenging task . The main aim of this work is to propose a two-stage feature reduction using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for improving the accuracy of the speech emotion recognition (ER) system. Short-term speech features were extracted from the emotional speech signals. Experiments were carried out using four different supervised classifi ers with two different emotional speech databases. From the experimental results, it can be inferred that the proposed method provides better accuracies of 87.48% for speaker dependent (SD) and gender dependent (GD) ER experiment, 85.15% for speaker independent (SI) ER experiment, and 87.09% for gender independent (GI) experiment.

Download Full-text

Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2020.14 ◽

2020 ◽

Vol 9 ◽

Cited By ~ 1

Author(s):

Bagus Tris Atmaja ◽

Masato Akagi

Keyword(s):

Emotion Recognition ◽

Multitask Learning ◽

Speech Emotion Recognition ◽

Word Embeddings ◽

Concordance Correlation ◽

Acoustic Networks ◽

Overall Evaluation ◽

Speech Features ◽

Two Parameters ◽

Emotion Labels

Abstract The majority of research in speech emotion recognition (SER) is conducted to recognize emotion categories. Recognizing dimensional emotion attributes is also important, however, and it has several advantages over categorical emotion. For this research, we investigate dimensional SER using both speech features and word embeddings. The concatenation network joins acoustic networks and text networks from bimodal features. We demonstrate that those bimodal features, both are extracted from speech, improve the performance of dimensional SER over unimodal SER either using acoustic features or word embeddings. A significant improvement on the valence dimension is contributed by the addition of word embeddings to SER system, while arousal and dominance dimensions are also improved. We proposed a multitask learning (MTL) approach for the prediction of all emotional attributes. This MTL maximizes the concordance correlation between predicted emotion degrees and true emotion labels simultaneously. The findings suggest that the use of MTL with two parameters is better than other evaluated methods in representing the interrelation of emotional attributes. In unimodal results, speech features attain higher performance on arousal and dominance, while word embeddings are better for predicting valence. The overall evaluation uses the concordance correlation coefficient score of the three emotional attributes. We also discuss some differences between categorical and dimensional emotion results from psychological and engineering perspectives.

Download Full-text

Unsupervised Feature Learning for Speech Emotion Recognition Based on Autoencoder

Electronics ◽

10.3390/electronics10172086 ◽

2021 ◽

Vol 10 (17) ◽

pp. 2086

Author(s):

Yangwei Ying ◽

Yuanwu Tu ◽

Hong Zhou

Keyword(s):

Emotion Recognition ◽

Data Augmentation ◽

Feature Learning ◽

Human Potential ◽

Speech Emotion Recognition ◽

Unsupervised Feature Learning ◽

Learning Techniques ◽

Speech Data ◽

Data Division ◽

Speech Features

Speech signals contain abundant information on personal emotions, which plays an important part in the representation of human potential characteristics and expressions. However, the deficiency of emotion speech data affects the development of speech emotion recognition (SER), which also limits the promotion of recognition accuracy. Currently, the most effective approach is to make use of unsupervised feature learning techniques to extract speech features from available speech data and generate emotion classifiers with these features. In this paper, we proposed to implement autoencoders such as a denoising autoencoder (DAE) and an adversarial autoencoder (AAE) to extract the features from LibriSpeech for model pre-training, and then conducted experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets for classification. Considering the imbalance of data distribution in IEMOCAP, we developed a novel data augmentation approach to optimize the overlap shift between consecutive segments and redesigned the data division. The best classification accuracy reached 78.67% (weighted accuracy, WA) and 76.89% (unweighted accuracy, UA) with AAE. Compared with state-of-the-art results to our knowledge (76.18% of WA and 76.36% of UA with the supervised learning method), we achieved a slight advantage. This suggests that using unsupervised learning benefits the development of SER and provides a new approach to eliminate the problem of data scarcity.

Download Full-text

Speech Emotion Recognition Using Speech Feature and Word Embedding

10.31227/osf.io/b34gn ◽

2019 ◽

Cited By ~ 1

Author(s):

Bagus Tris Atmaja

Keyword(s):

Emotion Recognition ◽

Automatic Speech Recognition ◽

Recognition Accuracy ◽

Speech Emotion Recognition ◽

Input Feature ◽

Text Features ◽

Speech Feature ◽

Speech Features ◽

Speech Segments ◽

Fully Connected

Emotion recognition can be performed automati-cally from many modalities. This paper presents categoricalspeech emotion recognition using speech feature and wordembedding. Text features can be combined with speech features toimprove emotion recognition accuracy, and both features can beobtained from speech via automatic speech recognition. Here, weuse speech segments of an utterance where the acoustic featureis extracted for speech emotion recognition. Word embeddingis used as an input feature for text emotion recognition anda combination of both features is proposed for performanceimprovement purpose. Two unidirectional LSTM layers are usedfor text and fully connected layers are applied for acousticemotion recognition. Both networks then are merged to produceone of four predicted emotion categories by fully connectednetworks. The result shows the combination of speech and textachieve higher accuracy i.e. 75.49% compared to speech only with71.34% or text only emotion recognition with 66.09%. This resultalso outperforms the previously proposed methods by othersusing the same dataset on the same and/or similar modalities.

Download Full-text

Speech Emotion Recognition Using Deep Feedforward Neural Network

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v10.i2.pp554-561 ◽

2018 ◽

Vol 10 (2) ◽

pp. 554 ◽

Cited By ~ 3

Author(s):

Muhammad Fahreza Alghifari ◽

Teddy Surya Gunawan ◽

Mira Kartiwi

Keyword(s):

Emotion Recognition ◽

Recognition Rate ◽

Speech Emotion Recognition ◽

Accuracy Rate ◽

Optimum Configuration ◽

Custom Made ◽

Audio Data ◽

Speech Feature ◽

Speech Features ◽

Mel Frequency Cepstral Coefficient

Speech emotion recognition (SER) is currently a research hotspot due to its challenging nature but bountiful future prospects. The objective of this research is to utilize Deep Neural Networks (DNNs) to recognize human speech emotion. First, the chosen speech feature Mel-frequency cepstral coefficient (MFCC) were extracted from raw audio data. Second, the speech features extracted were fed into the DNN to train the network. The trained network was then tested onto a set of labelled emotion speech audio and the recognition rate was evaluated. Based on the accuracy rate the MFCC, number of neurons and layers are adjusted for optimization. Moreover, a custom-made database is introduced and validated using the network optimized. The optimum configuration for SER is 13 MFCC, 12 neurons and 2 layers for 3 emotions and 25 MFCC, 21 neurons and 4 layers for 4 emotions, achieving a total recognition rate of 96.3% for 3 emotions and 97.1% for 4 emotions.Speech emotion recognition (SER) is currently a research hotspot due to its challenging nature but bountiful future prospects. The objective of this research is to utilize Deep Neural Networks (DNNs) to recognize human speech emotion. First, the chosen speech feature Mel-frequency cepstral coefficient (MFCC) were extracted from raw audio data. Second, the speech features extracted were fed into the DNN to train the network. The trained network was then tested onto a set of labelled emotion speech audio and the recognition rate was evaluated. Based on the accuracy rate the MFCC, number of neurons and layers are adjusted for optimization. Moreover, a custom-made database is introduced and validated using the network optimized.The optimum configuration for SER is 13 MFCC, 12 neurons and 2 layers for 3 emotions and 25 MFCC, 21 neurons and 4 layers for 4 emotions, achieving a total recognition rate of 96.3% for 3 emotions and 97.1% for 4 emotions.

Download Full-text

Speech Emotion Recognition Based on Sparse Representation

Archives of Acoustics ◽

10.2478/aoa-2013-0055 ◽

2013 ◽

Vol 38 (4) ◽

pp. 465-470 ◽

Cited By ~ 11

Author(s):

Jingjie Yan ◽

Xiaolan Wang ◽

Weiyi Gu ◽

LiLi Ma

Keyword(s):

Dimensionality Reduction ◽

Emotion Recognition ◽

Least Squares ◽

Partial Least Squares ◽

Partial Least Squares Regression ◽

Speech Emotion Recognition ◽

Least Squares Regression ◽

Computer Science Pedagogy ◽

Reduction Methods ◽

Analysis Computer

Abstract Speech emotion recognition is deemed to be a meaningful and intractable issue among a number of do- mains comprising sentiment analysis, computer science, pedagogy, and so on. In this study, we investigate speech emotion recognition based on sparse partial least squares regression (SPLSR) approach in depth. We make use of the sparse partial least squares regression method to implement the feature selection and dimensionality reduction on the whole acquired speech emotion features. By the means of exploiting the SPLSR method, the component parts of those redundant and meaningless speech emotion features are lessened to zero while those serviceable and informative speech emotion features are maintained and selected to the following classification step. A number of tests on Berlin database reveal that the recogni- tion rate of the SPLSR method can reach up to 79.23% and is superior to other compared dimensionality reduction methods.

Download Full-text