Multiclass SVM-based language-independent emotion recognition using selective speech features

Mobile-based human emotion recognition is a very challenging subject. Most of the approaches suggested and built in this field use various contexts derived from external sensors and the smartphone, but these approaches suffer from different obstacles and challenges. The proposed system integrates the human speech signal and heart rate in one system to improve the accuracy of human emotion recognition. It is designed to recognize four human emotions: angry, happy, sad and normal. In this system, the smartphone is used to record the user's speech and send it to a server. The smartwatch, worn on the user's wrist, measures the user's heart rate while the user is speaking and sends it, via Bluetooth, to the smartphone, which in turn sends it to the server. At the server side, speech features are extracted from the speech signal to be classified by a neural network. To minimize misclassification by the neural network, the user's heart rate measurement is used to direct the extracted speech features to either the excited (angry and happy) neural network or the calm (sad and normal) neural network. In spite of the challenges associated with the system, it achieved 96.49% accuracy for known speakers and 79.05% for unknown speakers.
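
The heart-rate routing step described above can be illustrated with a minimal sketch. The abstract gives no threshold and no classifier details, so the 90 bpm cut-off, the function names and the stand-in classifiers below are all hypothetical.

```python
# Hypothetical sketch of heart-rate-based routing between two classifiers.
# The threshold and the stand-in classifiers are assumptions for illustration.

EXCITED = ("angry", "happy")
CALM = ("sad", "normal")


def route_by_heart_rate(heart_rate_bpm, threshold_bpm=90.0):
    """Pick the classifier group from the heart rate (threshold is assumed)."""
    return "excited" if heart_rate_bpm >= threshold_bpm else "calm"


def recognize(features, heart_rate_bpm, classifiers, threshold_bpm=90.0):
    """Send the speech features to the classifier chosen by the heart rate."""
    group = route_by_heart_rate(heart_rate_bpm, threshold_bpm)
    label = classifiers[group](features)
    return group, label


# Stand-in classifiers: each only separates the two emotions of its own group.
classifiers = {
    "excited": lambda f: EXCITED[0] if f[0] > 0 else EXCITED[1],
    "calm": lambda f: CALM[0] if f[0] > 0 else CALM[1],
}

print(recognize([0.4, 1.2], 105, classifiers))
```

Each classifier then only has to separate two emotions rather than four, which is the stated motivation for the routing.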

IMPROVED SPEAKER-INDEPENDENT EMOTION RECOGNITION FROM SPEECH USING TWO-STAGE FEATURE REDUCTION

Journal of Information and Communication Technology ◽ 10.32890/jict2015.14.0.8156 ◽ 2015

Author(s): Hasrul Mohd Nazid ◽ Hariharan Muthusamy ◽ Vikneswaran Vijean ◽ Sazali Yaacob

Keyword(s): Emotion Recognition ◽ Principal Component ◽ Feature Reduction ◽ Speech Emotion Recognition ◽ Emotional Speech ◽ Two Stage ◽ Linear Discriminant ◽ Speaker Independent ◽ Speech Features ◽ And Gender

In recent years, researchers have focused on improving the accuracy of speech emotion recognition. Generally, high emotion recognition accuracies have been obtained for two-class emotion recognition, but multi-class emotion recognition is still a challenging task. The main aim of this work is to propose a two-stage feature reduction using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for improving the accuracy of the speech emotion recognition (ER) system. Short-term speech features were extracted from the emotional speech signals. Experiments were carried out using four different supervised classifiers with two different emotional speech databases. From the experimental results, it can be inferred that the proposed method provides better accuracies of 87.48% for the speaker dependent (SD) and gender dependent (GD) ER experiment, 85.15% for the speaker independent (SI) ER experiment, and 87.09% for the gender independent (GI) experiment.
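
A two-stage PCA-then-LDA reduction of this kind can be sketched as follows. This is a generic NumPy illustration on synthetic data, not the authors' implementation: the function names, the component counts (10 and 3) and the random features are assumptions.

```python
import numpy as np


def pca_fit(X, n_components):
    """Stage 1: return the feature mean and the top principal directions."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def lda_fit(Z, y, n_components):
    """Stage 2: Fisher LDA directions from within/between-class scatter."""
    overall = Z.mean(axis=0)
    d = Z.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)
        diff = (mc - overall)[:, None]
        Sb += len(Zc) * (diff @ diff.T)
    # Directions maximizing between-class over within-class scatter.
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(evals.real)[::-1][:n_components]
    return evecs[:, order].real


def two_stage_reduce(X, y, n_pca, n_lda):
    """Project with PCA first, then with LDA on the PCA scores."""
    mean, W_pca = pca_fit(X, n_pca)
    Z = (X - mean) @ W_pca
    return Z @ lda_fit(Z, y, n_lda)


# Synthetic stand-in for short-term speech features: 4 classes, 30 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))
y = np.repeat(np.arange(4), 30)
X[np.arange(120), y] += 3.0  # shift class c along feature c

reduced = two_stage_reduce(X, y, n_pca=10, n_lda=3)
print(reduced.shape)
```

LDA yields at most (number of classes − 1) useful directions, so with four emotion classes the second stage cannot produce more than three dimensions.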

Multi-Modal Emotion Recognition Using Speech Features and Text-Embedding

Applied Sciences ◽ 10.3390/app11177967 ◽ 2021 ◽ Vol 11 (17) ◽ pp. 7967

Author(s): Sung-Woo Byun ◽ Ju-Hee Kim ◽ Seok-Pil Lee

Keyword(s): Emotion Recognition ◽ Acoustic Feature ◽ Natural Interaction ◽ Text Data ◽ Feature Vectors ◽ Proposed Model ◽ Accurate Performance ◽ Speech Features ◽ Personal Assistants ◽ Deep Learning Model

Recently, intelligent personal assistants, chat-bots and AI speakers have been used more broadly as communication interfaces, and the demand for more natural interaction has increased as well. Humans can express emotions in various ways, such as voice tone or facial expression; therefore, multimodal approaches to recognizing human emotions have been studied. In this paper, we propose an emotion recognition method that achieves higher accuracy by using both speech and text data, drawing on the strengths of each. We extracted 43 feature vectors, such as spectral features, harmonic features and MFCC, from speech datasets. In addition, 256 embedding vectors were extracted from transcripts using a pre-trained Tacotron encoder. The acoustic feature vectors and embedding vectors were each fed into a deep learning model, which produced a probability for the predicted output classes. The results show that the proposed model is more accurate than those in previous research.
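
The abstract does not say how the per-modality class probabilities are combined. One common option is a weighted average (late fusion), sketched below purely as an assumption: the function name, the equal weighting and the four example class labels are hypothetical and not taken from the paper.

```python
# Hypothetical late-fusion sketch: average the class probabilities produced
# by an acoustic model and a text model. The weighting is an assumption.

LABELS = ["angry", "happy", "sad", "neutral"]  # illustrative labels only


def fuse(p_audio, p_text, w_audio=0.5):
    """Weighted average of two probability vectors, renormalized to sum to 1."""
    fused = [w_audio * a + (1.0 - w_audio) * t for a, t in zip(p_audio, p_text)]
    total = sum(fused)
    return [f / total for f in fused]


p_audio = [0.7, 0.1, 0.1, 0.1]  # stand-in output of the acoustic model
p_text = [0.4, 0.4, 0.1, 0.1]   # stand-in output of the text model

fused = fuse(p_audio, p_text)
predicted = LABELS[max(range(len(fused)), key=fused.__getitem__)]
print(predicted)
```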

Optimized multi-channel deep neural network with 2D graphical representation of acoustic speech features for emotion recognition

2014 8th International Conference on Signal Processing and Communication Systems (ICSPCS) ◽ 10.1109/icspcs.2014.7021120 ◽ 2014 ◽ Cited By ~ 3

Author(s): Melissa N Stolar ◽ Margaret Lech ◽ Ian S Burnett

Keyword(s): Neural Network ◽ Emotion Recognition ◽ Deep Neural Network ◽ Graphical Representation ◽ Speech Features

Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning

APSIPA Transactions on Signal and Information Processing ◽ 10.1017/atsip.2020.14 ◽ 2020 ◽ Vol 9 ◽ Cited By ~ 1

Author(s): Bagus Tris Atmaja ◽ Masato Akagi

Keyword(s): Emotion Recognition ◽ Multitask Learning ◽ Speech Emotion Recognition ◽ Word Embeddings ◽ Concordance Correlation ◽ Acoustic Networks ◽ Overall Evaluation ◽ Speech Features ◽ Two Parameters ◽ Emotion Labels

The majority of research in speech emotion recognition (SER) is conducted to recognize emotion categories. Recognizing dimensional emotion attributes is also important, however, and it has several advantages over categorical emotion. For this research, we investigate dimensional SER using both speech features and word embeddings. The concatenation network joins acoustic networks and text networks from bimodal features. We demonstrate that those bimodal features, both extracted from speech, improve the performance of dimensional SER over unimodal SER using either acoustic features or word embeddings. Adding word embeddings to the SER system contributes a significant improvement on the valence dimension, while the arousal and dominance dimensions are also improved. We propose a multitask learning (MTL) approach for the prediction of all emotional attributes. This MTL maximizes the concordance correlation between predicted emotion degrees and true emotion labels simultaneously. The findings suggest that the use of MTL with two parameters is better than other evaluated methods in representing the interrelation of emotional attributes. In unimodal results, speech features attain higher performance on arousal and dominance, while word embeddings are better for predicting valence. The overall evaluation uses the concordance correlation coefficient score of the three emotional attributes. We also discuss some differences between categorical and dimensional emotion results from psychological and engineering perspectives.
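
The concordance correlation coefficient (CCC) used for evaluation has a standard definition, sketched below in plain Python. The way two weights combine the three per-attribute losses in `mtl_loss` is an assumption about what "MTL with two parameters" could look like; the abstract does not give the formula, and the weight values and example ratings are made up.

```python
def ccc(pred, true):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true)) / n
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


def mtl_loss(preds, trues, alpha, beta):
    """Assumed two-weight combination of the per-attribute losses (1 - CCC)."""
    loss_v = 1.0 - ccc(preds["valence"], trues["valence"])
    loss_a = 1.0 - ccc(preds["arousal"], trues["arousal"])
    loss_d = 1.0 - ccc(preds["dominance"], trues["dominance"])
    return alpha * loss_v + beta * loss_a + (1.0 - alpha - beta) * loss_d


# Made-up ratings for three utterances.
trues = {
    "valence": [1.0, 3.0, 4.0],
    "arousal": [2.0, 2.5, 4.0],
    "dominance": [3.0, 3.5, 2.0],
}
preds = {
    "valence": [1.5, 2.5, 4.0],
    "arousal": [2.0, 3.0, 3.5],
    "dominance": [3.0, 3.0, 2.5],
}

print(mtl_loss(preds, trues, alpha=0.7, beta=0.2))
```

Unlike Pearson correlation, CCC also penalizes differences in mean and scale between predictions and labels, so a loss of the form 1 − CCC reaches zero only when the two sequences match exactly.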

Enhancing robustness of speech recognizers by bimodal features

Facta universitatis - series Electronics and Energetics ◽ 10.2298/fuee0602287g ◽ 2006 ◽ Vol 19 (2) ◽ pp. 287-298

Author(s): Inge Gavat ◽ Gabriel Costache ◽ Claudia Iancu

Keyword(s): Speech Signal ◽ Additive Noise ◽ Effective Algorithm ◽ Feature Vectors ◽ Speech Recognizers ◽ Combined Features ◽ Speech Features ◽ Multiclass Svm ◽ Speech Recognizer ◽ Better Than

In this paper, a robust speech recognizer is presented, based on features obtained from the speech signal and also from the image of the speaker. The features were combined by simple concatenation, resulting in composite feature vectors used to train the models corresponding to each class. For recognition, the classification process relies on a very effective algorithm, namely the multiclass SVM. Under additive noise conditions, the bimodal system based on combined features performs better than the unimodal system based only on the speech features; the added information obtained from the image plays an important role in improving robustness.
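
The concatenation of the two feature streams, followed by a multiclass decision, can be sketched as below. The abstract does not specify the multiclass SVM scheme, so the one-vs-rest linear decision rule, the hand-set weights and the class labels here are hypothetical, and training is omitted entirely.

```python
# Hypothetical sketch: concatenate audio and image features, then apply a
# one-vs-rest linear decision rule (the prediction step of one common
# multiclass SVM scheme). Weights are hand-set for illustration, not trained.


def concatenate(audio_features, image_features):
    """Build the combined feature vector by simple concatenation."""
    return list(audio_features) + list(image_features)


def one_vs_rest_predict(x, models):
    """Return the label whose linear score w.x + b is largest."""
    scores = {
        label: sum(w * xi for w, xi in zip(weights, x)) + bias
        for label, (weights, bias) in models.items()
    }
    return max(scores, key=scores.get)


# One (weights, bias) pair per class; the values are made up.
models = {
    "yes": ([1.0, -1.0, 0.5], 0.0),
    "no": ([-1.0, 1.0, 0.0], 0.1),
}

x = concatenate([0.2, 0.9], [0.5])  # two audio features, one image feature
print(one_vs_rest_predict(x, models))
```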
