Multi-Modal Emotion Recognition Using Speech Features and Text-Embedding

2021 ◽  
Vol 11 (17) ◽  
pp. 7967
Author(s):  
Sung-Woo Byun ◽  
Ju-Hee Kim ◽  
Seok-Pil Lee

Recently, intelligent personal assistants, chat-bots, and AI speakers have been utilized more broadly as communication interfaces, and the demand for more natural interaction has increased as well. Humans can express emotions in various ways, such as through voice tone or facial expressions; therefore, multimodal approaches to recognizing human emotions have been studied. In this paper, we propose an emotion recognition method that delivers higher accuracy by using both speech and text data, exploiting the strengths of each. We extracted 43 acoustic features, such as spectral features, harmonic features, and MFCCs, from speech datasets. In addition, we extracted 256-dimensional embedding vectors from the transcripts using a pre-trained Tacotron encoder. The acoustic feature vectors and embedding vectors were fed into separate deep learning models, each producing a probability for the predicted output classes. The results show that the proposed model exhibits more accurate performance than previous research.
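The fusion step described above (per-modality models each emit class probabilities, which are then combined into one prediction) can be sketched as follows; the weighted-average rule and the four-class setup are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def late_fusion(speech_probs, text_probs, w_speech=0.5):
    """Weighted average of per-modality class probabilities."""
    speech_probs = np.asarray(speech_probs, dtype=float)
    text_probs = np.asarray(text_probs, dtype=float)
    fused = w_speech * speech_probs + (1.0 - w_speech) * text_probs
    return fused / fused.sum()  # renormalize to a probability vector

# Hypothetical outputs of the two sub-models for four emotion classes
speech = [0.6, 0.2, 0.1, 0.1]
text = [0.3, 0.5, 0.1, 0.1]
fused = late_fusion(speech, text)
predicted = int(np.argmax(fused))
```

The speech model is more confident in class 0 and the text model in class 1; with equal weights, class 0 wins.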

Agronomy ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 1307
Author(s):  
Haoriqin Wang ◽  
Huaji Zhu ◽  
Huarui Wu ◽  
Xiaomin Wang ◽  
Xiao Han ◽  
...  

In the question-and-answer (Q&A) communities of the “China Agricultural Technology Extension Information Platform”, thousands of rice-related Chinese questions are newly added every day. The rapid detection of questions with the same semantics is the key to the success of a rice-related intelligent Q&A system. To allow fast and automatic detection of rice-related questions with the same semantics, we propose a new method based on Coattention-DenseGRU (Gated Recurrent Unit). According to the characteristics of rice-related questions, we applied Word2vec with the TF-IDF (Term Frequency–Inverse Document Frequency) method to process and analyze the text data, and compared it with the Word2vec, GloVe, and TF-IDF methods alone. Combined with an agricultural word segmentation dictionary, applying Word2vec with the TF-IDF method effectively solves the problem of high-dimensional, sparse data in rice-related text. Each network layer employs the features of all previous recurrent layers' hidden states. To alleviate the growth in feature vector size caused by dense splicing, an autoencoder was used after dense concatenation. The experimental results show that rice-related question similarity matching based on Coattention-DenseGRU can improve the utilization of text features, reduce the loss of features, and achieve fast and accurate similarity matching on the rice-related question dataset. The precision and F1 values of the proposed model were 96.3% and 96.9%, respectively. Compared with seven other question similarity matching models, we present a new state-of-the-art method on our rice-related question dataset.
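The Word2vec-with-TF-IDF representation described above (word vectors averaged with TF-IDF weights) might be sketched as follows; the toy corpus, the 2-D word vectors, and the smoothed-IDF formula are illustrative assumptions, not the paper's exact setup:

```python
import math
import numpy as np

def tfidf_weights(doc_tokens, corpus):
    """TF-IDF weight per token in one document, over a toy corpus."""
    n_docs = len(corpus)
    weights = {}
    for tok in set(doc_tokens):
        tf = doc_tokens.count(tok) / len(doc_tokens)
        df = sum(1 for d in corpus if tok in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        weights[tok] = tf * idf
    return weights

def weighted_sentence_vector(doc_tokens, corpus, vectors):
    """TF-IDF-weighted average of word vectors for one question."""
    w = tfidf_weights(doc_tokens, corpus)
    num = sum(w[t] * vectors[t] for t in doc_tokens)
    den = sum(w[t] for t in doc_tokens)
    return num / den

corpus = [["rice", "blast", "control"], ["rice", "seed", "rate"], ["wheat", "rust"]]
vectors = {"rice": np.array([1.0, 0.0]), "blast": np.array([0.0, 1.0])}
vec = weighted_sentence_vector(["rice", "blast"], corpus, vectors)
```

Because "blast" is rarer in the corpus than "rice", its higher IDF pulls the sentence vector toward the "blast" word vector.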


Author(s):  
Miao Cheng ◽  
Ah Chung Tsoi

As a general means of expression, audio analysis and recognition have attracted much attention for their wide applications in the real world. Audio emotion recognition (AER) attempts to understand the emotional state of a human from a given utterance signal, and has been widely studied for its role in developing friendly human–machine interfaces. Though several state-of-the-art auditory methods have been devised for audio recognition, most of them focus on the discriminative use of acoustic features, while the efficiency demands of recognition are ignored. This constrains the practical application of AER, where rapid learning of emotion patterns is desired. To make prediction of audio emotion practical, the speaker-dependent patterns of audio emotions are learned with multiresolution analysis, and fractal dimension (FD) features are calculated for acoustic feature extraction. The method is thus able to efficiently learn the intrinsic characteristics of auditory emotions, with the utterance features learned from the FDs of each sub-band. Experimental results show the proposed method provides comparable performance for AER.
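The abstract does not specify which fractal dimension estimator is used; a common choice for 1-D signals such as audio sub-bands is Higuchi's method, sketched here purely as an illustration:

```python
import numpy as np

def higuchi_fd(x, kmax=8):
    """Higuchi's estimate of the fractal dimension of a 1-D signal.
    FD is the slope of log curve-length vs. log(1/k) over scales k."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    log_k, log_l = [], []
    for k in range(1, kmax + 1):
        lm = []
        for m in range(k):  # one subsampled curve per offset m
            idx = np.arange(m, n, k)
            if len(idx) < 2:
                continue
            length = np.abs(np.diff(x[idx])).sum()
            # Higuchi normalization for this offset and scale
            length *= (n - 1) / ((len(idx) - 1) * k * k)
            lm.append(length)
        log_k.append(np.log(1.0 / k))
        log_l.append(np.log(np.mean(lm)))
    slope, _ = np.polyfit(log_k, log_l, 1)
    return slope
```

A smooth ramp has FD close to 1, while white noise approaches 2, which is why FD can separate smooth from rough (e.g., agitated) signal textures.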


Author(s):  
Ajeet Ram Pathak ◽  
Somesh Bhalsing ◽  
Shivani Desai ◽  
Monica Gandhi ◽  
Pranathi Patwardhan

2021 ◽  
Author(s):  
Harisu Abdullahi Shehu ◽  
William Browne ◽  
Hedwig Eisenbarth

Emotion recognition has become an increasingly important area of research due to the growing number of CCTV cameras in the past few years. Deep network-based methods have made impressive progress on emotion recognition tasks, achieving high performance on many datasets and their related competitions, such as the ImageNet challenge. However, deep networks are vulnerable to adversarial attacks: because of their homogeneous representation of knowledge across all images, a small change to the input image made by an adversary might cause a large decrease in the algorithm's accuracy. By detecting heterogeneous facial landmarks using the machine learning library Dlib, we hypothesize we can build robustness to such attacks. The residual neural network (ResNet) model was used as an example of a deep learning model. While the accuracy achieved by ResNet decreased by up to 22% under attack, our proposed approach showed strong resistance, with only a small (<0.3%) or no decrease in accuracy when the attack was launched on the data. Furthermore, the proposed approach required considerably less execution time than the ResNet model.
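One plausible reading of the landmark-based approach is to derive scale-invariant geometric features from Dlib's 68-point landmark layout; the particular key points and the inter-ocular normalization below are illustrative assumptions, not the paper's exact features:

```python
import numpy as np

def landmark_features(landmarks):
    """Scale- and translation-invariant features from 68 (x, y) facial
    landmarks: pairwise distances between a few key points, normalized
    by inter-ocular distance (Dlib's 68-point indexing assumed)."""
    pts = np.asarray(landmarks, dtype=float)
    left_eye = pts[36:42].mean(axis=0)   # left-eye landmark cluster
    right_eye = pts[42:48].mean(axis=0)  # right-eye landmark cluster
    iod = np.linalg.norm(left_eye - right_eye)  # inter-ocular distance
    key = pts[[30, 48, 54, 8]]  # nose tip, mouth corners, chin
    # upper-triangular pairwise distances between the key points
    d = [np.linalg.norm(key[i] - key[j])
         for i in range(len(key)) for j in range(i + 1, len(key))]
    return np.array(d) / iod
```

Because every distance and the normalizer scale together, small pixel-level adversarial perturbations that barely move the detected landmarks barely move these features.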


Recently, plant phenotyping has gained the attention of many researchers because it plays a vital role in enhancing agricultural productivity. The Indian economy depends heavily on agriculture, which elevates the importance of early disease detection in crops within agricultural fields. To address this problem, several researchers have proposed computer vision and pattern recognition mechanisms that attempt to identify infected crops in the early stages. In this scenario, convolutional neural network (CNN) architectures have demonstrated exceptional performance compared with state-of-the-art mechanisms. This paper introduces an enhanced recurrent convolutional neural network (RCNN) architecture that improves prediction accuracy when detecting crop diseases in their early stages. Simulation studies show that the proposed model outperforms CNN and other state-of-the-art mechanisms.
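A recurrent convolutional layer, the building block an RCNN adds over a plain CNN, can be sketched as follows; the single-channel NumPy implementation and the three refinement steps are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2-D cross-correlation (what deep learning
    frameworks call convolution), single channel."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

def recurrent_conv_layer(x, k_ff, k_rec, steps=3):
    """Recurrent convolutional layer: the feed-forward response is
    refined over a few time steps by a recurrent convolution of the
    layer's own previous output."""
    h = np.maximum(conv2d_same(x, k_ff), 0.0)  # initial ReLU response
    for _ in range(steps):
        h = np.maximum(conv2d_same(x, k_ff) + conv2d_same(h, k_rec), 0.0)
    return h
```

With a zero recurrent kernel the layer degenerates to a plain convolutional layer; the recurrence is what lets each unit integrate growing spatial context, which is the usual motivation for RCNNs over CNNs.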


2021 ◽  
pp. 1-12
Author(s):  
Irfan Javid ◽  
Ahmed Khalaf Zager Alsaedi ◽  
Rozaida Binti Ghazali ◽  
Yana Mazwin ◽  
Muhammad Zulqarnain

In previous studies, various machine-driven decision support systems based on recurrent neural networks (RNNs) were ordinarily proposed for the detection of cardiovascular disease. However, the majority of these approaches are restricted to feature preprocessing. In this paper, we concentrate on both feature refinement and the removal of the predictive model's problems, e.g., underfitting and overfitting. By avoiding overfitting and underfitting, the model will demonstrate good performance on both the training and testing datasets. Overfitting of the training data is often triggered by inadequate network configuration and inappropriate features. We advocate using the Chi2 statistical model to remove irrelevant features while searching for the best-configured gated recurrent unit (GRU) with an exhaustive search strategy. The suggested hybrid technique, called Chi2 GRU, is tested against traditional ANN and GRU models, as well as different progressive machine learning models and previously reported strategies for cardiopathy prediction. The prediction accuracy of the proposed model is 92.17%. In contrast to formerly stated approaches, the obtained outcomes are promising. The study's results indicate that medical practitioners can use the proposed diagnostic method to reliably predict heart disease.
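The Chi2 feature-selection step can be sketched as follows for categorical features; the contingency-table formulation below is the standard chi-squared score, not necessarily the paper's exact variant:

```python
import numpy as np

def chi2_score(feature, labels):
    """Chi-squared statistic between one categorical feature column
    and the class label, from their contingency table."""
    f_vals, l_vals = np.unique(feature), np.unique(labels)
    obs = np.array([[np.sum((feature == f) & (labels == l)) for l in l_vals]
                    for f in f_vals], dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row @ col / obs.sum()          # expected counts if independent
    return ((obs - exp) ** 2 / exp).sum()

def select_top_k(X, y, k):
    """Keep the k feature columns with the highest chi-squared score."""
    scores = np.array([chi2_score(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]
```

A feature perfectly aligned with the label scores high; one independent of the label scores near zero, so irrelevant columns drop out before the GRU configuration search.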


2021 ◽  
Vol 19 (02) ◽  
pp. 2150006
Author(s):  
Fatemeh Nazem ◽  
Fahimeh Ghasemi ◽  
Afshin Fassihi ◽  
Alireza Mehri Dehnavi

Binding site prediction for new proteins is important in structure-based drug design. The identified binding sites may be helpful in the development of treatments for new viral outbreaks when there is no information available about their pockets, with COVID-19 being a case in point. Identification of the pockets using computational methods, as an alternative approach, has recently attracted much interest. In this study, binding site prediction is viewed as a semantic segmentation problem. An improved 3D version of the U-Net model based on the Dice loss function is utilized to predict the binding sites accurately. The performance of the proposed model on the independent test datasets and SARS-CoV-2 shows that the segmentation model can predict binding sites with a more accurate shape than the recently published deep learning model DeepSite. Therefore, the model may help predict the binding sites of proteins and could be used in drug design for novel proteins.
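The Dice loss that drives the segmentation model has a standard soft form; this NumPy sketch shows the binary case (the paper's 3-D U-Net would apply the same formula over a predicted volume):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for a binary segmentation mask or volume.
    pred: predicted probabilities; target: {0, 1} ground truth."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    inter = (pred * target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    return 1.0 - dice
```

A perfect prediction gives loss near 0 and a fully disjoint one near 1; unlike voxel-wise cross-entropy, the overlap ratio is insensitive to the huge non-pocket background, which is why Dice suits sparse binding-site masks.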


Information ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 145 ◽  
Author(s):  
Zhenglong Xiang ◽  
Xialei Dong ◽  
Yuanxiang Li ◽  
Fei Yu ◽  
Xing Xu ◽  
...  

Most of the existing research papers study the emotion recognition of Minnan songs from the perspectives of music analysis theory and music appreciation. However, these investigations do not explore any possibility of carrying out automatic emotion recognition of Minnan songs. In this paper, we propose a model that consists of four main modules to classify the emotion of Minnan songs by using bimodal data—song lyrics and audio. In the proposed model, an attention-based Long Short-Term Memory (LSTM) neural network is applied to extract lyrical features, and a Convolutional Neural Network (CNN) is used to extract the audio features from the spectrum. Then, the two kinds of extracted features are combined by multimodal compact bilinear pooling, and finally, the combined features are input to the classifying module to determine the song emotion. We designed three experiment groups to investigate the classification performance of combinations of the four main parts, the comparison of the proposed model with current approaches, and the influence of a few key parameters on the performance of emotion recognition. The results show that the proposed model exhibits better performance than all other experimental groups. The accuracy, precision, and recall of the proposed model exceed 0.80 with an appropriate combination of parameters.
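Multimodal compact bilinear pooling approximates the outer product of the two modality vectors via count sketches combined in the frequency domain; the sketch dimension of 64 and the fixed seed below are illustrative assumptions:

```python
import numpy as np

def count_sketch(v, h, s, d):
    """Project v into d dimensions using hash indices h and signs s."""
    out = np.zeros(d)
    np.add.at(out, h, s * v)  # scatter-add signed components
    return out

def mcb_pool(a, b, d=64, seed=0):
    """Multimodal compact bilinear pooling of two feature vectors:
    circular convolution of their count sketches, computed via FFT."""
    rng = np.random.default_rng(seed)
    ha = rng.integers(0, d, size=len(a))
    sa = rng.choice([-1.0, 1.0], size=len(a))
    hb = rng.integers(0, d, size=len(b))
    sb = rng.choice([-1.0, 1.0], size=len(b))
    fa = np.fft.rfft(count_sketch(a, ha, sa, d))
    fb = np.fft.rfft(count_sketch(b, hb, sb, d))
    return np.fft.irfft(fa * fb, n=d)
```

The result is bilinear in its inputs (doubling one modality's features doubles the pooled vector), mirroring the outer product it approximates at a fraction of its dimensionality.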

