A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition

Sensors ◽  
2019 ◽  
Vol 20 (1) ◽  
pp. 183 ◽  
Author(s):  
Mustaqeem ◽  
Soonil Kwon

Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) using a microphone sensor. Quantifiable emotion recognition from speech signals captured by such sensors is an emerging area of research in HCI, with applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers, where the speaker's emotional state must be determined from an individual's speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art, and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain-nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps. Local hidden patterns are learned in convolutional layers with special strides to down-sample the feature maps rather than using pooling layers, and global discriminative features are learned in fully connected layers. A softmax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, with the model size reduced by 34.5 MB. This demonstrates the effectiveness and significance of the proposed SER technique and its applicability to real-world applications.
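The stride-based down-sampling described in the abstract can be sketched in plain Python. The signal and kernel below are hypothetical toy values; the point is that a convolution with stride 2 halves the output length, so no separate pooling layer is needed:

```python
def conv1d(x, kernel, stride=1):
    """Valid 1D convolution (cross-correlation) with a stride.

    With stride > 1 the layer down-samples its own output, which is
    the role a pooling layer would otherwise play.
    """
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(x) - k + 1, stride)
    ]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # toy input
kernel = [0.5, 0.5]                                  # toy averaging kernel

same_res = conv1d(signal, kernel, stride=1)   # 7 outputs, no down-sampling
halved = conv1d(signal, kernel, stride=2)     # 4 outputs, feature map halved
```

In a real DSCNN the kernels are learned and the inputs are 2D spectrogram patches, but the stride mechanics are the same.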

2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Guihua Wen ◽  
Huihui Li ◽  
Jubing Huang ◽  
Danyang Li ◽  
Eryang Xun

Human emotions can now be recognized from speech signals using machine learning methods; however, these methods are challenged by low recognition accuracy in real applications due to a lack of rich representation ability. Deep belief networks (DBN) can automatically discover multiple levels of representation in speech signals. To make full use of this advantage, this paper presents an ensemble of random deep belief networks (RDBN) for speech emotion recognition. It first extracts the low-level features of the input speech signal and then uses them to construct many random subspaces. Each random subspace is fed to a DBN to yield higher-level features, which serve as the input of a classifier that outputs an emotion label. All output emotion labels are then fused through majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN achieves better accuracy than the compared methods for speech emotion recognition.
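A minimal sketch of the random-subspace ensemble with majority voting, assuming a trivial nearest-centroid model as a stand-in for the per-subspace DBN feature learner plus classifier; the feature values and labels are hypothetical:

```python
import random
from collections import Counter

def nearest_centroid_fit(X, y):
    # stand-in for the per-subspace DBN + classifier of RDBN
    cents = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def nearest_centroid_predict(cents, x):
    return min(cents, key=lambda l: sum((a - b) ** 2 for a, b in zip(x, cents[l])))

def rdbn_like_ensemble(X, y, x_new, n_subspaces=5, subspace_dim=2, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_subspaces):
        dims = rng.sample(range(len(X[0])), subspace_dim)   # a random subspace
        Xs = [[row[d] for d in dims] for row in X]
        model = nearest_centroid_fit(Xs, y)
        votes.append(nearest_centroid_predict(model, [x_new[d] for d in dims]))
    return Counter(votes).most_common(1)[0][0]              # majority vote

# toy 4-dimensional "low-level features" for two emotions (hypothetical values)
X = [[0.1, 0.0, 0.2, 0.1], [0.0, 0.2, 0.1, 0.0], [0.2, 0.1, 0.0, 0.2],
     [5.0, 5.1, 4.9, 5.0], [5.2, 4.8, 5.0, 5.1], [4.9, 5.0, 5.2, 4.8]]
y = ["neutral"] * 3 + ["angry"] * 3
label = rdbn_like_ensemble(X, y, [5.0, 5.0, 5.0, 5.0])
```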


2014 ◽  
Vol 543-547 ◽  
pp. 2192-2195 ◽  
Author(s):  
Chen Chen Huang ◽  
Wei Gong ◽  
Wen Long Fu ◽  
Dong Yu Feng

As the most important medium of communication in human life, speech carries abundant emotional information. In recent years, how to automatically recognize the speaker's emotional state from speech has been attracting extensive attention from researchers in various fields. In this paper, we study a method of speech emotion recognition. We collected a total of 360 sentences from four speakers expressing happiness, anger, surprise, and sadness, and extracted eight emotional characteristics from these voice data. A contribution analysis method is proposed to determine the weights of the emotion characteristic parameters. We also used weighted Euclidean distance template matching to identify the speech emotion, achieving an average emotion recognition rate of more than 80%.
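The weighted Euclidean template matching step might look like the following sketch, where the templates, weights, and feature values are all hypothetical placeholders for the paper's eight extracted characteristics and contribution-analysis weights:

```python
import math

# hypothetical emotion templates (mean feature vectors) and contribution
# weights; three features per template for brevity
TEMPLATES = {
    "happiness": [3.0, 0.8, 1.2],
    "anger":     [4.5, 1.6, 0.4],
    "sadness":   [1.5, 0.3, 2.0],
}
WEIGHTS = [0.5, 0.3, 0.2]   # larger weight = larger contribution to emotion

def weighted_euclidean(x, template, weights):
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, template)))

def classify(x):
    # pick the emotion whose template has the smallest weighted distance
    return min(TEMPLATES, key=lambda e: weighted_euclidean(x, TEMPLATES[e], WEIGHTS))
```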


Author(s):  
Pavitra Patel ◽  
A. A. Chaudhari ◽  
M. A. Pund ◽  
D. H. Deshmukh

<p>Speech emotion recognition is an important issue affecting human-machine interaction. Automatic recognition of human emotion in speech aims at recognizing the underlying emotional state of a speaker from the speech signal. Gaussian mixture models (GMMs) and the minimum error rate classifier (i.e., the Bayesian optimal classifier) are popular and effective tools for speech emotion recognition. Typically, GMMs are used to model the class-conditional distributions of acoustic features, and their parameters are estimated by the expectation maximization (EM) algorithm from a training data set. In this paper, we introduce a boosting algorithm for reliably and accurately estimating the class-conditional GMMs. The resulting algorithm is named the Boosted-GMM algorithm. Our speech emotion recognition experiments show that emotion recognition rates are effectively and significantly boosted by the Boosted-GMM algorithm as compared to the EM-GMM algorithm.<br />During interaction, human beings have feelings that they want to convey to their communication partner, whether that partner is a human or a machine. This work is concerned with recognizing the emotions of human beings from their speech signals.<br />Emotion recognition from a speaker's speech is very difficult for the following reasons. Acoustic variability is introduced by differing sentences, speakers, speaking styles, and speaking rates. The same utterance may express different emotions, so it is very difficult to differentiate such portions of an utterance. Another problem is that the expression of emotion depends on the speaker and his or her culture and environment. As culture and environment change, speaking style changes as well, which is a further challenge for a speech emotion recognition system.</p>
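A minimal sketch of the minimum-error-rate (Bayes) decision with class-conditional densities, assuming single-Gaussian densities in place of the paper's multi-component GMMs and omitting the boosting re-weighting; the feature values are hypothetical:

```python
import math

def gauss_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fit_class_densities(X, y):
    # class priors and single-Gaussian class-conditional densities;
    # a real Boosted-GMM would fit multi-component mixtures with EM
    params = {}
    for label in set(y):
        vals = [x for x, l in zip(X, y) if l == label]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals) or 1e-6
        params[label] = (mean, var, len(vals) / len(X))
    return params

def bayes_classify(params, x):
    # minimum-error-rate decision: argmax of prior * likelihood
    return max(params, key=lambda l: math.log(params[l][2])
               + gauss_logpdf(x, params[l][0], params[l][1]))

# hypothetical 1-D acoustic feature values for two emotions
X = [1.0, 1.2, 0.9, 1.1, 4.8, 5.0, 5.2, 5.1]
y = ["calm"] * 4 + ["excited"] * 4
model = fit_class_densities(X, y)
```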


Mathematics ◽  
2020 ◽  
Vol 8 (12) ◽  
pp. 2133
Author(s):  
Mustaqeem ◽  
Soonil Kwon

Artificial intelligence, deep learning, and machine learning are the dominant approaches for making a system smarter. Nowadays, a smart speech emotion recognition (SER) system is a basic necessity and an emerging research area of digital audio signal processing. SER plays an important role in many applications related to human–computer interaction (HCI). Existing state-of-the-art SER systems have quite low prediction performance, which needs improvement to make them feasible for real-time commercial applications. The key reasons for the low accuracy and poor prediction rate are the scarcity of data and the model configuration, which is the most challenging task in building a robust machine learning technique. In this paper, we address the limitations of existing SER systems and propose a unique artificial intelligence (AI) based system structure for SER that utilizes hierarchical blocks of convolutional long short-term memory (ConvLSTM) with sequence learning. We designed four blocks of ConvLSTM, called local feature learning blocks (LFLBs), to extract local emotional features in a hierarchical correlation. The ConvLSTM layers are adopted for input-to-state and state-to-state transitions to extract spatial cues through convolution operations. We placed four LFLBs to extract the spatiotemporal cues in hierarchical correlational form from speech signals using a residual learning strategy. Furthermore, we utilized a novel sequence learning strategy to extract global information and adaptively adjust the relevant global feature weights according to the correlation of the input features. Finally, we used the center loss function together with the softmax loss to produce the class probabilities.
The center loss improves the final classification results, ensures accurate prediction, and plays a conspicuous role in the whole proposed SER scheme. We tested the proposed system on two standard speech corpora, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and obtained 75% and 80% recognition rates, respectively.
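The joint objective of softmax loss plus center loss can be sketched as follows; the class centers, the λ weight, and the toy logits and features below are hypothetical values, not the paper's:

```python
import math

def softmax_ce(logits, label):
    # cross-entropy of softmax(logits) against the true class index
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def center_loss(feature, center):
    # squared distance of the deep feature to its class center
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def total_loss(logits, label, feature, centers, lam=0.5):
    # softmax loss pulls classes apart; center loss pulls same-class
    # features together, sharpening intra-class compactness
    return softmax_ce(logits, label) + lam * center_loss(feature, centers[label])

centers = {0: [1.0, 1.0], 1: [-1.0, -1.0]}   # hypothetical learned class centers
```

A feature sitting exactly on its class center incurs only the softmax term; moving the feature away adds the weighted center penalty.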


Entropy ◽  
2019 ◽  
Vol 21 (5) ◽  
pp. 479 ◽  
Author(s):  
Noushin Hajarolasvadi ◽  
Hasan Demirel

Detecting human intentions and emotions helps improve human–robot interactions. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on the analysis of speech signals. First, we split each speech signal into overlapping frames of the same length. Next, we extract an 88-dimensional vector of audio features, including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity, for each of the respective frames. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, by applying k-means clustering on the extracted features of all frames of each audio signal, we select the k most discriminant frames, namely keyframes, to summarize the speech signal. Then, the sequence of the corresponding spectrograms of the keyframes is encapsulated in a 3D tensor. These tensors are used to train and test a 3D convolutional neural network (CNN) using a 10-fold cross-validation approach. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE’05 databases. The results are superior to the state-of-the-art methods reported in the literature.
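The keyframe selection step (k-means over per-frame features, then picking the frame nearest each centroid) might be sketched like this, with naive first-k initialization and hypothetical 2-D frame features standing in for the 88-dimensional vectors:

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_keyframes(frames, k, iters=20):
    """Cluster per-frame feature vectors with k-means and return the
    indices of the frames closest to each centroid: the "keyframes"
    that summarize the signal. Naive init keeps the sketch short."""
    centroids = [list(frames[i]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in frames:
            groups[min(range(k), key=lambda j: dist2(f, centroids[j]))].append(f)
        for j, g in enumerate(groups):
            if g:   # keep the old centroid if its cluster emptied
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return sorted({min(range(len(frames)), key=lambda i: dist2(frames[i], centroids[j]))
                   for j in range(k)})

# two clearly separated clusters of toy frame features
frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
keyframes = kmeans_keyframes(frames, 2)
```

In the paper's pipeline the spectrograms of the selected keyframes are then stacked into the 3D tensor fed to the CNN.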


Symmetry ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1296
Author(s):  
Wenfang Ma ◽  
Ying Hu ◽  
Hao Huang

Pitch estimation is an essential step in many audio signal processing applications. In this paper, we propose a data-driven pitch estimation network, the Dual Attention Network (DA-Net), which operates directly on the time-domain samples of monophonic music. DA-Net includes six Dual Attention Modules (DA-Modules), each combining two kinds of attention: element-wise and channel-wise. DA-Net performs element-wise and channel-wise attention operations on convolutional features, which reflects the idea of "symmetry". DA-Modules can model the semantic interdependencies between element-wise and channel-wise features. In a DA-Module, the element-wise attention mechanism is realized by a Convolutional Gated Linear Unit (ConvGLU), and the channel-wise attention mechanism is realized by a Squeeze-and-Excitation (SE) block. We explored three combination modes (serial, parallel, and tightly coupled) of element-wise and channel-wise attention. Element-wise attention selectively emphasizes useful features by re-weighting the features at all positions. Channel-wise attention learns to use global information to selectively emphasize informative feature maps and suppress less useful ones. DA-Net therefore adaptively integrates local features with their global dependencies. The outputs of DA-Net are fed into a fully connected layer to generate a 360-dimensional vector corresponding to 360 pitches. We trained the proposed network on the iKala and MDB-stem-synth datasets. According to the experimental results, our proposed dual attention network with the tightly coupled mode achieved the best performance.
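The SE-block channel attention can be sketched in plain Python; the weight matrices here are hypothetical C×C layers with no bottleneck, to keep the example short:

```python
import math

def se_block(feature_maps, w1, w2):
    """Squeeze-and-Excitation channel attention over C feature maps.
    w1, w2 are the two FC layers' weight matrices (hypothetical C x C
    shapes, omitting the usual reduction bottleneck)."""
    c = len(feature_maps)
    # squeeze: global average pooling per channel
    z = [sum(ch) / len(ch) for ch in feature_maps]
    # excitation: FC -> ReLU -> FC -> sigmoid
    h = [max(0.0, sum(w1[i][j] * z[j] for j in range(c))) for i in range(c)]
    s = [1.0 / (1.0 + math.exp(-sum(w2[i][j] * h[j] for j in range(c))))
         for i in range(c)]
    # rescale: weight every channel by its attention score
    return [[s[i] * v for v in feature_maps[i]] for i in range(c)]

maps = [[1.0, 1.0], [2.0, 2.0]]          # two toy channels
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
out = se_block(maps, identity, zeros)    # zero excitation -> sigmoid(0) = 0.5
```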


2019 ◽  
Author(s):  
Bagus Tris Atmaja

Emotion recognition can be performed automatically from many modalities. This paper presents categorical speech emotion recognition using speech features and word embeddings. Text features can be combined with speech features to improve emotion recognition accuracy, and both can be obtained from speech via automatic speech recognition. Here, we use the speech segments of an utterance from which the acoustic features are extracted for speech emotion recognition. Word embeddings are used as the input feature for text emotion recognition, and a combination of both features is proposed for performance improvement. Two unidirectional LSTM layers are used for text, and fully connected layers are applied for acoustic emotion recognition. The two networks are then merged by fully connected networks to produce one of four predicted emotion categories. The results show that the combination of speech and text achieves higher accuracy, i.e., 75.49%, compared to speech-only (71.34%) or text-only (66.09%) emotion recognition. This result also outperforms previously proposed methods using the same dataset on the same and/or similar modalities.
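The late-fusion step, merging the acoustic and text branch outputs through a fully connected layer with a softmax over four emotions, might be sketched as below; the branch outputs, weights, and emotion ordering are hypothetical:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def fuse_and_classify(acoustic_feat, text_feat, W, b):
    # concatenate the two branch outputs, then one FC layer + softmax
    merged = list(acoustic_feat) + list(text_feat)
    logits = [sum(w * x for w, x in zip(row, merged)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)

EMOTIONS = ["neutral", "happy", "sad", "angry"]   # hypothetical ordering

# hypothetical 2-D branch outputs and a 4 x 4 FC weight matrix
acoustic = [0.9, 0.1]
text = [0.8, 0.2]
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [0.5, 0.5, 0.5, 0.5],
     [0.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0, 0.0]
probs = fuse_and_classify(acoustic, text, W, b)
```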


2021 ◽  
Vol 38 (6) ◽  
pp. 1861-1873
Author(s):  
Kogila Raghu ◽  
Manchala Sadanandam

Automatic Speech Recognition (ASR) is a popular research area with many variations in human behaviour and interaction. Human beings use speech for communication and conversation. During a conversation, the information or message of the speech utterances is transferred, including speaker traits such as emotion, physiological characteristics, and environmental information. A tremendous number of complex, encoded signals are involved, yet human intelligence decodes them quickly. Many academics in the domain of Human Computer Interaction (HCI) are working to automate speech generation and the extraction of speech attributes and meaning. For example, ASR can regulate the use of voice commands and maintain dictation discipline while also recognizing and verifying the speech of the speaker. Owing to accent and nativity traits, the speaker's emotional state can be discerned from the speech. In this paper, we discuss the human speech production system, research problems in speech processing, and the motivation, challenges, and objectives of speech emotion recognition (SER) systems, and we thoroughly review the work done so far on Telugu speech emotion databases. We also describe our own database, DETL (Database for Emotions in Telugu Language), and the Audacity software used to create it.


Electronics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 1725
Author(s):  
Gintautas Tamulevičius ◽  
Gražina Korvel ◽  
Anil Bora Yayak ◽  
Povilas Treigys ◽  
Jolita Bernatavičienė ◽  
...  

In this research, a study of cross-linguistic speech emotion recognition is performed. For this purpose, emotional data of different languages (English, Lithuanian, German, Spanish, Serbian, and Polish) are collected, resulting in a cross-linguistic speech emotion dataset of more than 10,000 emotional utterances. Despite the bi-modal character of the databases gathered, our focus is on the acoustic representation only. The assumption is that the speech audio signal carries sufficient emotional information to detect and retrieve it. Several two-dimensional acoustic feature spaces, such as cochleagrams, spectrograms, mel-cepstrograms, and a fractal dimension-based space, are employed as representations of speech emotional features. A convolutional neural network (CNN) is used as a classifier. The results show the superiority of cochleagrams over the other feature spaces utilized. In the CNN-based speaker-independent cross-linguistic speech emotion recognition (SER) experiment, an accuracy of over 90% is achieved, which is close to the monolingual case of SER.


2013 ◽  
Vol 38 (4) ◽  
pp. 465-470 ◽  
Author(s):  
Jingjie Yan ◽  
Xiaolan Wang ◽  
Weiyi Gu ◽  
LiLi Ma

Speech emotion recognition is deemed a meaningful and intractable issue across a number of domains, including sentiment analysis, computer science, and pedagogy. In this study, we investigate speech emotion recognition based on the sparse partial least squares regression (SPLSR) approach in depth. We use sparse partial least squares regression to implement feature selection and dimensionality reduction on the whole set of acquired speech emotion features. By exploiting the SPLSR method, the coefficients of redundant and uninformative speech emotion features are shrunk to zero, while useful and informative features are retained and passed on to the subsequent classification step. A number of tests on the Berlin database reveal that the recognition rate of the SPLSR method reaches 79.23% and is superior to the other compared dimensionality reduction methods.
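The shrink-to-zero behaviour described above resembles soft-thresholding of the PLS loadings; a short sketch with hypothetical loading values and threshold:

```python
def soft_threshold(loadings, lam):
    """Shrink loadings toward zero; values inside [-lam, lam] become
    exactly zero -- the mechanism by which sparse PLS drops redundant
    features while keeping informative ones."""
    return [0.0 if abs(v) <= lam else (v - lam if v > 0 else v + lam)
            for v in loadings]

loadings = [0.9, -0.05, 0.02, -0.7, 0.1]   # hypothetical feature loadings
sparse = soft_threshold(loadings, 0.1)
selected = [i for i, v in enumerate(sparse) if v != 0.0]   # surviving features
```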

