Emotion Recognition from Speech Using the Bag-of-Visual Words on Audio Segment Spectrograms

Technologies ◽  
2019 ◽  
Vol 7 (1) ◽  
pp. 20 ◽  
Author(s):  
Evaggelos Spyrou ◽  
Rozalia Nikopoulou ◽  
Ioannis Vernikos ◽  
Phivos Mylonas

Monitoring and understanding a human’s emotional state plays a key role in current and forthcoming computational technologies. At the same time, such monitoring and analysis should be as unobtrusive as possible, since the digital world has been smoothly adopted into everyday life. In this framework, and within the domain of assessing humans’ affective state during their educational training, the most common approach is to use sensory equipment that allows observation without any kind of direct contact. Thus, in this work, we focus on human emotion recognition from audio stimuli (i.e., human speech) using a novel approach based on a computer-vision-inspired methodology, namely the bag-of-visual-words method, applied to spectrograms of audio segments. The spectrogram is treated as a visual representation of the considered audio segment and may be analyzed with well-known traditional computer vision techniques, such as extraction of speeded-up robust features (SURF), construction of a visual vocabulary, quantization into a set of visual words, and image histogram construction. As a last step, support vector machine (SVM) classifiers are trained on this representation. Finally, to further assess the generalization of the proposed approach, we utilize publicly available datasets in several human languages to perform cross-language experiments, on both acted and real-life speech.
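The BoVW pipeline described above (local descriptors → vocabulary → quantization → histogram → SVM) can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for SURF descriptors extracted from spectrogram images, and the vocabulary size and class labels are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for SURF descriptors: each "spectrogram" yields a
# variable number of 64-dimensional local descriptors.
def fake_descriptors(n):
    return rng.normal(size=(n, 64))

train_desc = [fake_descriptors(int(rng.integers(30, 60))) for _ in range(20)]
labels = np.array([i % 2 for i in range(20)])  # two emotion classes (toy)

# 1) Build the visual vocabulary by clustering all training descriptors.
k = 16
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
kmeans.fit(np.vstack(train_desc))

# 2) Quantize each image's descriptors into visual words and build a
#    normalized word histogram (the BoVW representation of the image).
def bovw_histogram(desc):
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

X = np.array([bovw_histogram(d) for d in train_desc])

# 3) Train an SVM classifier on the histograms.
clf = SVC(kernel="rbf").fit(X, labels)
```

In a real system, the descriptors would come from a SURF detector run on the spectrogram image, and the vocabulary size would be tuned on held-out data.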

2017 ◽  
Vol 31 (2) ◽  
pp. 310-319 ◽  
Author(s):  
Anton Ustyuzhanin ◽  
Karl-Heinz Dammer ◽  
Antje Giebel ◽  
Cornelia Weltzien ◽  
Michael Schirrmann

Common ragweed is a plant species causing allergic and asthmatic symptoms in humans. To control its propagation, an early identification system is needed. However, due to its similar appearance to mugwort, proper differentiation between these two weed species is important. Therefore, we propose a method to discriminate common ragweed and mugwort leaves in digital images using bag of visual words (BoVW). BoVW is an object-based image classification approach that has gained acceptance in many areas of science. We compared speeded-up robust features (SURF) and grid sampling for keypoint selection. The image vocabulary was built using K-means clustering, and the image classifier was trained using support vector machines. To check the robustness of the classifier, specific model runs were conducted with and without damaged leaves in the training dataset. The results showed that the BoVW model allows the discrimination between common ragweed and mugwort leaves with high accuracy. Based on SURF keypoints, with 50% of the 788 images in total as training data, we achieved 100% correct recognition of the two plant species. Grid sampling resulted in slightly lower recognition accuracy (98 to 99%). In addition, the classification based on SURF was up to 31 times faster.
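The grid-sampling alternative to SURF keypoint detection mentioned above simply places keypoints at regular intervals, ignoring image content, so that every leaf region contributes descriptors. A minimal sketch (the step size here is an arbitrary example, not a value from the study):

```python
import numpy as np

# Dense grid sampling: keypoint centers at a fixed step, offset by half a
# step so patches stay inside the image borders.
def grid_keypoints(height, width, step):
    ys = np.arange(step // 2, height, step)
    xs = np.arange(step // 2, width, step)
    return [(int(y), int(x)) for y in ys for x in xs]

pts = grid_keypoints(240, 320, 40)  # 6 rows x 8 columns of keypoints
```

A detector like SURF instead selects keypoints where the image response is strong, which explains the speed difference reported above: fewer, more informative descriptors per image.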


Online monitoring of electrical substations with computer vision is based on image-processing algorithms that perform visual analysis. This paper presents the classification of ceramic and glass insulators using Bag of Visual Words and the detection of these insulators by point feature matching. The training image dataset is used for categorization by forming a visual vocabulary, while a new unlabeled image from the test dataset is classified using the nearest-neighbor method on its feature descriptors. For detection, we use speeded-up robust features (SURF) to locate the insulator in a cluttered scene image. Matching is performed between the test and reference images, and a decision is made based on similar features. We conducted experiments on insulators to verify the effectiveness of the proposed method, which can be used in security, surveillance, and inspection systems.
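The nearest-neighbor classification step described above can be sketched as follows. This is a toy illustration under assumed data: random vectors stand in for the BoVW histograms of training images, and Euclidean distance is used as the similarity measure (the paper does not specify the metric).

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in BoVW histograms for labeled training images of two classes.
train_hists = rng.random((10, 32))
train_hists /= train_hists.sum(axis=1, keepdims=True)
train_labels = np.array([0] * 5 + [1] * 5)  # 0 = ceramic, 1 = glass (toy)

# Nearest-neighbor classification: assign the unlabeled image the label
# of the training histogram at the smallest Euclidean distance.
def nn_classify(test_hist):
    dists = np.linalg.norm(train_hists - test_hist, axis=1)
    return int(train_labels[np.argmin(dists)])

pred = nn_classify(train_hists[3])  # a training sample maps to itself
```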


Sensors ◽  
2020 ◽  
Vol 20 (19) ◽  
pp. 5559
Author(s):  
Minji Seo ◽  
Myungho Kim

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested on different datasets, they show reduced performance. Cross-corpus SER research identifies speech emotion across different corpora and languages, and recent work has aimed to improve generalization. To improve cross-corpus SER performance, we pretrained on the log-mel spectrograms of the source dataset using our designed visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules. To train on the target dataset, we extracted a feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps the VACNN learn both global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method shows an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset, respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% over existing state-of-the-art cross-corpus SER approaches.
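One plausible reading of the global/local combination above is that the BOVW frequency histogram is fused with the CNN embedding before classification. The sketch below illustrates only that fusion idea under stated assumptions: the paper does not give its fusion details, and the embedding size, vocabulary size, and random vectors here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical features for one utterance's log-mel spectrogram:
# a global embedding from the attention CNN (stand-in values) ...
cnn_feat = rng.normal(size=128)

# ... and the local BOVW view: quantized patch ids over a 32-word
# vocabulary, turned into a normalized frequency histogram.
word_ids = rng.integers(0, 32, size=200)
bovw_hist = np.bincount(word_ids, minlength=32).astype(float)
bovw_hist /= bovw_hist.sum()

# Fuse global and local information into one descriptor for a classifier.
fused = np.concatenate([cnn_feat, bovw_hist])
```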


2018 ◽  
Vol 10 (10) ◽  
pp. 1530 ◽  
Author(s):  
Michael Pflanz ◽  
Henning Nordmeyer ◽  
Michael Schirrmann

Weed detection from aerial images is a great challenge for generating field maps for site-specific plant protection. The requirements might be met with low-altitude flights of unmanned aerial vehicles (UAVs), which provide ground resolutions adequate for differentiating even single weeds accurately. The following study proposed and tested an image classifier based on a Bag of Visual Words (BoVW) framework for mapping weed species, using a small unmanned aircraft system (UAS) with a commercial camera on board at low flying altitudes. The image classifier was trained with support vector machines after building a visual dictionary of local features from many collected UAS images. Window-based processing of the models was used for mapping the weed occurrences in the UAS imagery. The UAS flight campaign was carried out over a weed-infested wheat field, and images were acquired at flight altitudes between 1 and 6 m. From the UAS images, 25,452 weed plants were annotated at the species level, along with wheat and soil as background classes, for training and validation of the models. The results showed that the BoVW model allowed the discrimination of single plants with high accuracy for Matricaria recutita L. (88.60%), Papaver rhoeas L. (89.08%), Viola arvensis M. (87.93%), and winter wheat (94.09%) within the generated maps. Regarding site-specific weed control, the classified UAS images would enable the selection of the right herbicide based on the distribution of the predicted weed species.
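The window-based processing used for mapping can be sketched as a sliding window whose per-window prediction is written into a class map. This is a schematic under assumed parameters: a trivial brightness threshold stands in for the trained BoVW+SVM classifier, and the window and step sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in for the trained per-window classifier: here the "class"
# is just the thresholded mean brightness of the window.
def classify_window(window):
    return int(window.mean() > 0.5)

# Slide a window over the image and record one prediction per position,
# producing a coarse class map of the field image.
def window_map(image, win, step):
    rows = list(range(0, image.shape[0] - win + 1, step))
    cols = list(range(0, image.shape[1] - win + 1, step))
    out = np.zeros((len(rows), len(cols)), dtype=int)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            out[i, j] = classify_window(image[r:r + win, c:c + win])
    return out

image = rng.random((128, 128))          # stand-in UAS image tile
weed_map = window_map(image, win=32, step=32)
```

In the real pipeline, each window would be converted to a BoVW histogram and classified into a weed species, wheat, or soil before being written into the map.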


Author(s):  
Yuanyuan Zuo ◽  
Bo Zhang

The sparse representation based classification algorithm has been used to solve the problem of human face recognition, but the image databases used were restricted to human frontal faces with only slight illumination and expression changes. This paper applies the sparse representation based algorithm to the problem of generic image classification, with a certain degree of intra-class variation and background clutter. Experiments are conducted with the sparse representation based algorithm and support vector machine (SVM) classifiers on 25 object categories selected from the Caltech101 dataset. Experimental results show that, without time-consuming parameter optimization, the sparse representation based algorithm achieves performance comparable to SVM. The experiments also demonstrate that the algorithm is robust to a certain degree of background clutter and intra-class variation with the bag-of-visual-words representation. The sparse representation based algorithm can thus be applied to generic image classification tasks when an appropriate image feature is used.
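The core of sparse representation based classification is to express a test sample as a sparse combination of training samples and assign the class whose atoms leave the smallest reconstruction residual. A minimal sketch, assuming synthetic two-class data and using orthogonal matching pursuit as the sparse solver (the original work may use a different l1 solver):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(3)

# Dictionary A: columns are training feature vectors, grouped by class.
d, n_per_class = 50, 10
A0 = rng.normal(loc=0.0, size=(d, n_per_class))   # class 0 atoms
A1 = rng.normal(loc=2.0, size=(d, n_per_class))   # class 1 atoms
A = np.hstack([A0, A1])
classes = np.array([0] * n_per_class + [1] * n_per_class)

def src_predict(y, A, classes, n_nonzero=5):
    # Sparse code of y over the training dictionary.
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero,
                                    fit_intercept=False)
    omp.fit(A, y)
    x = omp.coef_
    # Keep only each class's coefficients and compare residuals.
    residuals = []
    for c in np.unique(classes):
        xc = np.where(classes == c, x, 0.0)
        residuals.append(np.linalg.norm(y - A @ xc))
    return int(np.argmin(residuals))

test_sample = rng.normal(loc=2.0, size=d)  # drawn like class 1
pred = src_predict(test_sample, A, classes)
```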


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Huadong Sun ◽  
Xu Zhang ◽  
Xiaowei Han ◽  
Xuesong Jin ◽  
Zhijie Zhao

With the increasing scale of e-commerce, the complexity of image content makes commodity image classification face great challenges. Image feature extraction often determines the quality of the final classification results. At present, image feature extraction mainly involves low-level visual features and intermediate semantic features. The intermediate semantics of an image act as a bridge between its low-level features and its high-level semantics, which can make up for the semantic gap to a certain extent and offers strong robustness. As a typical intermediate semantic representation method, the bag-of-visual-words (BoVW) model has received extensive attention in image classification. However, the traditional BoVW model loses the location information of local features, and its local feature descriptors mainly capture the texture and shape information of local regions but lack the expression of color information. Therefore, this paper presents an improved bag-of-visual-words model with three improvements: (1) multiscale local region extraction; (2) local feature description by speeded-up robust features (SURF) and a color vector angle histogram (CVAH); and (3) a diagonal concentric rectangular pattern. Experimental results show that the three improvements to the BoVW model are complementary: compared with the traditional BoVW and the BoVW adopting SURF + SPM, the classification accuracy of the improved BoVW is increased by 3.60% and 2.33%, respectively.
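The color vector angle underlying the CVAH descriptor measures the angle between two RGB pixels viewed as 3-vectors; because scaling a color vector leaves the angle unchanged, the measure is insensitive to intensity changes. A minimal sketch of the angle computation (the paper's exact histogram binning and pixel-pairing scheme are not given here):

```python
import numpy as np

# Angle (in degrees) between two RGB pixels treated as 3-vectors.
# Proportional colors (same hue, different brightness) give angle ~0.
def color_angle(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    cos = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

a = color_angle([100, 50, 25], [200, 100, 50])  # same hue, brighter: ~0
b = color_angle([255, 0, 0], [0, 255, 0])       # red vs. green: 90
```

A CVAH would then histogram such angles between each pixel and its neighbors over a local region, complementing the texture/shape information carried by SURF.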


2020 ◽  
Author(s):  
Uéliton Freitas ◽  
Marcio Pache ◽  
Wesley Gonçalves ◽  
Edson Matsubara ◽  
José Sabino ◽  
...  

Color recognition is an important step for computer vision to be able to recognize objects under the most varied environmental conditions. Classifying objects by color using computer vision is a good alternative for changing color conditions such as those in an aquarium, where it is possible to use the resources of a smartphone with real-time image classification applications. This paper presents experimental results on the use of five different feature extraction techniques for the problem of fish species identification. The feature extractors tested are the Bag of Visual Words (BoVW), the Bag of Colors (BoC), the Bag of Features and Colors (BoFC), the Bag of Colored Words (BoCW), and histograms in the HSV and RGB color spaces. The experiments were performed using a dataset, which is also a contribution of this work, containing 1120 images of fishes from 28 different species. The feature extractors were tested under three different supervised learning setups based on Decision Trees, K-Nearest Neighbors, and Support Vector Machines. Among the feature extraction techniques described, the best performance was obtained by BoC with a Support Vector Machine classifier, reaching an F-measure of 0.90 and an AUC of 0.983348 with a dictionary size of 2048.
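The Bag of Colors idea mirrors BoVW but clusters raw pixel colors instead of local descriptors: a color dictionary is learned once, and each image is described by the histogram of its pixels' nearest dictionary colors. A minimal sketch, assuming synthetic pixels and a small dictionary (the study used sizes up to 2048):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Toy "image": RGB pixels flattened to an (N, 3) array of colors.
pixels = rng.integers(0, 256, size=(1000, 3)).astype(float)

# 1) Learn the color dictionary by clustering pixel colors.
n_colors = 8
km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)

# 2) Describe an image by the normalized histogram of its color words.
def boc_histogram(img_pixels):
    words = km.predict(img_pixels)
    hist = np.bincount(words, minlength=n_colors).astype(float)
    return hist / hist.sum()

h = boc_histogram(pixels)
```

These histograms would then be fed to the Decision Tree, KNN, or SVM classifiers compared in the study.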


Sensors ◽  
2019 ◽  
Vol 19 (12) ◽  
pp. 2790 ◽  
Author(s):  
Saima Nazir ◽  
Muhammad Haroon Yousaf ◽  
Jean-Christophe Nebel ◽  
Sergio A. Velastin

Human action recognition (HAR) has emerged as a core research domain for video understanding and analysis, thus attracting many researchers. Although significant results have been achieved in simple scenarios, HAR is still a challenging task due to issues associated with view independence, occlusion, and inter-class variation observed in realistic scenarios. In previous research efforts, the classical bag of visual words approach, along with its variations, has been widely used. In this paper, we propose a Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) model for human action recognition without compromising the strengths of the classical bag of visual words approach. Expressions are formed based on the density of a spatio-temporal cube of a visual word. To handle inter-class variation, we use class-specific visual word representations for visual expression generation. In contrast to the Bag of Expressions (BoE) model, the formation of visual expressions is based on the density of spatio-temporal cubes built around each visual word, as constructing neighborhoods with a fixed number of neighbors could include non-relevant information, making a visual expression less discriminative in scenarios with occlusion and changing viewpoints. Thus, the proposed approach makes the model more robust to the occlusion and viewpoint challenges present in realistic scenarios. Furthermore, we train a multi-class Support Vector Machine (SVM) to classify bags of expressions into action classes. Comprehensive experiments on four publicly available datasets, KTH, UCF Sports, UCF11, and UCF50, show that the proposed model outperforms existing state-of-the-art human action recognition methods in terms of accuracy, achieving 99.21%, 98.60%, 96.94%, and 94.10%, respectively.
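The density of a spatio-temporal cube around a visual word can be sketched as counting how many interest points fall inside a box in (x, y, t) around a given point. This is only an illustration of the counting step under assumed cube dimensions; the paper's actual cube sizes and expression-formation rules are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in detected interest points in a video: (x, y, t) coordinates.
points = rng.integers(0, 100, size=(300, 3)).astype(float)

# Density of the spatio-temporal cube centered on one interest point:
# the number of OTHER points inside a box of half-widths rs (space)
# and rt (time) around it.
def cube_density(center, points, rs=15, rt=10):
    dx = np.abs(points[:, 0] - center[0]) <= rs
    dy = np.abs(points[:, 1] - center[1]) <= rs
    dt = np.abs(points[:, 2] - center[2]) <= rt
    return int(np.sum(dx & dy & dt)) - 1  # exclude the center itself

d = cube_density(points[0], points)
```

Unlike a fixed-k neighborhood, this count adapts to how crowded the cube actually is, which is the property the abstract credits for robustness to occlusion and viewpoint change.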

