Multi-scale image semantic recognition with hierarchical visual vocabulary

2011 ◽  
Vol 8 (3) ◽  
pp. 931-951 ◽  
Author(s):  
Xinghao Jiang ◽  
Tanfeng Sun ◽  
Fu Guanglei

Local features have proven effective in image and video semantic analysis. The bag-of-visual-words (BoVW) scheme clusters local features to form a visual vocabulary containing a number of words, where each word is the center of one feature cluster; the vocabulary is then used to recognize image semantics. In this paper, a new scheme for constructing a semantic-binding hierarchical visual vocabulary is proposed, and the attributes and relationships of the semantic nodes in the model are discussed. The hierarchical semantic model organizes multi-scale semantics into a level-by-level structure. Experiments are performed on the LabelMe dataset, where the performance of our scheme is evaluated and compared with the traditional BoVW scheme; the experimental results demonstrate the efficiency and flexibility of our scheme.
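The vocabulary construction this abstract builds on is the standard flat BoVW pipeline: cluster local descriptors, treat the cluster centers as words, and describe an image by its word-frequency histogram. A minimal numpy sketch under those assumptions (toy random features stand in for real SIFT-like descriptors; not the authors' hierarchical model):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_vocabulary(descriptors, k, iters=10):
    """Cluster local descriptors with plain k-means; the centers are the visual words."""
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bovw_histogram(descriptors, vocabulary):
    """Quantize each descriptor to its nearest word and count word occurrences."""
    d = np.linalg.norm(descriptors[:, None] - vocabulary[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()  # normalized word-frequency histogram

# toy 64-D "local features" standing in for SIFT-like descriptors
feats = rng.normal(size=(200, 64))
vocab = build_vocabulary(feats, k=8)
h = bovw_histogram(feats, vocab)
```

A hierarchical vocabulary, as proposed in the paper, would replace the single flat `build_vocabulary` call with clustering at multiple semantic levels.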

Sensors ◽  
2020 ◽  
Vol 20 (19) ◽  
pp. 5559
Author(s):  
Minji Seo ◽  
Myungho Kim

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested on different datasets, their performance degrades. Cross-corpus SER research identifies speech emotion across different corpora and languages, and recent work has aimed to improve generalization. To improve cross-corpus SER performance, we pretrained on the log-mel spectrograms of the source dataset using our visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules. To train on the target dataset, we extracted a feature vector using a bag of visual words (BoVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BoVW helps the VACNN learn both global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method achieves an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and the Surrey Audio-Visual Expressed Emotion (SAVEE) database, respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% over existing state-of-the-art cross-corpus SER approaches.
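The "frequency histogram of visual words" over a log-mel spectrogram can be sketched as follows: tile the spectrogram into patches, quantize each patch against a codebook, and histogram the word assignments. This is a numpy sketch with toy data and a random codebook, not the authors' trained pipeline (patch size and codebook size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def spectrogram_patches(spec, size=8):
    """Tile a log-mel spectrogram into non-overlapping size x size patches."""
    h, w = spec.shape
    patches = [spec[i:i + size, j:j + size].ravel()
               for i in range(0, h - size + 1, size)
               for j in range(0, w - size + 1, size)]
    return np.array(patches)

def bovw_vector(patches, codebook):
    """Assign each patch to its nearest visual word; return the word histogram."""
    d = ((patches[:, None] - codebook[None]) ** 2).sum(axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(codebook)).astype(float)
    return hist / hist.sum()

spec = rng.normal(size=(64, 128))     # stand-in log-mel spectrogram (mels x frames)
codebook = rng.normal(size=(16, 64))  # 16 visual words over flattened 8x8 patches
v = bovw_vector(spectrogram_patches(spec), codebook)
```

In the paper, a vector like `v` is the local-feature summary that complements the CNN's learned global features.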


2020 ◽  
Vol 7 (2) ◽  
pp. 349
Author(s):  
Budiman Baso ◽  
Nanik Suciati

<p class="Abstrak">Ragam motif pada tenun Nusa Tenggara Timur (NTT) seperti flora, fauna dan geometris menjadi suatu keunikan yang dapat membedakan daerah asal dan jenis dari tenun tersebut. Pada penelitian ini, sistem temu kembali citra berbasis isi atau <em>Content-Based Image Retrieval</em> (CBIR) diimplementasikan pada citra tenun NTT sehingga user dapat mencari citra tenun pada <em>database</em> menggunakan citra <em>query </em>berdasarkan fitur visual yang terkandung dalam citra. Seringkali citra <em>query</em> yang diinputkan <em>user</em> memiliki skala, rotasi dan pencahayaan yang bervariasi, sehingga diperlukan suatu metode ektraksi fitur yang dapat mengakomodasi variasi tersebut. Sistem temu kembali citra tenun pada penelitian ini menggunakan model <em>Bag of Visual Words</em> (BoVW) dari <em>keypoints</em> pada citra yang diekstrak dengan metode <em>Speeded Up Robust Feature</em> (SURF). BoVW dibangun menggunakan K-Means untuk menghasilkan <em>visual vocabulary</em> dari <em>keypoints</em> pada seluruh citra <em>training</em>. Representasi BoVW diharapkan dapat menangani variasi skala dan rotasi pada citra. Sedangkan untuk mengatasi variasi pencahayaan pada citra, dilakukan perbaikan kualitas citra dengan menggunakan <em>Contrast Limited Adaptive Histogram Equalization</em> (CLAHE). Percobaan dilakukan dengan membandingkan kinerja dari representasi BoVW yang dibangun menggunakan fitur SURF dengan <em>Maximally Stable Extremal Regions</em> (MSER) pada temu kembali citra tenun. Hasil uji coba menunjukkan bahwa metode SURF menghasilkan rata-rata akurasi 89,86% dan waktu komputasi 9,94 detik, sedangkan MSER menghasilkan rata-rata akurasi 84,04% dan waktu komputasi 1,95 detik.</p><p class="Abstrak"> </p><p class="Abstrak"><em><strong>Abstract</strong></em></p><p class="Abstract"><em>The variety of motifs in East Nusa Tenggara tenun such as flora, fauna and geometric is an unique thing that can distinguish the region of origin and type of the tenun. 
In this study, the Content-Based Image Retrieval (CBIR) system is implemented in the tenun image. With Content-based techniques Users can search tenun images on the image database by using query images based on visual features contained in the image. Often the query image that the user enters has a different scale, rotation and lighting, so a feature extraction method is needed that can accommodate these differences. The tenun image retrieval system in this study used the Bag of Visual Words (BoVW) model of the keypoints in the extracted image using the Speeded Up Robust Feature (SURF) method. BoVW was built using K-Means to produce visual vocabulary from keypoints on all training images. The representation of BoVW is expected to be able to handle scale variations and rotations in images. Whereas to overcome the lighting variations in the image, image quality improvement is done by using Contrast Limited Adaptive Histogram Equalization (CLAHE). The experiment was conducted by comparing the performance of the BoVW representation which was built using the SURF feature with Maximally Stable Extremal Regions (MSER) at the tenun image retrieval. The results of the trial showed that SURF obtained higher accuracy in all conditions of tenun image data with an average value of 89.86% whereas MSER obtained an average accuracy value of 84.04%. But MSER's computation time is 1.95 seconds faster than SURF which is 9.94 seconds.</em></p><p class="Abstrak"><em><strong><br /></strong></em></p>
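Once every database image is summarized as a BoVW histogram, retrieval reduces to ranking the database by similarity to the query histogram. A minimal numpy sketch of that ranking step, with toy histograms standing in for SURF-based BoVW vectors (the authors' actual similarity measure is not specified here; cosine similarity is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

def rank_by_histogram(query_hist, db_hists):
    """Rank database images by cosine similarity of their BoVW histograms."""
    q = query_hist / np.linalg.norm(query_hist)
    db = db_hists / np.linalg.norm(db_hists, axis=1, keepdims=True)
    sims = db @ q                       # cosine similarity to every database image
    return np.argsort(-sims), sims      # best match first

# toy database of 5 tenun images described by 32-bin BoVW histograms
db = rng.random((5, 32))
query = db[3] + rng.normal(scale=0.01, size=32)  # near-duplicate of image 3
order, sims = rank_by_histogram(query, db)
```

In the real system, CLAHE preprocessing and SURF keypoint extraction (e.g. via OpenCV) would produce the histograms this ranking consumes.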


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Bin Wang ◽  
Yu Liu ◽  
Wei Wang ◽  
Wei Xu ◽  
Maojun Zhang

We propose a Multiscale Locality-Constrained Spatiotemporal Coding (MLSC) method to improve the traditional bag-of-features (BoF) algorithm, which ignores the spatiotemporal relationships among local features in human action recognition from video. To model these relationships, MLSC incorporates the spatiotemporal position of each local feature into the feature coding process: it projects local features into sub space-time volumes (sub-STVs) and encodes them with locality-constrained linear coding. The group of sub-STV features obtained from one video via MLSC and max-pooling is used to classify that video. In the classification stage, Locality-Constrained Group Sparse Representation (LGSR) is adopted to exploit the intrinsic group information of the sub-STV features. Experimental results on the KTH, Weizmann, and UCF Sports datasets show that our method outperforms competing local spatiotemporal feature-based human action recognition methods.
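The locality-constrained linear coding step used inside MLSC can be sketched in a few lines: each feature is reconstructed only from its k nearest codebook atoms, with a sum-to-one constraint on the weights. A numpy sketch in the style of the standard LLC formulation (codebook, dimensions, and the regularizer are toy assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(3)

def llc_code(x, codebook, knn=5, reg=1e-4):
    """Locality-constrained linear coding: reconstruct x from its knn
    nearest codebook atoms with a sum-to-one weight constraint."""
    d = np.linalg.norm(codebook - x, axis=1)
    idx = np.argsort(d)[:knn]           # locality: use only the nearest atoms
    B = codebook[idx] - x               # shifted local bases
    C = B @ B.T + reg * np.eye(knn)     # regularized local covariance
    w = np.linalg.solve(C, np.ones(knn))
    w /= w.sum()                        # enforce sum-to-one
    code = np.zeros(len(codebook))
    code[idx] = w                       # sparse code over the full codebook
    return code

atoms = rng.normal(size=(32, 10))  # toy codebook of 32 atoms in 10-D
x = rng.normal(size=10)            # one local spatiotemporal feature
c = llc_code(x, atoms)
```

Max-pooling such codes over all features inside one sub-STV yields the sub-STV descriptor the abstract refers to.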


2013 ◽  
Vol 433-435 ◽  
pp. 778-782 ◽  
Author(s):  
Li Su ◽  
Jie Nan Liu ◽  
Lan Fang Ren ◽  
Feng Zhang

Considering the problems with conventional Bag-of-Visual-Words approaches, such as high time consumption, the synonymy and ambiguity of visual words, and the instability of clustering high-dimensional image local features, this paper presents a novel object classification approach based on randomized visual vocabularies and clustering aggregation. First, Exact Euclidean Locality Sensitive Hashing (E2LSH) is used to cluster the local features of the training dataset, constructing a group of randomized visual vocabularies. The randomized visual vocabularies are then aggregated using a clustering aggregation technique, yielding a Randomized Visual Vocabularies Aggregating Dictionary (RVVAD). Finally, visual word histograms are generated according to the dictionary, and Support Vector Machines are trained to accomplish image object categorization. Experimental results indicate that the expressive power of the dictionary is effectively improved and the object classification precision is increased substantially.
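The core idea, hashing features into buckets with several independent random hash functions so that each hash table acts as one randomized vocabulary, can be sketched with a p-stable LSH hash. This numpy sketch concatenates per-table histograms as a simple stand-in for the paper's clustering-aggregation step (bucket count, table count, and bucket width are all toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def lsh_words(features, a, b, w=4.0, n_buckets=16):
    """p-stable LSH: hash each feature to a bucket id, used as a visual word."""
    return (np.floor((features @ a + b) / w) % n_buckets).astype(int)

def randomized_histograms(features, n_tables=4, n_buckets=16):
    """One randomized 'vocabulary' per hash table; concatenating the
    per-table word histograms stands in for clustering aggregation."""
    dim = features.shape[1]
    hists = []
    for _ in range(n_tables):
        a = rng.normal(size=dim)        # random projection direction
        b = rng.uniform(0, 4.0)         # random offset
        words = lsh_words(features, a, b, n_buckets=n_buckets)
        hists.append(np.bincount(words, minlength=n_buckets))
    return np.concatenate(hists).astype(float)

feats = rng.normal(size=(300, 64))      # toy local features of one image
h = randomized_histograms(feats)
```

Each table assigns every feature to exactly one bucket, so multiple tables trade the instability of a single clustering for redundancy across randomized vocabularies.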


2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Jian Hou ◽  
Wei-Xue Liu ◽  
Xu E ◽  
Hamid Reza Karimi

Bag-of-visual-words has been shown to be a powerful image representation and has attained great success in many computer vision and pattern recognition applications. Usually, for a given dataset, researchers choose to build a dataset-specific visual vocabulary, and the problem of deriving a universal visual vocabulary is rarely addressed. Based on previous work on classification performance with respect to visual vocabulary size, we arrive at the hypothesis that a universal visual vocabulary can be obtained by taking into account the similarity extent of the keypoints represented by one visual word. We then propose a similarity-threshold-based clustering method to calculate the optimal vocabulary size, where the universal similarity threshold can be obtained empirically. With the optimal vocabulary size, the optimal visual vocabularies of limited size from three datasets are shown to be exchangeable and therefore universal. This result indicates that a universal and compact visual vocabulary can be built from a dataset that is not too small. Our work narrows the gap between bag-of-visual-words and bag-of-words, where a relatively fixed vocabulary can be used across different text datasets.
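A similarity-threshold-based clustering can be sketched with a leader-style pass: a keypoint joins an existing word if it is similar enough to that word's center, otherwise it founds a new word, and the resulting word count is the vocabulary size. A numpy sketch under those assumptions (cosine similarity and the threshold value are illustrative, not the paper's empirical threshold):

```python
import numpy as np

rng = np.random.default_rng(5)

def threshold_vocabulary(descriptors, sim_threshold=0.9):
    """Leader-style clustering: a descriptor joins an existing word if its
    cosine similarity to that word's center reaches the threshold; otherwise
    it founds a new word. The final word count is the vocabulary size."""
    centers = []
    for x in descriptors:
        x = x / np.linalg.norm(x)
        if centers and max(c @ x for c in centers) >= sim_threshold:
            continue                    # covered by an existing visual word
        centers.append(x)               # start a new visual word
    return np.array(centers)

feats = rng.normal(size=(500, 16))
vocab = threshold_vocabulary(feats, sim_threshold=0.5)
```

The key property is that the vocabulary size is not chosen in advance: it falls out of the threshold, which is what lets the paper compare "optimal" sizes across datasets.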

