Multi-scale image semantic recognition with hierarchical visual vocabulary

2011 ◽  
Vol 8 (3) ◽  
pp. 931-951 ◽  
Author(s):  
Xinghao Jiang ◽  
Tanfeng Sun ◽  
Fu Guanglei

Local features have proven effective in image and video semantic analysis. The bag-of-visual-words (BoVW) scheme clusters local features to form a visual vocabulary containing a number of words, where each word is the center of one feature cluster; the vocabulary is then used to recognize image semantics. In this paper, a new scheme for constructing a semantic-binding hierarchical visual vocabulary is proposed, and the attributes and relationships of the semantic nodes in the model are discussed. The hierarchical semantic model organizes multi-scale semantics into a level-by-level structure. Experiments are performed on the LabelMe dataset, where the performance of our scheme is evaluated and compared with the traditional BoVW scheme; the experimental results demonstrate the efficiency and flexibility of our scheme.
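The vocabulary construction this abstract builds on is the standard flat BoVW pipeline: cluster local descriptors, treat the cluster centers as words, and describe an image by its word-frequency histogram. A minimal numpy sketch under those assumptions (toy random features stand in for real SIFT-like descriptors; not the authors' hierarchical model):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_vocabulary(descriptors, k, iters=10):
    """Cluster local descriptors with plain k-means; the centers are the visual words."""
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bovw_histogram(descriptors, vocabulary):
    """Quantize each descriptor to its nearest word and count word occurrences."""
    d = np.linalg.norm(descriptors[:, None] - vocabulary[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()  # normalized word-frequency histogram

# toy 64-D "local features" standing in for SIFT-like descriptors
feats = rng.normal(size=(200, 64))
vocab = build_vocabulary(feats, k=8)
h = bovw_histogram(feats, vocab)
```

A hierarchical vocabulary, as proposed in the paper, would replace the single flat `build_vocabulary` call with clustering at multiple semantic levels.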

Sensors ◽  
2020 ◽  
Vol 20 (19) ◽  
pp. 5559
Author(s):  
Minji Seo ◽  
Myungho Kim

Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested on different datasets, their performance degrades. Cross-corpus SER research identifies speech emotion across different corpora and languages, and recent work has aimed to improve generalization. To improve cross-corpus SER performance, we pretrained on the log-mel spectrograms of the source dataset using our visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules. To train on the target dataset, we extracted a feature vector using a bag of visual words (BoVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BoVW helps the VACNN learn both global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method achieves an overall accuracy of 83.33%, 86.92%, and 75.00% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and the Surrey Audio-Visual Expressed Emotion (SAVEE) database, respectively. Experimental results on RAVDESS, EmoDB, and SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% over existing state-of-the-art cross-corpus SER approaches.
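The "frequency histogram of visual words" over a log-mel spectrogram can be sketched as follows: tile the spectrogram into patches, quantize each patch against a codebook, and histogram the word assignments. This is a numpy sketch with toy data and a random codebook, not the authors' trained pipeline (patch size and codebook size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def spectrogram_patches(spec, size=8):
    """Tile a log-mel spectrogram into non-overlapping size x size patches."""
    h, w = spec.shape
    patches = [spec[i:i + size, j:j + size].ravel()
               for i in range(0, h - size + 1, size)
               for j in range(0, w - size + 1, size)]
    return np.array(patches)

def bovw_vector(patches, codebook):
    """Assign each patch to its nearest visual word; return the word histogram."""
    d = ((patches[:, None] - codebook[None]) ** 2).sum(axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(codebook)).astype(float)
    return hist / hist.sum()

spec = rng.normal(size=(64, 128))     # stand-in log-mel spectrogram (mels x frames)
codebook = rng.normal(size=(16, 64))  # 16 visual words over flattened 8x8 patches
v = bovw_vector(spectrogram_patches(spec), codebook)
```

In the paper, a vector like `v` is the local-feature summary that complements the CNN's learned global features.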


2020 ◽  
Vol 7 (2) ◽  
pp. 349
Author(s):  
Budiman Baso ◽  
Nanik Suciati

<p class="Abstrak">Ragam motif pada tenun Nusa Tenggara Timur (NTT) seperti flora, fauna dan geometris menjadi suatu keunikan yang dapat membedakan daerah asal dan jenis dari tenun tersebut. Pada penelitian ini, sistem temu kembali citra berbasis isi atau <em>Content-Based Image Retrieval</em> (CBIR) diimplementasikan pada citra tenun NTT sehingga user dapat mencari citra tenun pada <em>database</em> menggunakan citra <em>query </em>berdasarkan fitur visual yang terkandung dalam citra. Seringkali citra <em>query</em> yang diinputkan <em>user</em> memiliki skala, rotasi dan pencahayaan yang bervariasi, sehingga diperlukan suatu metode ektraksi fitur yang dapat mengakomodasi variasi tersebut. Sistem temu kembali citra tenun pada penelitian ini menggunakan model <em>Bag of Visual Words</em> (BoVW) dari <em>keypoints</em> pada citra yang diekstrak dengan metode <em>Speeded Up Robust Feature</em> (SURF). BoVW dibangun menggunakan K-Means untuk menghasilkan <em>visual vocabulary</em> dari <em>keypoints</em> pada seluruh citra <em>training</em>. Representasi BoVW diharapkan dapat menangani variasi skala dan rotasi pada citra. Sedangkan untuk mengatasi variasi pencahayaan pada citra, dilakukan perbaikan kualitas citra dengan menggunakan <em>Contrast Limited Adaptive Histogram Equalization</em> (CLAHE). Percobaan dilakukan dengan membandingkan kinerja dari representasi BoVW yang dibangun menggunakan fitur SURF dengan <em>Maximally Stable Extremal Regions</em> (MSER) pada temu kembali citra tenun. Hasil uji coba menunjukkan bahwa metode SURF menghasilkan rata-rata akurasi 89,86% dan waktu komputasi 9,94 detik, sedangkan MSER menghasilkan rata-rata akurasi 84,04% dan waktu komputasi 1,95 detik.</p><p class="Abstrak"> </p><p class="Abstrak"><em><strong>Abstract</strong></em></p><p class="Abstract"><em>The variety of motifs in East Nusa Tenggara tenun such as flora, fauna and geometric is an unique thing that can distinguish the region of origin and type of the tenun. 
In this study, the Content-Based Image Retrieval (CBIR) system is implemented in the tenun image. With Content-based techniques Users can search tenun images on the image database by using query images based on visual features contained in the image. Often the query image that the user enters has a different scale, rotation and lighting, so a feature extraction method is needed that can accommodate these differences. The tenun image retrieval system in this study used the Bag of Visual Words (BoVW) model of the keypoints in the extracted image using the Speeded Up Robust Feature (SURF) method. BoVW was built using K-Means to produce visual vocabulary from keypoints on all training images. The representation of BoVW is expected to be able to handle scale variations and rotations in images. Whereas to overcome the lighting variations in the image, image quality improvement is done by using Contrast Limited Adaptive Histogram Equalization (CLAHE). The experiment was conducted by comparing the performance of the BoVW representation which was built using the SURF feature with Maximally Stable Extremal Regions (MSER) at the tenun image retrieval. The results of the trial showed that SURF obtained higher accuracy in all conditions of tenun image data with an average value of 89.86% whereas MSER obtained an average accuracy value of 84.04%. But MSER's computation time is 1.95 seconds faster than SURF which is 9.94 seconds.</em></p><p class="Abstrak"><em><strong><br /></strong></em></p>
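Once every database image is summarized as a BoVW histogram, retrieval reduces to ranking the database by similarity to the query histogram. A minimal numpy sketch of that ranking step, with toy histograms standing in for SURF-based BoVW vectors (the authors' actual similarity measure is not specified here; cosine similarity is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

def rank_by_histogram(query_hist, db_hists):
    """Rank database images by cosine similarity of their BoVW histograms."""
    q = query_hist / np.linalg.norm(query_hist)
    db = db_hists / np.linalg.norm(db_hists, axis=1, keepdims=True)
    sims = db @ q                       # cosine similarity to every database image
    return np.argsort(-sims), sims      # best match first

# toy database of 5 tenun images described by 32-bin BoVW histograms
db = rng.random((5, 32))
query = db[3] + rng.normal(scale=0.01, size=32)  # near-duplicate of image 3
order, sims = rank_by_histogram(query, db)
```

In the real system, CLAHE preprocessing and SURF keypoint extraction (e.g. via OpenCV) would produce the histograms this ranking consumes.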


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Bin Wang ◽  
Yu Liu ◽  
Wei Wang ◽  
Wei Xu ◽  
Maojun Zhang

We propose a Multiscale Locality-Constrained Spatiotemporal Coding (MLSC) method to improve the traditional bag-of-features (BoF) algorithm, which ignores the spatiotemporal relationships among local features in human action recognition from video. To model these relationships, MLSC incorporates the spatiotemporal position of each local feature into the feature coding process: it projects local features into sub space-time volumes (sub-STVs) and encodes them with locality-constrained linear coding. The group of sub-STV features obtained from one video via MLSC and max-pooling is used to classify that video. In the classification stage, Locality-Constrained Group Sparse Representation (LGSR) is adopted to exploit the intrinsic group information of the sub-STV features. Experimental results on the KTH, Weizmann, and UCF Sports datasets show that our method outperforms competing local spatiotemporal feature-based human action recognition methods.
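The locality-constrained linear coding step used inside MLSC can be sketched in a few lines: each feature is reconstructed only from its k nearest codebook atoms, with a sum-to-one constraint on the weights. A numpy sketch in the style of the standard LLC formulation (codebook, dimensions, and the regularizer are toy assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(3)

def llc_code(x, codebook, knn=5, reg=1e-4):
    """Locality-constrained linear coding: reconstruct x from its knn
    nearest codebook atoms with a sum-to-one weight constraint."""
    d = np.linalg.norm(codebook - x, axis=1)
    idx = np.argsort(d)[:knn]           # locality: use only the nearest atoms
    B = codebook[idx] - x               # shifted local bases
    C = B @ B.T + reg * np.eye(knn)     # regularized local covariance
    w = np.linalg.solve(C, np.ones(knn))
    w /= w.sum()                        # enforce sum-to-one
    code = np.zeros(len(codebook))
    code[idx] = w                       # sparse code over the full codebook
    return code

atoms = rng.normal(size=(32, 10))  # toy codebook of 32 atoms in 10-D
x = rng.normal(size=10)            # one local spatiotemporal feature
c = llc_code(x, atoms)
```

Max-pooling such codes over all features inside one sub-STV yields the sub-STV descriptor the abstract refers to.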


2013 ◽  
Vol 433-435 ◽  
pp. 778-782 ◽  
Author(s):  
Li Su ◽  
Jie Nan Liu ◽  
Lan Fang Ren ◽  
Feng Zhang

Considering the problems with conventional Bag-of-Visual-Words approaches, such as high time consumption, the synonymy and ambiguity of visual words, and the instability of clustering high-dimensional image local features, this paper presents a novel object classification approach based on randomized visual vocabularies and clustering aggregation. First, Exact Euclidean Locality Sensitive Hashing (E2LSH) is used to cluster the local features of the training dataset, constructing a group of randomized visual vocabularies. The randomized visual vocabularies are then aggregated using a clustering aggregation technique, yielding a Randomized Visual Vocabularies Aggregating Dictionary (RVVAD). Finally, visual word histograms are generated according to the dictionary, and Support Vector Machines are trained to accomplish image object categorization. Experimental results indicate that the expressive power of the dictionary is effectively improved and the object classification precision is increased substantially.
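The core idea, hashing features into buckets with several independent random hash functions so that each hash table acts as one randomized vocabulary, can be sketched with a p-stable LSH hash. This numpy sketch concatenates per-table histograms as a simple stand-in for the paper's clustering-aggregation step (bucket count, table count, and bucket width are all toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def lsh_words(features, a, b, w=4.0, n_buckets=16):
    """p-stable LSH: hash each feature to a bucket id, used as a visual word."""
    return (np.floor((features @ a + b) / w) % n_buckets).astype(int)

def randomized_histograms(features, n_tables=4, n_buckets=16):
    """One randomized 'vocabulary' per hash table; concatenating the
    per-table word histograms stands in for clustering aggregation."""
    dim = features.shape[1]
    hists = []
    for _ in range(n_tables):
        a = rng.normal(size=dim)        # random projection direction
        b = rng.uniform(0, 4.0)         # random offset
        words = lsh_words(features, a, b, n_buckets=n_buckets)
        hists.append(np.bincount(words, minlength=n_buckets))
    return np.concatenate(hists).astype(float)

feats = rng.normal(size=(300, 64))      # toy local features of one image
h = randomized_histograms(feats)
```

Each table assigns every feature to exactly one bucket, so multiple tables trade the instability of a single clustering for redundancy across randomized vocabularies.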


2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Jian Hou ◽  
Wei-Xue Liu ◽  
Xu E ◽  
Hamid Reza Karimi

Bag-of-visual-words has been shown to be a powerful image representation and has attained great success in many computer vision and pattern recognition applications. Usually, for a given dataset, researchers choose to build a dataset-specific visual vocabulary, and the problem of deriving a universal visual vocabulary is rarely addressed. Based on previous work on classification performance with respect to visual vocabulary size, we arrive at the hypothesis that a universal visual vocabulary can be obtained by taking into account the similarity extent of the keypoints represented by one visual word. We then propose a similarity-threshold-based clustering method to calculate the optimal vocabulary size, where the universal similarity threshold can be obtained empirically. With the optimal vocabulary size, the optimal visual vocabularies of limited size from three datasets are shown to be exchangeable and therefore universal. This result indicates that a universal and compact visual vocabulary can be built from a dataset that is not too small. Our work narrows the gap between bag-of-visual-words and bag-of-words, where a relatively fixed vocabulary can be used across different text datasets.
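A similarity-threshold-based clustering can be sketched with a leader-style pass: a keypoint joins an existing word if it is similar enough to that word's center, otherwise it founds a new word, and the resulting word count is the vocabulary size. A numpy sketch under those assumptions (cosine similarity and the threshold value are illustrative, not the paper's empirical threshold):

```python
import numpy as np

rng = np.random.default_rng(5)

def threshold_vocabulary(descriptors, sim_threshold=0.9):
    """Leader-style clustering: a descriptor joins an existing word if its
    cosine similarity to that word's center reaches the threshold; otherwise
    it founds a new word. The final word count is the vocabulary size."""
    centers = []
    for x in descriptors:
        x = x / np.linalg.norm(x)
        if centers and max(c @ x for c in centers) >= sim_threshold:
            continue                    # covered by an existing visual word
        centers.append(x)               # start a new visual word
    return np.array(centers)

feats = rng.normal(size=(500, 16))
vocab = threshold_vocabulary(feats, sim_threshold=0.5)
```

The key property is that the vocabulary size is not chosen in advance: it falls out of the threshold, which is what lets the paper compare "optimal" sizes across datasets.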

