scholarly journals Two-Level Attentions and Grouping Attention Convolutional Network for Fine-Grained Image Classification

2019 ◽  
Vol 9 (9) ◽  
pp. 1939 ◽  
Author(s):  
Yadong Yang ◽  
Xiaofeng Wang ◽  
Quan Zhao ◽  
Tingting Sui

The focus of fine-grained image classification tasks is to ignore interference information and grasp local features. This challenge is what the visual attention mechanism excels at. Firstly, we have constructed a two-level attention convolutional network, which characterizes the object-level attention and the pixel-level attention. Then, we combine the two kinds of attention through a second-order response transform algorithm. Furthermore, we propose a clustering-based grouping attention model, which implies the part-level attention. The grouping attention method is to stretch all the semantic features, in a deeper convolution layer of the network, into vectors. These vectors are clustered by a vector dot product, and each category represents a special semantic. The grouping attention algorithm implements the functions of group convolution and feature clustering, which can greatly reduce the network parameters and improve the recognition rate and interpretability of the network. Finally, the low-level visual features and high-level semantic information are merged by a multi-level feature fusion method to accurately classify fine-grained images. We have achieved good results without using pre-training networks and fine-tuning techniques.

Sensors ◽  
2021 ◽  
Vol 21 (23) ◽  
pp. 7877
Author(s):  
Yang Li ◽  
Guanghui Han ◽  
Xiujian Liu

Nasopharyngeal Carcinoma segmentation in magnetic resonance imagery (MRI) is vital to radiotherapy. Exact dose delivery hinges on an accurate delineation of the gross tumor volume (GTV). However, the large-scale variation in tumor volume is intractable, and the performance of current models is mostly unsatisfactory with indistinguishable and blurred boundaries of segmentation results of tiny tumor volume. To address the problem, we propose a densely connected deep convolutional network consisting of an encoder network and a corresponding decoder network, which extracts high-level semantic features from different levels and uses low-level spatial features concurrently to obtain fine-grained segmented masks. Skip-connection architecture is involved and modified to propagate spatial information to the decoder network. Preliminary experiments are conducted on 30 patients. Experimental results show our model outperforms all baseline models, with improvements of 4.17%. An ablation study is performed, and the effectiveness of the novel loss function is validated.


PLoS ONE ◽  
2021 ◽  
Vol 16 (5) ◽  
pp. e0250782
Author(s):  
Bin Wang ◽  
Bin Xu

With the rapid development of Unmanned Aerial Vehicles, vehicle detection in aerial images plays an important role in different applications. Comparing with general object detection problems, vehicle detection in aerial images is still a challenging research topic since it is plagued by various unique factors, e.g. different camera angle, small vehicle size and complex background. In this paper, a Feature Fusion Deep-Projection Convolution Neural Network is proposed to enhance the ability to detect small vehicles in aerial images. The backbone of the proposed framework utilizes a novel residual block named stepwise res-block to explore high-level semantic features as well as conserve low-level detail features at the same time. A specially designed feature fusion module is adopted in the proposed framework to further balance the features obtained from different levels of the backbone. A deep-projection deconvolution module is used to minimize the impact of the information contamination introduced by down-sampling/up-sampling processes. The proposed framework has been evaluated by UCAS-AOD, VEDAI, and DOTA datasets. According to the evaluation results, the proposed framework outperforms other state-of-the-art vehicle detection algorithms for aerial images.


2019 ◽  
Vol 56 (11) ◽  
pp. 111001
Author(s):  
贺琪 Qi He ◽  
李瑶 Yao Li ◽  
宋巍 Wei Song ◽  
黄冬梅 Dongmei Huang ◽  
何盛琪 Shengqi He ◽  
...  

Author(s):  
Weichun Liu ◽  
Xiaoan Tang ◽  
Chenglin Zhao

Recently, deep trackers based on the siamese networking are enjoying increasing popularity in the tracking community. Generally, those trackers learn a high-level semantic embedding space for feature representation but lose low-level fine-grained details. Meanwhile, the learned high-level semantic features are not updated during online tracking, which results in tracking drift in presence of target appearance variation and similar distractors. In this paper, we present a novel end-to-end trainable Convolutional Neural Network (CNN) based on the siamese network for distractor-aware tracking. It enhances target appearance representation in both the offline training stage and online tracking stage. In the offline training stage, this network learns both the low-level fine-grained details and high-level coarse-grained semantics simultaneously in a multi-task learning framework. The low-level features with better resolution are complementary to semantic features and able to distinguish the foreground target from background distractors. In the online stage, the learned low-level features are fed into a correlation filter layer and updated in an interpolated manner to encode target appearance variation adaptively. The learned high-level features are fed into a cross-correlation layer without online update. Therefore, the proposed tracker benefits from both the adaptability of the fine-grained correlation filter and the generalization capability of the semantic embedding. Extensive experiments are conducted on the public OTB100 and UAV123 benchmark datasets. Our tracker achieves state-of-the-art performance while running with a real-time frame-rate.


Sensors ◽  
2020 ◽  
Vol 20 (18) ◽  
pp. 5279
Author(s):  
Yang Li ◽  
Huahu Xu ◽  
Junsheng Xiao

Language-based person search retrieves images of a target person using natural language description and is a challenging fine-grained cross-modal retrieval task. A novel hybrid attention network is proposed for the task. The network includes the following three aspects: First, a cubic attention mechanism for person image, which combines cross-layer spatial attention and channel attention. It can fully excavate both important midlevel details and key high-level semantics to obtain better discriminative fine-grained feature representation of a person image. Second, a text attention network for language description, which is based on bidirectional LSTM (BiLSTM) and self-attention mechanism. It can better learn the bidirectional semantic dependency and capture the key words of sentences, so as to extract the context information and key semantic features of the language description more effectively and accurately. Third, a cross-modal attention mechanism and a joint loss function for cross-modal learning, which can pay more attention to the relevant parts between text and image features. It can better exploit both the cross-modal and intra-modal correlation and can better solve the problem of cross-modal heterogeneity. Extensive experiments have been conducted on the CUHK-PEDES dataset. Our approach obtains higher performance than state-of-the-art approaches, demonstrating the advantage of the approach we propose.


2019 ◽  
Vol 11 (13) ◽  
pp. 1617 ◽  
Author(s):  
Jicheng Wang ◽  
Li Shen ◽  
Wenfan Qiao ◽  
Yanshuai Dai ◽  
Zhilin Li

The classification of very-high-resolution (VHR) remote sensing images is essential in many applications. However, high intraclass and low interclass variations in these kinds of images pose serious challenges. Fully convolutional network (FCN) models, which benefit from a powerful feature learning ability, have shown impressive performance and great potential. Nevertheless, only classification results with coarse resolution can be obtained from the original FCN method. Deep feature fusion is often employed to improve the resolution of outputs. Existing strategies for such fusion are not capable of properly utilizing the low-level features and considering the importance of features at different scales. This paper proposes a novel, end-to-end, fully convolutional network to integrate a multiconnection ResNet model and a class-specific attention model into a unified framework to overcome these problems. The former fuses multilevel deep features without introducing any redundant information from low-level features. The latter can learn the contributions from different features of each geo-object at each scale. Extensive experiments on two open datasets indicate that the proposed method can achieve class-specific scale-adaptive classification results and it outperforms other state-of-the-art methods. The results were submitted to the International Society for Photogrammetry and Remote Sensing (ISPRS) online contest for comparison with more than 50 other methods. The results indicate that the proposed method (ID: SWJ_2) ranks #1 in terms of overall accuracy, even though no additional digital surface model (DSM) data that were offered by ISPRS were used and no postprocessing was applied.


Author(s):  
Xing Xu ◽  
Yifan Wang ◽  
Yixuan He ◽  
Yang Yang ◽  
Alan Hanjalic ◽  
...  

Image-sentence matching is a challenging task in the field of language and vision, which aims at measuring the similarities between images and sentence descriptions. Most existing methods independently map the global features of images and sentences into a common space to calculate the image-sentence similarity. However, the image-sentence similarity obtained by these methods may be coarse as (1) an intermediate common space is introduced to implicitly match the heterogeneous features of images and sentences in a global level, and (2) only the inter-modality relations of images and sentences are captured while the intra-modality relations are ignored. To overcome the limitations, we propose a novel Cross-Modal Hybrid Feature Fusion (CMHF) framework for directly learning the image-sentence similarity by fusing multimodal features with inter- and intra-modality relations incorporated. It can robustly capture the high-level interactions between visual regions in images and words in sentences, where flexible attention mechanisms are utilized to generate effective attention flows within and across the modalities of images and sentences. A structured objective with ranking loss constraint is formed in CMHF to learn the image-sentence similarity based on the fused fine-grained features of different modalities bypassing the usage of intermediate common space. Extensive experiments and comprehensive analysis performed on two widely used datasets—Microsoft COCO and Flickr30K—show the effectiveness of the hybrid feature fusion framework in CMHF, in which the state-of-the-art matching performance is achieved by our proposed CMHF method.


2021 ◽  
Vol 11 (8) ◽  
pp. 2231-2242
Author(s):  
Fei Gao ◽  
Kai Qiao ◽  
Jinjin Hai ◽  
Bin Yan ◽  
Minghui Wu ◽  
...  

The goal of this research is to achieve accurate segmentation of liver tumors in noncontrast T2-weighted magnetic resonance imaging. As liver tumors and adjacent organs are represented by pixels of very similar gray intensity, segmentation is challenging, and the presence of different sizes of liver tumor makes segmentation more difficult. Differing from previous work to capture contextual information using multiscale feature fusion with concatenation, attention mechanism is added to our segmentation model to extract precise global contextual information for pixel labeling without requiring complex dilated convolution. This study describe a liver lesion segmentation model derived from FC-DenseNet with attention mechanism. Specifically, a global attention module (GAM) is added to up-sampling path, and high-level features are processed by the GAM to generating weighting information for guiding high resolution detail features recovery. High-level features are very effective for accurate category classification, but relatively weak at pixel classification and predicting restoration of the original resolution, so the fusion of high-level semantic features and low-level detail features can improve segmentation accuracy. A weighted focal loss function is used to solve the problem of lesion area occupying a relatively low proportion of the whole image, and to deal with the disequilibrium of foreground and background in the training liver lesion images. Experimental results show our segmentation model can automatically segment liver tumors from complete MRI images, and the addition of the GAM model can effectively improve liver tumor segmentation. Our algorithms have obvious advantages over other CNN algorithms and traditional manual methods of feature extraction.


Sign in / Sign up

Export Citation Format

Share Document