Two-Level Attentions and Grouping Attention Convolutional Network for Fine-Grained Image Classification

Yadong Yang; Xiaofeng Wang; Quan Zhao; Tingting Sui

doi:10.3390/app9091939

Applied Sciences ◽

10.3390/app9091939 ◽

2019 ◽

Vol 9 (9) ◽

pp. 1939 ◽

Cited By ~ 5

Author(s):

Yadong Yang ◽

Xiaofeng Wang ◽

Quan Zhao ◽

Tingting Sui

Keyword(s):

Image Classification ◽

Feature Fusion ◽

Recognition Rate ◽

Fine Tuning ◽

Semantic Features ◽

Convolutional Network ◽

Attention Model ◽

Fine Grained ◽

Visual Attention Mechanism ◽

High Level

The focus of fine-grained image classification is to ignore interfering information and capture local features, a challenge at which visual attention mechanisms excel. First, we construct a two-level attention convolutional network that characterizes object-level attention and pixel-level attention. Then, we combine the two kinds of attention through a second-order response transform algorithm. Furthermore, we propose a clustering-based grouping attention model, which embodies part-level attention. The grouping attention method stretches all the semantic features in a deeper convolutional layer of the network into vectors. These vectors are clustered by vector dot product, and each resulting cluster represents a specific semantic. The grouping attention algorithm implements both group convolution and feature clustering, which greatly reduces the number of network parameters and improves the recognition rate and interpretability of the network. Finally, low-level visual features and high-level semantic information are merged by a multi-level feature fusion method to accurately classify fine-grained images. We achieved good results without using pre-trained networks or fine-tuning techniques.
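
The grouping attention step above stretches each deep-layer channel into a vector and clusters the vectors by dot product. The sketch below illustrates only that clustering idea, under stated assumptions: it uses spherical k-means (unit-normalized vectors, dot-product assignment) with a deterministic initialization, plain Python lists stand in for tensors, and all function names are hypothetical. It is not the authors' implementation and does not cover the second-order response transform or the group convolution that would consume the groups.

```python
import math

def flatten_channels(feature_maps):
    """Stretch each H x W channel map into one flat vector."""
    return [[v for row in fmap for v in row] for fmap in feature_maps]

def normalize(vec):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else [float(v) for v in vec]

def group_channels(feature_maps, num_groups, iterations=10):
    """Assign each channel to a group by dot-product (spherical k-means) clustering.

    Returns one group index per channel. Assumption: the first `num_groups`
    channels seed the centroids, which keeps the sketch deterministic.
    """
    vectors = [normalize(v) for v in flatten_channels(feature_maps)]
    if not 0 < num_groups <= len(vectors):
        raise ValueError("num_groups must be between 1 and the channel count")
    centroids = [list(vectors[g]) for g in range(num_groups)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the largest dot product.
        for i, vec in enumerate(vectors):
            scores = [sum(a * b for a, b in zip(vec, c)) for c in centroids]
            assignment[i] = max(range(num_groups), key=scores.__getitem__)
        # Update step: each centroid becomes the normalized mean of its members.
        for g in range(num_groups):
            members = [vectors[i] for i, a in enumerate(assignment) if a == g]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[g] = normalize(mean)
    return assignment
```

Channels whose activations peak in the same image region end up with the same group index; in a full network each group could then be routed to its own group convolution.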

DCNet: Densely Connected Deep Convolutional Encoder–Decoder Network for Nasopharyngeal Carcinoma Segmentation

Sensors ◽

10.3390/s21237877 ◽

2021 ◽

Vol 21 (23) ◽

pp. 7877

Author(s):

Yang Li ◽

Guanghui Han ◽

Xiujian Liu

Keyword(s):

Nasopharyngeal Carcinoma ◽

Large Scale ◽

Tumor Volume ◽

Spatial Information ◽

Semantic Features ◽

Convolutional Network ◽

Fine Grained ◽

Ablation Study ◽

High Level ◽

Convolutional Encoder

Nasopharyngeal carcinoma segmentation in magnetic resonance imaging (MRI) is vital to radiotherapy, because exact dose delivery hinges on an accurate delineation of the gross tumor volume (GTV). However, the large-scale variation in tumor volume is difficult to handle, and current models mostly perform unsatisfactorily, producing indistinguishable and blurred boundaries when segmenting tiny tumor volumes. To address this problem, we propose a densely connected deep convolutional network consisting of an encoder network and a corresponding decoder network, which extracts high-level semantic features from different levels while concurrently using low-level spatial features to obtain fine-grained segmentation masks. A skip-connection architecture is incorporated and modified to propagate spatial information to the decoder network. Preliminary experiments are conducted on 30 patients. Experimental results show that our model outperforms all baseline models, with improvements of 4.17%. An ablation study is performed, and the effectiveness of the novel loss function is validated.
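
The skip connections described above carry encoder features to the decoder. A minimal sketch of the standard U-Net-style variant follows, assuming nearest-neighbour upsampling and channel concatenation; the paper's modified skip connection and its dense connectivity are not reproduced, and the helper names are hypothetical.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbour upsampling of a single H x W map by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def skip_connect(decoder_maps, encoder_maps, factor=2):
    """Upsample decoder channels, then concatenate the encoder channels of equal size.

    Each argument is a list of channels; each channel is a list of rows.
    """
    upsampled = [upsample_nearest(m, factor) for m in decoder_maps]
    height, width = len(encoder_maps[0]), len(encoder_maps[0][0])
    if len(upsampled[0]) != height or len(upsampled[0][0]) != width:
        raise ValueError("spatial sizes must match after upsampling")
    # Channel-wise concatenation: low-level spatial detail + high-level semantics.
    return encoder_maps + upsampled
```

Concatenating (rather than adding) keeps the encoder's spatial detail in separate channels, leaving it to the following convolution to decide how to mix the two sources.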

A feature fusion deep-projection convolution neural network for vehicle detection in aerial images

PLoS ONE ◽

10.1371/journal.pone.0250782 ◽

2021 ◽

Vol 16 (5) ◽

pp. e0250782

Author(s):

Bin Wang ◽

Bin Xu

Keyword(s):

Neural Network ◽

Feature Fusion ◽

Rapid Development ◽

Vehicle Detection ◽

Convolution Neural Network ◽

Aerial Images ◽

Semantic Features ◽

General Object ◽

High Level ◽

The Impact

With the rapid development of unmanned aerial vehicles, vehicle detection in aerial images plays an important role in many applications. Compared with general object detection, vehicle detection in aerial images remains a challenging research topic because it is affected by several unique factors, e.g. varying camera angles, small vehicle sizes and complex backgrounds. In this paper, a Feature Fusion Deep-Projection Convolution Neural Network is proposed to enhance the ability to detect small vehicles in aerial images. The backbone of the proposed framework uses a novel residual block, named the stepwise res-block, to explore high-level semantic features while conserving low-level detail features. A specially designed feature fusion module further balances the features obtained from different levels of the backbone. A deep-projection deconvolution module is used to minimize the information contamination introduced by down-sampling and up-sampling. The proposed framework has been evaluated on the UCAS-AOD, VEDAI, and DOTA datasets, and the evaluation results show that it outperforms other state-of-the-art vehicle detection algorithms for aerial images.
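
The abstract does not specify how the feature fusion module balances the backbone levels. As one illustrative assumption, the sketch below resizes single-channel maps from several levels to the finest resolution and combines them with non-negative weights normalized to sum to roughly one, the "fast normalized fusion" rule popularized by EfficientDet's BiFPN and swapped in here only for illustration. In a real network the weights would be learned; here they are plain arguments, and the function names are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a single H x W map by an integer factor."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def fuse_levels(levels, weights, eps=1e-6):
    """Weighted fusion of maps from several backbone levels.

    levels: list of (single-channel map, integer upsampling factor to the finest level).
    weights: one weight per level; clipped at zero and normalized by their sum.
    """
    clipped = [max(w, 0.0) for w in weights]  # keep weights non-negative
    total = sum(clipped) + eps                # eps avoids division by zero
    norm = [w / total for w in clipped]
    resized = [upsample_nearest(m, f) for m, f in levels]
    height, width = len(resized[0]), len(resized[0][0])
    if any(len(r) != height or len(r[0]) != width for r in resized):
        raise ValueError("all levels must reach the same resolution")
    return [
        [sum(nw * r[y][x] for nw, r in zip(norm, resized)) for x in range(width)]
        for y in range(height)
    ]
```

Normalizing the weights keeps the fused map on the same scale as its inputs, so no single level can dominate simply by growing its weight.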

Fine-Grained Image Classification Based on Target Acquisition and Feature Fusion

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-030-82153-1_18 ◽

2021 ◽

pp. 209-221

Author(s):

Yan Chu ◽

Zhengkui Wang ◽

Lina Wang ◽

Qingchao Zhao ◽

Wen Shan

Keyword(s):

Image Classification ◽

Feature Fusion ◽

Target Acquisition ◽

Fine Grained

Multimodal Remote Sensing Image Classification with Small Sample Size Based on High-Level Feature Fusion

Laser & Optoelectronics Progress ◽

10.3788/lop56.111001 ◽

2019 ◽

Vol 56 (11) ◽

pp. 111001

Author(s):

贺琪 Qi He ◽

李瑶 Yao Li ◽

宋巍 Wei Song ◽

黄冬梅 Dongmei Huang ◽

何盛琪 Shengqi He ◽

...

Keyword(s):

Remote Sensing ◽

Sample Size ◽

Image Classification ◽

Feature Fusion ◽

Small Sample Size ◽

Remote Sensing Image ◽

Small Sample ◽

Remote Sensing Image Classification ◽

High Level ◽

High Level Feature

Distractor-Aware Tracking with Multi-Task and Dynamic Feature Learning

Journal of Circuits System and Computers ◽

10.1142/s0218126621500316 ◽

2020 ◽

pp. 2150031

Author(s):

Weichun Liu ◽

Xiaoan Tang ◽

Chenglin Zhao

Keyword(s):

Correlation Filter ◽

Coarse Grained ◽

Dynamic Feature ◽

Semantic Features ◽

Low Level ◽

Fine Grained ◽

Semantic Embedding ◽

Training Stage ◽

Online Tracking ◽

High Level

Recently, deep trackers based on Siamese networks have enjoyed increasing popularity in the tracking community. Generally, these trackers learn a high-level semantic embedding space for feature representation but lose low-level fine-grained details. Meanwhile, the learned high-level semantic features are not updated during online tracking, which results in tracking drift in the presence of target appearance variation and similar distractors. In this paper, we present a novel end-to-end trainable Convolutional Neural Network (CNN) based on the Siamese network for distractor-aware tracking. It enhances target appearance representation in both the offline training stage and the online tracking stage. In the offline training stage, the network learns low-level fine-grained details and high-level coarse-grained semantics simultaneously in a multi-task learning framework. The low-level features, which have better resolution, complement the semantic features and can distinguish the foreground target from background distractors. In the online stage, the learned low-level features are fed into a correlation filter layer and updated in an interpolated manner to adaptively encode target appearance variation. The learned high-level features are fed into a cross-correlation layer without online update. Therefore, the proposed tracker benefits from both the adaptability of the fine-grained correlation filter and the generalization capability of the semantic embedding. Extensive experiments are conducted on the public OTB100 and UAV123 benchmark datasets. Our tracker achieves state-of-the-art performance while running at a real-time frame rate.
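
Two operations in this abstract can be illustrated directly: the interpolated (running-average) update applied to the low-level branch, and the cross-correlation that scores a template against a search region. The sketch below is a minimal, assumption-laden illustration, not the authors' layers: the update rate is a hypothetical parameter, the correlation is a direct spatial loop (correlation-filter trackers usually work in the Fourier domain for speed), and the function names are hypothetical.

```python
def interpolate_update(model, new_estimate, rate):
    """Blend the stored template with the newest estimate: (1 - rate) * old + rate * new."""
    return [
        [(1 - rate) * m + rate * n for m, n in zip(m_row, n_row)]
        for m_row, n_row in zip(model, new_estimate)
    ]

def cross_correlate(search, template):
    """'Valid' 2-D cross-correlation; the response peak marks the likely target position."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            row.append(sum(search[y + i][x + j] * template[i][j]
                           for i in range(th) for j in range(tw)))
        response.append(row)
    return response
```

A small `rate` makes the template adapt slowly, which resists contamination by a briefly occluding distractor; a larger `rate` tracks fast appearance changes at the cost of more drift risk.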

Hybrid Attention Network for Language-Based Person Search

Sensors ◽

10.3390/s20185279 ◽

2020 ◽

Vol 20 (18) ◽

pp. 5279

Author(s):

Yang Li ◽

Huahu Xu ◽

Junsheng Xiao

Keyword(s):

Image Features ◽

Attention Mechanism ◽

Feature Representation ◽

Semantic Features ◽

Retrieval Task ◽

Attention Network ◽

Fine Grained ◽

Person Search ◽

High Level ◽

Language Description

Language-based person search retrieves images of a target person using a natural language description and is a challenging fine-grained cross-modal retrieval task. We propose a novel hybrid attention network for this task, consisting of three components. First, a cubic attention mechanism for person images combines cross-layer spatial attention and channel attention. It can fully mine both important mid-level details and key high-level semantics to obtain a more discriminative fine-grained feature representation of a person image. Second, a text attention network for the language description is based on a bidirectional LSTM (BiLSTM) and a self-attention mechanism. It better learns bidirectional semantic dependencies and captures the key words of sentences, so as to extract the context information and key semantic features of the description more effectively and accurately. Third, a cross-modal attention mechanism and a joint loss function for cross-modal learning pay more attention to the relevant parts of the text and image features. They better exploit both cross-modal and intra-modal correlations and better address cross-modal heterogeneity. Extensive experiments have been conducted on the CUHK-PEDES dataset. Our approach achieves higher performance than state-of-the-art approaches, demonstrating its advantage.
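
The text branch above applies self-attention on top of BiLSTM outputs to emphasize key words. A minimal sketch of dot-product attention pooling follows, assuming the per-word hidden states are already computed (for example, concatenated forward and backward BiLSTM outputs) and that a single query vector stands in for the learned attention parameters; both the inputs and the function names are hypothetical, and the paper's exact formulation may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, query):
    """Score each word against `query`, normalize with softmax, and return the weighted sum.

    hidden_states: one vector per word. query: stand-in for learned attention parameters.
    Returns (pooled sentence vector, per-word attention weights).
    """
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * state[d] for w, state in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights
```

The returned weights are what makes this kind of pooling interpretable: words with large weights are the ones the model treats as key to the description.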

Deep Feature Fusion with Integration of Residual Connection and Attention Model for Classification of VHR Remote Sensing Images

Remote Sensing ◽

10.3390/rs11131617 ◽

2019 ◽

Vol 11 (13) ◽

pp. 1617 ◽

Cited By ~ 5

Author(s):

Jicheng Wang ◽

Li Shen ◽

Wenfan Qiao ◽

Yanshuai Dai ◽

Zhilin Li

Keyword(s):

Remote Sensing ◽

Feature Fusion ◽

Learning Ability ◽

Remote Sensing Images ◽

Convolutional Network ◽

Fully Convolutional Network ◽

Attention Model ◽

Low Level ◽

Deep Feature

The classification of very-high-resolution (VHR) remote sensing images is essential in many applications. However, the high intraclass and low interclass variations in these images pose serious challenges. Fully convolutional network (FCN) models, which benefit from a powerful feature learning ability, have shown impressive performance and great potential. Nevertheless, the original FCN method can only produce classification results with coarse resolution. Deep feature fusion is often employed to improve the resolution of the outputs. Existing fusion strategies are not capable of properly utilizing low-level features or of considering the importance of features at different scales. This paper proposes a novel, end-to-end, fully convolutional network that integrates a multiconnection ResNet model and a class-specific attention model into a unified framework to overcome these problems. The former fuses multilevel deep features without introducing any redundant information from low-level features. The latter learns the contributions of different features of each geo-object at each scale. Extensive experiments on two open datasets indicate that the proposed method achieves class-specific, scale-adaptive classification results and outperforms other state-of-the-art methods. The results were submitted to the International Society for Photogrammetry and Remote Sensing (ISPRS) online contest for comparison with more than 50 other methods. The proposed method (ID: SWJ_2) ranks #1 in terms of overall accuracy, even though the additional digital surface model (DSM) data offered by ISPRS were not used and no postprocessing was applied.
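
The class-specific attention model is described as learning, for each geo-object class, how much each scale contributes. One plausible reading, sketched here as an assumption rather than the paper's formulation, is a per-class softmax over scales that weights per-scale class scores before summing them; in an FCN this would be applied at every pixel. The logits would be learned in practice and are plain arguments here; the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_specific_scale_fusion(scale_scores, scale_logits):
    """Fuse per-scale class scores with class-specific attention over scales.

    scale_scores[s][c]: score for class c predicted at scale s (one pixel).
    scale_logits[c][s]: attention logit of class c for scale s (hypothetical,
    would be learned in a real network).
    """
    num_classes = len(scale_scores[0])
    fused = []
    for c in range(num_classes):
        weights = softmax(scale_logits[c])  # one weight per scale, summing to 1
        fused.append(sum(w * scale_scores[s][c] for s, w in enumerate(weights)))
    return fused
```

Because the softmax is taken separately per class, a small object class can lean on fine scales while a large one leans on coarse scales, which matches the "class-specific scale-adaptive" behaviour the abstract describes.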

Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3458281 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-23

Author(s):

Xing Xu ◽

Yifan Wang ◽

Yixuan He ◽

Yang Yang ◽

Alan Hanjalic ◽

...

Keyword(s):

Feature Fusion ◽

Global Features ◽

Fine Grained ◽

Common Space ◽

Multimodal Features ◽

Sentence Similarity ◽

Language And Vision ◽

Sentence Matching ◽

High Level ◽

Ranking Loss

Image-sentence matching is a challenging task in the field of language and vision, which aims at measuring the similarities between images and sentence descriptions. Most existing methods independently map the global features of images and sentences into a common space to calculate the image-sentence similarity. However, the similarity obtained by these methods may be coarse because (1) an intermediate common space is introduced to implicitly match the heterogeneous features of images and sentences at a global level, and (2) only the inter-modality relations of images and sentences are captured while the intra-modality relations are ignored. To overcome these limitations, we propose a novel Cross-Modal Hybrid Feature Fusion (CMHF) framework that directly learns the image-sentence similarity by fusing multimodal features with inter- and intra-modality relations incorporated. It can robustly capture the high-level interactions between visual regions in images and words in sentences, using flexible attention mechanisms to generate effective attention flows within and across the image and sentence modalities. A structured objective with a ranking loss constraint is formed in CMHF to learn the image-sentence similarity from the fused fine-grained features of different modalities, bypassing the use of an intermediate common space. Extensive experiments and comprehensive analysis on two widely used datasets, Microsoft COCO and Flickr30K, show the effectiveness of the hybrid feature fusion framework, with CMHF achieving state-of-the-art matching performance.
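
The abstract mentions a ranking loss constraint but not its exact form. A common choice in image-sentence matching, assumed here for illustration, is a bidirectional hinge (triplet) ranking loss over a batch similarity matrix whose diagonal holds the matched pairs: every non-matching sentence for an image, and every non-matching image for a sentence, should score at least `margin` below the true pair. The margin value and function name are hypothetical.

```python
def bidirectional_ranking_loss(sim, margin=0.2):
    """Sum of hinge violations over all negatives, in both retrieval directions.

    sim[i][j]: similarity between image i and sentence j; matched pairs lie on the diagonal.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image i as the query: sentence j is a negative.
            loss += max(0.0, margin - sim[i][i] + sim[i][j])
            # Sentence i as the query: image j is a negative.
            loss += max(0.0, margin - sim[i][i] + sim[j][i])
    return loss
```

The loss is zero once every matched pair beats all of its negatives by the margin, so training effort concentrates on the pairs that are still confused.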

Fully Convolutional DenseNet with Attention Mechanism for Liver Lesion Segmentation in MRI Images

Journal of Medical Imaging and Health Informatics ◽

10.1166/jmihi.2021.3617 ◽

2021 ◽

Vol 11 (8) ◽

pp. 2231-2242

Author(s):

Fei Gao ◽

Kai Qiao ◽

Jinjin Hai ◽

Bin Yan ◽

Minghui Wu ◽

...

Keyword(s):

Liver Tumor ◽

Feature Fusion ◽

Liver Tumors ◽

Contextual Information ◽

Liver Lesion ◽

Attention Mechanism ◽

Tumor Segmentation ◽

Semantic Features ◽

Lesion Segmentation ◽

High Level

The goal of this research is to achieve accurate segmentation of liver tumors in noncontrast T2-weighted magnetic resonance imaging. Because liver tumors and adjacent organs are represented by pixels of very similar gray intensity, segmentation is challenging, and the varying sizes of liver tumors make it more difficult still. Unlike previous work that captures contextual information through multiscale feature fusion with concatenation, we add an attention mechanism to our segmentation model to extract precise global contextual information for pixel labeling without requiring complex dilated convolution. This study describes a liver lesion segmentation model derived from FC-DenseNet with an attention mechanism. Specifically, a global attention module (GAM) is added to the up-sampling path, and high-level features are processed by the GAM to generate weighting information that guides the recovery of high-resolution detail features. High-level features are very effective for accurate category classification but relatively weak at pixel classification and at restoring the original resolution, so fusing high-level semantic features with low-level detail features can improve segmentation accuracy. A weighted focal loss function is used to address the low proportion of the whole image occupied by the lesion area and the imbalance between foreground and background in the liver lesion training images. Experimental results show that our model can automatically segment liver tumors from complete MRI images and that adding the GAM effectively improves liver tumor segmentation. Our algorithms have clear advantages over other CNN algorithms and traditional manual feature extraction methods.
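
The weighted focal loss above targets the foreground/background imbalance. A minimal pixel-wise binary sketch follows, using the standard focal loss form FL = -alpha (1 - p)^gamma log(p) for lesion pixels and -(1 - alpha) p^gamma log(1 - p) for background pixels. The alpha and gamma defaults are illustrative only, not values reported by the authors, and the function name is hypothetical.

```python
import math

def weighted_focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over pixels.

    probs: predicted lesion (foreground) probabilities, one per pixel.
    targets: 0/1 labels. alpha weights foreground pixels, 1 - alpha background pixels;
    gamma > 0 down-weights easy, already well-classified pixels.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        if t == 1:
            total += -alpha * (1 - p) ** gamma * math.log(p)
        else:
            total += -(1 - alpha) * p ** gamma * math.log(1 - p)
    return total / len(probs)
```

With gamma = 0 the expression reduces to class-weighted binary cross-entropy; raising gamma shrinks the contribution of the many easy background pixels, so the small lesion region has more influence on the gradient.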

Multilayer feature fusion with parallel convolutional block for fine-grained image classification

Applied Intelligence ◽

10.1007/s10489-021-02573-2 ◽

2021 ◽

Author(s):

Lei Wang ◽

Kai He ◽

Xu Feng ◽

Xitao Ma

Keyword(s):

Image Classification ◽

Feature Fusion ◽

Fine Grained
