Structure Preserving Convolutional Attention for Image Captioning

Shichen Lu; Ruimin Hu; Jing Liu; Longteng Guo; Fei Zheng

doi:10.3390/app9142888

Applied Sciences ◽

10.3390/app9142888 ◽

2019 ◽

Vol 9 (14) ◽

pp. 2888 ◽

Cited By ~ 1

Author(s):

Shichen Lu ◽

Ruimin Hu ◽

Jing Liu ◽

Longteng Guo ◽

Fei Zheng

Keyword(s):

Spatial Structure ◽

Spatial Attention ◽

Large Scale ◽

Attention Mechanism ◽

Vector Representation ◽

Image Captioning ◽

Feature Maps ◽

Structure Preserving ◽

Convolution Operation

In image captioning, learning which image regions to attend to is necessary for focusing adaptively and precisely on the object semantics relevant to each decoded word. In this paper, we propose a convolutional attention module that preserves the spatial structure of the image by performing the convolution operation directly on the 2D feature maps. The proposed attention mechanism contains two components, convolutional spatial attention and cross-channel attention, which determine the regions to describe along the spatial and channel dimensions, respectively. Both attentions are computed at each decoding step. To preserve the spatial structure, both components operate directly on the entire feature maps with convolution operations rather than on the vector representation of each image grid. Experiments on two large-scale datasets (MSCOCO and Flickr30K) demonstrate the strong performance of the proposed method.
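
As a rough illustration of computing spatial attention directly on a 2D feature map instead of on flattened grid vectors, the following Python sketch applies a small convolution with zero padding and normalizes the scores with a softmax over all positions. The function names `conv2d_same` and `spatial_attention`, the averaging kernel, and the single-channel input are assumptions made for this example; the abstract does not give the module's actual layers, and the conditioning on the decoder state at each step is omitted here.

```python
import math

def conv2d_same(fmap, kernel):
    """2D cross-correlation with zero padding; output keeps the input's H x W shape."""
    h, w = len(fmap), len(fmap[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    y, x = i + di - ph, j + dj - pw
                    if 0 <= y < h and 0 <= x < w:  # positions outside the map count as zero
                        acc += fmap[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out

def spatial_attention(fmap, kernel):
    """Softmax over all H x W positions of the convolved map."""
    scores = conv2d_same(fmap, kernel)
    m = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - m) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

# Toy 4 x 4 feature map with a bright 2 x 2 block in the centre.
fmap = [[0.0, 0.0, 0.0, 0.0],
        [0.0, 2.0, 2.0, 0.0],
        [0.0, 2.0, 2.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
# Assumed averaging kernel; it stands in for learned convolution weights.
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
attn = spatial_attention(fmap, kernel)
```

Because the scores are computed on the grid itself, each attention weight depends on a position's 2D neighbourhood, which is the structural information lost when every grid cell is scored as an independent vector.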

Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms

Applied Sciences ◽

10.3390/app10124312 ◽

2020 ◽

Vol 10 (12) ◽

pp. 4312 ◽

Cited By ~ 1

Author(s):

Jie Xu ◽

Haoliang Wei ◽

Linke Li ◽

Qiuru Fu ◽

Jinhong Guo

Keyword(s):

Neural Network ◽

Spatial Attention ◽

Semantic Information ◽

Attention Mechanism ◽

Visual Features ◽

Feature Maps ◽

Global Features ◽

Model Based ◽

Video Description ◽

Video Visualization

Video description plays an important role in the field of intelligent imaging technology. Attention perception mechanisms are extensively applied in video description models based on deep learning. Most existing models use a temporal-spatial attention mechanism to improve accuracy. Temporal attention mechanisms can obtain the global features of a video, whereas spatial attention mechanisms obtain local features. Nevertheless, because each channel of the convolutional neural network (CNN) feature maps carries certain spatial semantic information, it is insufficient to merely divide the CNN features into regions and then apply a spatial attention mechanism. In this paper, we propose a temporal-spatial and channel attention mechanism that enables the model to take advantage of various video features and ensures the consistency of visual features between sentence descriptions, improving the model's performance. In addition, to demonstrate the effectiveness of the attention mechanism, this paper proposes a video visualization model based on the video description. Experimental results show that our model achieves good performance on the Microsoft Video Description (MSVD) dataset and a certain improvement on the Microsoft Research-Video to Text (MSR-VTT) dataset.

SSD7-FFAM: A Real-Time Object Detection Network Friendly to Embedded Devices from Scratch

Applied Sciences ◽

10.3390/app11031096 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1096

Author(s):

Qing Li ◽

Yingcheng Lin ◽

Wei He

Keyword(s):

Object Detection ◽

Real Time ◽

Large Scale ◽

Feature Fusion ◽

Contextual Information ◽

Attention Mechanism ◽

Detection Accuracy ◽

Single Shot ◽

Feature Maps ◽

Embedded Devices

The high requirements for computing and memory are the biggest challenges in deploying existing object detection networks to embedded devices. Existing lightweight object detectors directly use lightweight neural network architectures such as MobileNet or ShuffleNet pre-trained on large-scale classification datasets, which results in poor network structure flexibility and is unsuitable for some specific scenarios. In this paper, we propose a lightweight object detection network, Single-Shot MultiBox Detector (SSD)7-Feature Fusion and Attention Mechanism (FFAM), which saves storage space and reduces the amount of calculation by reducing the number of convolutional layers. We offer a novel Feature Fusion and Attention Mechanism (FFAM) method to improve detection accuracy. First, the FFAM method fuses high-level, semantically rich feature maps with low-level feature maps to improve the detection accuracy for small objects. Second, a lightweight attention mechanism that cascades channel and spatial attention modules is employed to enhance the target's contextual information and guide the network to focus on its easy-to-recognize features. The SSD7-FFAM achieves 83.7% mean Average Precision (mAP), with 1.66 MB of parameters and an average running time of 0.033 s, on the NWPU VHR-10 dataset. The results indicate that the proposed SSD7-FFAM is more suitable for deployment to embedded devices for real-time object detection.

Panchromatic Image Super-Resolution Via Self Attention-Augmented Wasserstein Generative Adversarial Network

Sensors ◽

10.3390/s21062158 ◽

2021 ◽

Vol 21 (6) ◽

pp. 2158

Author(s):

Juan Du ◽

Kuanhong Cheng ◽

Yue Yu ◽

Dabao Wang ◽

Huixin Zhou

Keyword(s):

Large Scale ◽

Spatial Information ◽

Super Resolution ◽

Attention Mechanism ◽

Feature Representation ◽

Similarity Function ◽

Feature Maps ◽

Generative Adversarial Network ◽

Convolutional Network ◽

Adversarial Network

Panchromatic (PAN) images contain abundant spatial information that is useful for earth observation, but always suffer from low resolution (LR) due to sensor limitations and the large-scale field of view. Current super-resolution (SR) methods based on traditional attention mechanisms have shown remarkable advantages but remain imperfect at reconstructing the edge details of SR images. To address this problem, an improved SR model that involves the self-attention augmented Wasserstein generative adversarial network (SAA-WGAN) is designed to mine the reference information among multiple features for detail enhancement. We use an encoder-decoder network followed by a fully convolutional network (FCN) as the backbone to extract multi-scale features and reconstruct the high-resolution (HR) results. To exploit the relevance between multi-layer feature maps, we first integrate a convolutional block attention module (CBAM) into each skip-connection of the encoder-decoder subnet, generating weighted maps to enhance both channel-wise and spatial-wise feature representation automatically. Besides, considering that the HR results and LR inputs are highly similar in structure, and that this similarity cannot be fully reflected by traditional attention mechanisms, we design a self augmented attention (SAA) module, where the attention weights are produced dynamically via a similarity function between hidden features; this design allows the network to flexibly adjust the fraction relevance among multi-layer features and keep the long-range inter information, which is helpful to preserve details. In addition, the pixel-wise loss is combined with perceptual and gradient loss to achieve comprehensive supervision. Experiments on benchmark datasets demonstrate that the proposed method outperforms other SR methods in terms of both objective evaluation and visual effect.

Attention graph: Learning effective visual features for large-scale image classification

Journal of Algorithms & Computational Technology ◽

10.1177/17483026211065375 ◽

2022 ◽

Vol 16 ◽

pp. 174830262110653

Author(s):

Xuelian Cui ◽

Zhanjie Zhang ◽

Tao Zhang ◽

Zhuoqun Yang ◽

Jie Yang

Keyword(s):

Neural Network ◽

Image Classification ◽

Spatial Attention ◽

Network Model ◽

Large Scale ◽

Spatial Dimension ◽

Attention Mechanism ◽

Main Function ◽

Proposed Model ◽

Informative Part

In recent years, deep learning research has received extensive attention, and many breakthroughs have been made in various fields. On this basis, neural networks with attention mechanisms have become a research hotspot. In this paper, we address the image classification task by implementing channel and spatial attention mechanisms, which improve the expressive ability of the neural network model. Unlike previous studies, we propose an attention module consisting of a channel attention module (CAM) and a spatial attention module (SAM). The proposed module derives attention graphs from the channel dimension and the spatial dimension respectively, and the input features are then selectively learned according to their importance. Besides, this module is lightweight and can be easily integrated into image classification algorithms. In the experiment, we combine the deep residual network model with the attention module, and the experimental results show that the proposed method brings higher image classification accuracy. The channel attention module adds weights to the signals on different convolution channels to represent their correlation: the higher a channel's weight, the higher its correlation and the more attention it requires. The main function of spatial attention is to capture the most informative part of the local feature graph, which supplements channel attention. We evaluate the proposed module on ImageNet-1K and Cifar-100. Through a large number of comparative experiments, our proposed model achieved outstanding performance.
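
To make the channel-then-spatial weighting described above concrete, here is a minimal Python sketch. It is a parameter-free simplification and not the paper's CAM/SAM: channel weights come from a sigmoid of each channel's global average, and the spatial map from a sigmoid of the per-position channel mean, whereas the actual modules would use learned layers that the abstract does not specify. All function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(features):
    """features: list of C channels, each an H x W list of lists.
    Returns one weight per channel from its global average (assumed, no learned layers)."""
    weights = []
    for ch in features:
        n = len(ch) * len(ch[0])
        mean = sum(sum(row) for row in ch) / n
        weights.append(sigmoid(mean))
    return weights

def spatial_attention(features):
    """Returns an H x W map from the mean across channels at each position."""
    c, h, w = len(features), len(features[0]), len(features[0][0])
    return [[sigmoid(sum(features[k][i][j] for k in range(c)) / c)
             for j in range(w)] for i in range(h)]

def apply_attention(features):
    """Scale by channel weights first, then by the spatial map of the scaled features."""
    cw = channel_attention(features)
    scaled = [[[v * cw[k] for v in row] for row in ch]
              for k, ch in enumerate(features)]
    sm = spatial_attention(scaled)
    return [[[ch[i][j] * sm[i][j] for j in range(len(ch[0]))]
             for i in range(len(ch))] for ch in scaled]

# Two 2 x 2 channels: one with a strong response, one silent.
features = [[[4.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 0.0]]]
channel_weights = channel_attention(features)
refined = apply_attention(features)
```

In this toy input the active channel receives the larger channel weight, and the position holding the strong response receives the largest spatial weight, so the refined features emphasise that location.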

Deterioration Level Estimation Based on Convolutional Neural Network Using Confidence-Aware Attention Mechanism for Infrastructure Inspection

Sensors ◽

10.3390/s22010382 ◽

2022 ◽

Vol 22 (1) ◽

pp. 382

Author(s):

Naoki Ogawa ◽

Keisuke Maeda ◽

Takahiro Ogawa ◽

Miki Haseyama

Keyword(s):

Neural Network ◽

Neural Networks ◽

Convolutional Neural Network ◽

Performance Improvement ◽

Spatial Attention ◽

Convolutional Neural Networks ◽

Significant Contribution ◽

Attention Mechanism ◽

Feature Maps ◽

Infrastructure Inspection

This paper presents deterioration level estimation based on convolutional neural networks using a confidence-aware attention mechanism for infrastructure inspection. Spatial attention mechanisms try to highlight the important regions in feature maps for estimation by using an attention map. The attention mechanism using an effective attention map can improve feature maps. However, the conventional attention mechanisms have a problem as they fail to highlight important regions for estimation when an ineffective attention map is mistakenly used. To solve the above problem, this paper introduces the confidence-aware attention mechanism that reduces the effect of ineffective attention maps by considering the confidence corresponding to the attention map. The confidence is calculated from the entropy of the estimated class probabilities when generating the attention map. Because the proposed method can effectively utilize the attention map by considering the confidence, it can focus more on the important regions in the final estimation. This is the most significant contribution of this paper. The experimental results using images from actual infrastructure inspections confirm the performance improvement of the proposed method in estimating the deterioration level.
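
One way to picture a confidence computed from the entropy of class probabilities is the Python sketch below. The normalization `1 - H(p) / log(K)` and the blending rule in `temper_attention` are assumptions chosen for illustration; the abstract states only that the confidence is calculated from the entropy, not the exact formula or how it is applied to the attention map.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector; zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence(probs):
    """Assumed mapping: 1 for a one-hot prediction, 0 for a uniform one. Needs K >= 2 classes."""
    return 1.0 - entropy(probs) / math.log(len(probs))

def temper_attention(attn_map, conf):
    """Assumed blending: pull the attention map toward all-ones (no re-weighting)
    as the confidence drops, so an unreliable map has less effect."""
    return [[conf * a + (1.0 - conf) for a in row] for row in attn_map]

attn_map = [[0.2, 0.8]]
sure = confidence([1.0, 0.0, 0.0, 0.0])        # peaked prediction
unsure = confidence([0.25, 0.25, 0.25, 0.25])  # uniform prediction
kept = temper_attention(attn_map, sure)
flattened = temper_attention(attn_map, unsure)
```

With a peaked prediction the attention map passes through unchanged, while a uniform prediction flattens it to ones, which matches the stated goal of reducing the effect of ineffective attention maps.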

Panchromatic Image Super-Resolution via Self Attention-augmented WGAN

10.20944/preprints202012.0592.v1 ◽

2020 ◽

Author(s):

Juan Du ◽

Kuanhong Cheng ◽

Yue Yu ◽

Dabao Wang ◽

Huixin Zhou

Keyword(s):

Large Scale ◽

Spatial Information ◽

Super Resolution ◽

Objective Evaluation ◽

Attention Mechanism ◽

Feature Representation ◽

Similarity Function ◽

Feature Maps ◽

Convolutional Network ◽

Benchmark Datasets

Panchromatic (PAN) images contain abundant spatial information that is useful for earth observation, but always suffer from low resolution (LR) due to sensor limitations and the large-scale field of view. Current super-resolution (SR) methods based on traditional attention mechanisms have shown remarkable advantages but remain imperfect at reconstructing the edge details of SR images. To address this problem, an improved super-resolution model that involves the self-attention augmented WGAN is designed to mine the reference information among multiple features for detail enhancement. We use an encoder-decoder network followed by a fully convolutional network (FCN) as the backbone to extract multi-scale features and reconstruct the high-resolution (HR) results. To exploit the relevance between multi-layer feature maps, we first integrate a convolutional block attention module (CBAM) into each skip-connection of the encoder-decoder subnet, generating weighted maps to enhance both channel-wise and spatial-wise feature representation automatically. Besides, considering that the HR results and LR inputs are highly similar in structure, and that this similarity cannot be fully reflected by traditional attention mechanisms, we design a self augmented attention (SAA) module, where the attention weights are produced dynamically via a similarity function between hidden features; this design allows the network to flexibly adjust the fraction relevance among multi-layer features and keep the long-range inter information, which is helpful to preserve details. In addition, the pixel-wise loss is combined with perceptual and gradient loss to achieve comprehensive supervision. Experiments on benchmark datasets demonstrate that the proposed method outperforms other SR methods in terms of both objective evaluation and visual effect.

A Global-Local Blur Disentangling Network for Dynamic Scene Deblurring

Applied Sciences ◽

10.3390/app11052174 ◽

2021 ◽

Vol 11 (5) ◽

pp. 2174

Author(s):

Xiaoguang Li ◽

Feifan Yang ◽

Jianglu Huang ◽

Li Zhuo

Keyword(s):

Local Features ◽

Attention Mechanism ◽

Experimental Results ◽

Dynamic Scene ◽

Feature Maps ◽

Training Scheme ◽

Real Scene ◽

Global And Local

Images captured in a real scene usually suffer from complex non-uniform degradation, which includes both global and local blurs. It is difficult to handle the complex blur variances with a unified processing model. We propose a global-local blur disentangling network, which can effectively extract global and local blur features via two branches. A phased training scheme is designed to disentangle the global and local blur features; that is, each branch is trained with its own task-specific dataset. A branch attention mechanism is introduced to dynamically fuse global and local features. Complex blurry images are used to train the attention module and the reconstruction module. The visualized feature maps of the different branches indicate that our dual-branch network can decouple the global and local blur features efficiently. Experimental results show that the proposed dual-branch blur disentangling network can improve both the subjective and objective deblurring effects for real captured images.

Dual Attention on Pyramid Feature Maps for Image Captioning

IEEE Transactions on Multimedia ◽

10.1109/tmm.2021.3072479 ◽

2021 ◽

pp. 1-1

Author(s):

Litao Yu ◽

Jian Zhang ◽

Qiang Wu

Keyword(s):

Image Captioning ◽

Feature Maps

Building Damage Detection Using U-Net with Attention Mechanism from Pre- and Post-Disaster Remote Sensing Datasets

Remote Sensing ◽

10.3390/rs13050905 ◽

2021 ◽

Vol 13 (5) ◽

pp. 905

Author(s):

Chuyi Wu ◽

Feng Zhang ◽

Junshi Xia ◽

Yichen Xu ◽

Guoqing Li ◽

...

Keyword(s):

Damage Assessment ◽

Large Scale ◽

Binary Classification ◽

Open Data ◽

Building Damage ◽

Attention Mechanism ◽

Large Scale Dataset ◽

Data Program ◽

The Impact ◽

Post Disaster

Building damage status is vital for planning rescue and reconstruction after a disaster, yet the damage and its level are hard to detect and judge. Most existing studies focus on binary classification, and the attention of the model is distracted. In this study, we propose a Siamese neural network that can localize and classify damaged buildings simultaneously. The main parts of this network are a variety of attention U-Nets using different backbones. The attention mechanism enables the network to pay more attention to the effective features and channels, thereby reducing the impact of useless features. We train the networks on the xBD dataset, a large-scale dataset for the advancement of building damage assessment, and compare their balanced F (F1) scores. SEresNeXt with an attention mechanism performs best, with an F1 score of 0.787. To improve the accuracy, we fused the results and obtained the best overall F1 score of 0.792. To verify the transferability and robustness of the model, we selected data on two recent disasters from the Maxar Open Data Program to investigate the performance. Visual comparison of the results shows that our model is robust and transferable.
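
For readers unfamiliar with the metric, the balanced F (F1) score is the harmonic mean of precision and recall. The short Python sketch below computes it from raw counts; the function name `f1_score` and the example counts are illustrative assumptions and are unrelated to the figures reported in the abstract.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * P * R / (P + R), with P = tp / (tp + fp) and R = tp / (tp + fn)."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only: 8 correct detections, 2 false alarms, 2 misses.
example = f1_score(tp=8, fp=2, fn=2)
```

Because it is a harmonic mean, F1 stays low whenever either precision or recall is low, which makes it a stricter summary than plain accuracy on imbalanced damage classes.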

U2-ONet: A Two-Level Nested Octave U-Structure Network with a Multi-Scale Attention Mechanism for Moving Object Segmentation

Remote Sensing ◽

10.3390/rs13010060 ◽

2020 ◽

Vol 13 (1) ◽

pp. 60

Author(s):

Chenjie Wang ◽

Chengyuan Li ◽

Jun Liu ◽

Bin Luo ◽

Xin Su ◽

...

Keyword(s):

Moving Objects ◽

Object Segmentation ◽

Contextual Information ◽

Attention Mechanism ◽

Moving Object ◽

Feature Maps ◽

Moving Object Segmentation ◽

Practical Applications ◽

Multi Scale ◽

Spatial Redundancy

Most scenes in practical applications are dynamic scenes containing moving objects, so accurately segmenting moving objects is crucial for many computer vision applications. In order to efficiently segment all the moving objects in the scene, regardless of whether the object has a predefined semantic label, we propose a two-level nested octave U-structure network with a multi-scale attention mechanism, called U2-ONet. U2-ONet takes two RGB frames, the optical flow between these frames, and the instance segmentation of the frames as inputs. Each stage of U2-ONet is filled with the newly designed octave residual U-block (ORSU block) to enhance the ability to obtain more contextual information at different scales while reducing the spatial redundancy of the feature maps. In order to efficiently train the multi-scale deep network, we introduce a hierarchical training supervision strategy that calculates the loss at each level while adding knowledge-matching loss to keep the optimization consistent. The experimental results show that the proposed U2-ONet method can achieve a state-of-the-art performance in several general moving object segmentation datasets.
