Vision Transformers for Remote Sensing Image Classification

2021 ◽  
Vol 13 (3) ◽  
pp. 516
Author(s):  
Yakoub Bazi ◽  
Laila Bashmal ◽  
Mohamad M. Al Rahhal ◽  
Reham Al Dayil ◽  
Naif Al Ajlan

In this paper, we propose a remote-sensing scene-classification method based on vision transformers. These networks, now recognized as state-of-the-art models in natural language processing, do not rely on convolution layers as standard convolutional neural networks (CNNs) do. Instead, they use multihead attention mechanisms as the main building block to derive long-range contextual relations between pixels in images. In a first step, the images under analysis are divided into patches, which are then flattened and embedded to form a sequence. To retain positional information, position embeddings are added to these patch embeddings. The resulting sequence is then fed to several multihead attention layers to generate the final representation. At the classification stage, the first token of the sequence is fed to a softmax classification layer. To boost the classification performance, we explore several data augmentation strategies to generate additional training data. Moreover, we show experimentally that the network can be compressed by pruning half of its layers while keeping competitive classification accuracy. Experimental results conducted on different remote-sensing image datasets demonstrate the promising capability of the model compared to state-of-the-art methods. Specifically, the Vision Transformer obtains average classification accuracies of 98.49%, 95.86%, 95.56% and 93.83% on the Merced, AID, Optimal31 and NWPU datasets, respectively, while the compressed version obtained by removing half of the multihead attention layers yields 97.90%, 94.27%, 95.30% and 93.05%, respectively.
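
A minimal PyTorch sketch of the pipeline this abstract describes: patch splitting, flattening and linear embedding, added position embeddings, a stack of multihead attention layers, and classification on the first (class) token. All sizes here (patch 16, width 256, 4 layers, 21 classes as in Merced) are illustrative assumptions, not the paper's configuration; the compressed variant would simply halve `depth`.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    # Illustrative sizes; the paper's actual configuration may differ.
    def __init__(self, img_size=224, patch=16, dim=256, depth=4,
                 heads=8, num_classes=21):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Patch splitting + flattening + linear embedding in one strided conv.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable position embeddings keep information about patch order.
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)  # softmax via cross-entropy

    def forward(self, x):
        x = self.embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos
        x = self.encoder(x)                            # multihead attention layers
        return self.head(x[:, 0])                      # first (class) token only

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> (2, 21)
```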

2021 ◽  
Vol 13 (10) ◽  
pp. 1950
Author(s):  
Cuiping Shi ◽  
Xin Zhao ◽  
Liguo Wang

In recent years, with the rapid development of computer vision, increasing attention has been paid to remote sensing image scene classification. To improve classification performance, many studies have increased the depth and width of convolutional neural networks (CNNs) to extract deeper features, thereby increasing the complexity of the model. To solve this problem, we propose a lightweight convolutional neural network based on attention-oriented multi-branch feature fusion (AMB-CNN) for remote sensing image scene classification. First, we propose two convolution combination modules for feature extraction, through which the deep features of images can be fully extracted by the cooperation of multiple convolutions. Then, feature weights are calculated, and the extracted deep features are passed to an attention mechanism for further feature extraction. Next, all of the extracted features are fused across multiple branches. Finally, depthwise separable convolution and asymmetric convolution are employed to greatly reduce the number of parameters. The experimental results show that, compared with some state-of-the-art methods, the proposed method retains a clear advantage in classification accuracy while using very few parameters.
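
The parameter savings attributed to depthwise separable and asymmetric convolutions can be made concrete with a small PyTorch comparison; the channel and kernel sizes below are illustrative, not AMB-CNN's actual configuration.

```python
import torch.nn as nn

c, k = 64, 3  # illustrative channel count and kernel size

# Standard 3x3 convolution: c*c*k*k weights (plus biases).
standard = nn.Conv2d(c, c, k, padding=1)

# Depthwise separable: per-channel spatial conv, then 1x1 pointwise conv
# -> roughly c*k*k + c*c weights.
separable = nn.Sequential(
    nn.Conv2d(c, c, k, padding=1, groups=c),  # depthwise
    nn.Conv2d(c, c, 1),                       # pointwise
)

# Asymmetric: factor the kxk kernel into kx1 and 1xk -> roughly 2*c*c*k weights.
asymmetric = nn.Sequential(
    nn.Conv2d(c, c, (k, 1), padding=(1, 0)),
    nn.Conv2d(c, c, (1, k), padding=(0, 1)),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable), count(asymmetric))
# -> 36928 4800 24704: same receptive field, far fewer parameters
```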


2021 ◽  
Vol 13 (4) ◽  
pp. 569
Author(s):  
Kunlun Qi ◽  
Chao Yang ◽  
Chuli Hu ◽  
Yonglin Shen ◽  
Shengyu Shen ◽  
...  

Deep convolutional neural networks (DCNNs) have shown significant improvements in remote sensing image scene classification thanks to their powerful feature representations. However, because of the high variance and limited volume of the available remote sensing datasets, DCNNs are prone to overfitting the data used for training. To address this problem, this paper proposes a novel scene classification framework based on a deep Siamese convolutional network with rotation invariance regularization. Specifically, we design a data augmentation strategy for the Siamese model to learn a rotation-invariant DCNN, achieved by directly enforcing the representations of training samples before and after rotation to be mapped close to each other. In addition to the cross-entropy cost function of traditional CNN models, we impose a rotation invariance regularization constraint on the objective function of our proposed model. Experimental results obtained using three publicly available scene classification datasets show that the proposed method generally improves classification performance by 2-3% and achieves satisfactory performance compared with some state-of-the-art methods.
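
A hedged sketch of such an objective: standard cross-entropy on both Siamese branches plus a penalty that pulls together the representations of an image and its rotated copy. The model interface, the single 90-degree rotation, and the weight `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def siamese_loss(model, images, labels, lam=0.1):
    """Cross-entropy plus a rotation-invariance regularizer (a sketch;
    `lam` and the rotation choice are illustrative assumptions)."""
    # Shared-weight Siamese branches: the original batch and a rotated copy.
    rotated = torch.rot90(images, k=1, dims=(2, 3))
    feats, logits = model(images)       # assumed to return (features, logits)
    feats_r, logits_r = model(rotated)
    ce = F.cross_entropy(logits, labels) + F.cross_entropy(logits_r, labels)
    # Regularizer: representations before/after rotation mapped close together.
    invariance = F.mse_loss(feats, feats_r)
    return ce + lam * invariance

class TinyNet(torch.nn.Module):  # stand-in for the DCNN backbone
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())
        self.fc = torch.nn.Linear(16, num_classes)
    def forward(self, x):
        f = self.backbone(x)
        return f, self.fc(f)

loss = siamese_loss(TinyNet(), torch.randn(4, 3, 64, 64),
                    torch.randint(0, 10, (4,)))
```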


Author(s):  
Sumit Kaur

Abstract- Deep learning is an emerging research area in the machine learning and pattern recognition field, presented with the goal of moving machine learning closer to one of its original objectives: artificial intelligence. It tries to mimic the human brain, which is capable of processing and learning from complex input data and of solving many kinds of complicated tasks well. Deep learning (DL) is based on a set of supervised and unsupervised algorithms that attempt to model high-level abstractions in data and to learn hierarchical representations for classification. In recent years it has attracted much attention due to its state-of-the-art performance in diverse areas such as object perception, speech recognition, computer vision, collaborative filtering and natural language processing. This paper presents a survey of different deep learning techniques for remote sensing image classification.


Sensors ◽  
2020 ◽  
Vol 20 (7) ◽  
pp. 1999 ◽  
Author(s):  
Donghang Yu ◽  
Qing Xu ◽  
Haitao Guo ◽  
Chuan Zhao ◽  
Yuzhun Lin ◽  
...  

Classifying remote sensing images is vital for interpreting image content. Present remote sensing image scene classification methods using convolutional neural networks have drawbacks, including excessive parameters and heavy computation costs. More efficient and lightweight CNNs have fewer parameters and computations, but their classification performance is generally weaker. We propose a more efficient and lightweight convolutional neural network method to improve classification accuracy with a small training dataset. Inspired by fine-grained visual recognition, this study introduces a bilinear convolutional neural network model for scene classification. First, the lightweight convolutional neural network MobileNetV2 is used to extract deep, abstract image features. The extracted feature map is then transformed into two feature maps by two different convolutional layers. The transformed features are combined by a Hadamard product to obtain an enhanced bilinear feature. Finally, the bilinear feature, after pooling and normalization, is used for classification. Experiments are performed on three widely used datasets: UC Merced, AID, and NWPU-RESISC45. Compared with other state-of-the-art methods, the proposed method has fewer parameters and computations while achieving higher accuracy. By including feature fusion with bilinear pooling, performance and accuracy for remote sensing scene classification can be greatly improved, and the approach can be applied to any remote sensing image classification task.
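
A minimal PyTorch sketch of the described bilinear pipeline, assuming a recent torchvision (`weights=None`). The 512-dimensional projections and the signed-square-root normalization are common bilinear-pooling choices assumed here, not details confirmed by the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class BilinearMobileNet(nn.Module):
    # Projection width and classifier size are illustrative assumptions.
    def __init__(self, num_classes=45, dim=512):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features  # 1280-channel output
        # Two different convolutional layers give two views of the same features.
        self.proj_a = nn.Conv2d(1280, dim, 1)
        self.proj_b = nn.Conv2d(1280, dim, 1)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):
        f = self.backbone(x)
        bilinear = self.proj_a(f) * self.proj_b(f)  # Hadamard product
        z = bilinear.flatten(2).mean(-1)            # global average pooling
        # Signed square-root + L2 normalization, as is usual for bilinear features.
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)
        z = F.normalize(z)
        return self.fc(z)

logits = BilinearMobileNet()(torch.randn(2, 3, 224, 224))  # -> (2, 45)
```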


2019 ◽  
Vol 11 (18) ◽  
pp. 2153
Author(s):  
Zhiyong Lv ◽  
Guangfei Li ◽  
Yixiang Chen ◽  
Jón Atli Benediktsson

Filtering is a well-known tool for noise reduction in very high spatial resolution (VHR) remote sensing images. However, a single-scale filter usually shows limitations in covering the various targets with different sizes and shapes in a given image scene. A novel method called the multi-scale filter profile (MFP)-based framework (MFPF) is introduced in this study to address this problem and improve the classification performance for VHR remote sensing images. First, an adaptive filter is extended with a series of parameters to construct the MFPs. Then, a layer-stacking technique is used to concatenate the MFPs and all the features into a stacked vector. Afterward, principal component analysis, a classical dimensionality reduction algorithm, is performed on the fused profiles to reduce the redundancy of the stacked vector. Finally, the spatially adaptive region of each filter in the MFPs is used for post-processing the initial classification map obtained from a supervised classifier, revising it to produce the final classification map. Experimental results on three real VHR remote sensing images demonstrate the effectiveness of the proposed MFPF in comparison with state-of-the-art methods. Moreover, the approach requires no hard parameter tuning and can thus be conveniently applied in real applications.
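
A rough sketch of the profile-construction and stacking steps, with a plain Gaussian filter standing in for the paper's adaptive filter; the scale set and component count are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.decomposition import PCA

def filter_profile_features(image, scales=(1, 2, 4, 8), n_components=8):
    """Sketch of the MFP idea: filter every band at several scales,
    layer-stack the profiles, then reduce redundancy with PCA."""
    h, w, bands = image.shape
    profiles = [image]  # keep the original spectral bands
    for s in scales:    # one filtered copy of every band per scale
        profiles.append(np.stack(
            [gaussian_filter(image[..., b], sigma=s) for b in range(bands)],
            axis=-1))
    stacked = np.concatenate(profiles, axis=-1)    # layer stacking
    flat = stacked.reshape(-1, stacked.shape[-1])  # one row per pixel
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)     # de-redundant feature cube

feats = filter_profile_features(np.random.rand(64, 64, 4))  # -> (64, 64, 8)
```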


Mathematics ◽  
2021 ◽  
Vol 9 (22) ◽  
pp. 2984
Author(s):  
Gyanendra Prasad Joshi ◽  
Fayadh Alenezi ◽  
Gopalakrishnan Thirumoorthy ◽  
Ashit Kumar Dutta ◽  
Jinsang You

Recently, unmanned aerial vehicles (UAVs) have been used in several applications of environmental modeling and land use inventories. At the same time, computer vision-based remote sensing image classification models are needed to monitor modifications over time, such as vegetation, inland water, bare soil or human infrastructure, regardless of spectral, spatial, temporal, and radiometric resolutions. In this context, this paper proposes an ensemble of DL-based multimodal land cover classification (EDL-MMLCC) models using remote sensing images. The EDL-MMLCC technique aims to classify remote sensing images into different cloud, shade, and land cover classes. First, median filtering-based preprocessing and data augmentation take place. Then, an ensemble of DL models, namely VGG-19, Capsule Network (CapsNet), and MobileNet, is used for feature extraction, and the training process of the DL models is enhanced with the hosted cuckoo optimization (HCO) algorithm. Finally, the salp swarm algorithm (SSA) with a regularized extreme learning machine (RELM) classifier is applied for land cover classification. The use of the HCO algorithm for hyperparameter optimization and of the SSA for parameter tuning of the RELM model helps to raise the classification outcome considerably. The proposed EDL-MMLCC technique is tested on an Amazon dataset from the Kaggle repository, and the experimental results point to its promising performance over recent state-of-the-art approaches.
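
As a sketch of the final classification stage, a generic regularized extreme learning machine is shown below: a random, untrained hidden layer followed by closed-form ridge-regularized output weights. In the paper the inputs would be the ensemble's deep features and the parameters would be tuned by the SSA; here `n_hidden` and `reg` are illustrative assumptions.

```python
import numpy as np

class RELM:
    """Generic regularized extreme learning machine (a sketch, not the
    paper's exact SSA-tuned configuration)."""
    def __init__(self, n_hidden=256, reg=1.0, seed=0):
        self.n_hidden, self.reg = n_hidden, reg
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)  # random, untrained hidden layer

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        T = np.eye(y.max() + 1)[y]  # one-hot targets
        # Closed-form ridge solution: beta = (H^T H + I/reg)^(-1) H^T T
        self.beta = np.linalg.solve(
            H.T @ H + np.eye(self.n_hidden) / self.reg, H.T @ T)
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta).argmax(axis=1)

X, y = np.random.rand(100, 32), np.random.randint(0, 5, 100)  # stand-in features
print(RELM().fit(X, y).predict(X[:5]))
```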


2021 ◽  
Vol 87 (8) ◽  
pp. 577-591
Author(s):  
Fengpeng Li ◽  
Jiabao Li ◽  
Wei Han ◽  
Ruyi Feng ◽  
Lizhe Wang

Inspired by the outstanding achievements of deep learning, supervised deep representation methods for high-spatial-resolution remote sensing image scene classification have obtained state-of-the-art performance. However, supervised methods need a considerable amount of labeled data to capture class-specific features, which limits their application when only a few labeled training samples are available. An unsupervised deep representation method for high-resolution remote sensing image scene classification is proposed in this work to address this issue. The proposed method, based on contrastive learning, narrows the distance between positive view pairs (color channels belonging to the same image) and widens the gap between negative view pairs (color channels from different images) to obtain class-specific representations of the input data without any supervised information. The classifier uses the features extracted by the convolutional neural network (CNN)-based feature extractor, together with the label information of the training data, to model the space of each category, and then makes predictions on the test data using linear regression. Compared with existing unsupervised deep representation methods for high-resolution remote sensing image scene classification, the contrastive learning CNN achieves state-of-the-art performance on three benchmark data sets of different scales: the small-scale RSSCN7 data set, the midscale Aerial Image data set (AID), and the large-scale NWPU-RESISC45 data set.
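
A hedged sketch of the described contrastive objective, in the InfoNCE style: two color-channel views of the same image form a positive pair, while channels from other images in the batch serve as negatives. The channel choice, temperature `tau`, and the toy encoder are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def channel_contrastive_loss(encoder, images, tau=0.5):
    """InfoNCE-style loss over color-channel views (a sketch)."""
    # Two views per image built from different color channels.
    v1 = images[:, 0:1].expand(-1, 3, -1, -1)  # red-channel view
    v2 = images[:, 1:2].expand(-1, 3, -1, -1)  # green-channel view
    z = F.normalize(torch.cat([encoder(v1), encoder(v2)]), dim=1)
    sim = z @ z.t() / tau                      # scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))          # a view is not its own pair
    n = images.size(0)
    # The positive for each view is the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

enc = torch.nn.Sequential(                     # stand-in CNN feature extractor
    torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())
loss = channel_contrastive_loss(enc, torch.rand(8, 3, 64, 64))
```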


2017 ◽  
Vol 105 (10) ◽  
pp. 1865-1883 ◽  
Author(s):  
Gong Cheng ◽  
Junwei Han ◽  
Xiaoqiang Lu

2021 ◽  
Vol 13 (16) ◽  
pp. 3113
Author(s):  
Ming Li ◽  
Lin Lei ◽  
Yuqi Tang ◽  
Yuli Sun ◽  
Gangyao Kuang

Remote sensing image scene classification (RSISC) has broad application prospects, but related challenges still exist and urgently need to be addressed. One of the most important is how to learn a strongly discriminative scene representation. Recently, convolutional neural networks (CNNs) have shown great potential in RSISC due to their powerful feature learning ability; however, their performance may be restricted by the complexity of remote sensing images, such as spatial layout, varying scales, complex backgrounds, and category diversity. In this paper, we propose an attention-guided multilayer feature aggregation network (AGMFA-Net) that improves scene classification performance by effectively aggregating features from different layers. Specifically, to reduce the discrepancies between layers, we employ channel-spatial attention on multiple high-level convolutional feature maps to more accurately capture the semantic regions that correspond to the content of the given scene. We then use the learned semantic regions as guidance to aggregate valuable information from multilayer convolutional features, yielding stronger scene features for classification. Experimental results on three remote sensing scene datasets indicate that our approach achieves competitive classification performance compared with the baselines and other state-of-the-art methods.
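
A rough sketch of the idea, using a CBAM-style channel-then-spatial attention block as a stand-in for the paper's channel-spatial attention; the resulting map weights and pools features from earlier layers, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSpatialAttention(nn.Module):
    """Channel attention followed by spatial attention (a generic sketch)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))))[..., None, None]
        x = x * ca                                   # channel attention
        s = torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True).values], dim=1)
        return torch.sigmoid(self.spatial(s))        # spatial map (B, 1, H, W)

def aggregate(high, lows, attn):
    """Use the attention map from the highest layer as guidance to weight
    and pool features from earlier layers."""
    mask = attn(high)
    pooled = [(F.interpolate(mask, f.shape[2:]) * f).mean(dim=(2, 3))
              for f in [high] + lows]
    return torch.cat(pooled, dim=1)                  # aggregated scene descriptor

attn = ChannelSpatialAttention(64)
desc = aggregate(torch.randn(2, 64, 8, 8), [torch.randn(2, 32, 16, 16)], attn)
```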


2021 ◽  
Vol 13 (13) ◽  
pp. 2457
Author(s):  
Xuan Wu ◽  
Zhijie Zhang ◽  
Wanchang Zhang ◽  
Yaning Yi ◽  
Chuanrong Zhang ◽  
...  

Convolutional neural networks (CNNs) are capable of automatically extracting image features and have been widely used in remote sensing image classification. Feature extraction remains an important and difficult problem in current research. In this paper, data augmentation was used to avoid overfitting and to enrich the features of samples, improving the performance of a newly proposed convolutional neural network on the UC-Merced and RSI-CB datasets for remotely sensed scene classification. We propose a multiple grouped convolutional neural network (MGCNN) for self-learning that improves the efficiency of a CNN, with the grouping of multiple convolutional layers developed as a plug-in model that can be applied elsewhere. A hyper-parameter C in the MGCNN is introduced to probe the influence of different grouping strategies on feature extraction. Experiments on the two selected datasets, RSI-CB and UC-Merced, verify the effectiveness of the proposed network: the accuracy obtained by the MGCNN was 2% higher than that of ResNet-50. An attention mechanism was then adopted and incorporated into the grouping process to construct a multiple grouped attention convolutional neural network (MGCNN-A) that enhances the generalization capability of the MGCNN. Additional experiments indicate that incorporating the attention mechanism into the MGCNN slightly improved scene classification accuracy, while considerably enhancing the robustness of the proposed network in remote sensing image classification.
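
A minimal sketch of such a plug-in grouped-convolution block, where the hyper-parameter C sets the number of groups the convolution is split into; the block layout below is an assumption for illustration, not MGCNN's actual module.

```python
import torch
import torch.nn as nn

def grouped_block(channels, C):
    """Plug-in grouped-convolution block: C controls the grouping strategy."""
    assert channels % C == 0, "channels must divide evenly into C groups"
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, groups=C),  # C parallel groups
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, 1),  # 1x1 conv mixes the groups back
    )

x = torch.randn(2, 64, 32, 32)
for C in (1, 2, 4, 8):  # different grouping strategies, same output shape
    block = grouped_block(64, C)
    params = sum(p.numel() for p in block.parameters())
    print(f"C={C}: {params} parameters, output {tuple(block(x).shape)}")
```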

