scholarly journals HROM: Learning High-Resolution Representation and Object-Aware Masks for Visual Object Tracking

Sensors ◽  
2020 ◽  
Vol 20 (17) ◽  
pp. 4807
Author(s):  
Dawei Zhang ◽  
Zhonglong Zheng ◽  
Tianxiang Wang ◽  
Yiran He

Siamese network-based trackers consider tracking as features cross-correlation between the target template and the search region. Therefore, feature representation plays an important role for constructing a high-performance tracker. However, all existing Siamese networks extract the deep but low-resolution features of the entire patch, which is not robust enough to estimate the target bounding box accurately. In this work, to address this issue, we propose a novel high-resolution Siamese network, which connects the high-to-low resolution convolution streams in parallel as well as repeatedly exchanges the information across resolutions to maintain high-resolution representations. The resulting representation is semantically richer and spatially more precise by a simple yet effective multi-scale feature fusion strategy. Moreover, we exploit attention mechanisms to learn object-aware masks for adaptive feature refinement, and use deformable convolution to handle complex geometric transformations. This makes the target more discriminative against distractors and background. Without bells and whistles, extensive experiments on popular tracking benchmarks containing OTB100, UAV123, VOT2018 and LaSOT demonstrate that the proposed tracker achieves state-of-the-art performance and runs in real time, confirming its efficiency and effectiveness.

2021 ◽  
Author(s):  
Huan Zhang ◽  
Zhao Zhang ◽  
Haijun Zhang ◽  
Yi Yang ◽  
Shuicheng Yan ◽  
...  

<div>Deep learning based image inpainting methods have improved the performance greatly due to powerful representation ability of deep learning. However, current deep inpainting methods still tend to produce unreasonable structure and blurry texture, implying that image inpainting is still a challenging topic due to the ill-posed property of the task. To address these issues, we propose a novel deep multi-resolution learning-based progressive image inpainting method, termed MR-InpaintNet, which takes the damaged images of different resolutions as input and then fuses the multi-resolution features for repairing the damaged images. The idea is motivated by the fact that images of different resolutions can provide different levels of feature information. Specifically, the low-resolution image provides strong semantic information and the high-resolution image offers detailed texture information. The middle-resolution image can be used to reduce the gap between low-resolution and high-resolution images, which can further refine the inpainting result. To fuse and improve the multi-resolution features, a novel multi-resolution feature learning (MRFL) process is designed, which is consisted of a multi-resolution feature fusion (MRFF) module, an adaptive feature enhancement (AFE) module and a memory enhanced mechanism (MEM) module for information preservation. Then, the refined multi-resolution features contain both rich semantic information and detailed texture information from multiple resolutions. We further handle the refined multiresolution features by the decoder to obtain the recovered image. Extensive experiments on the Paris Street View, Places2 and CelebA-HQ datasets demonstrate that our proposed MRInpaintNet can effectively recover the textures and structures, and performs favorably against state-of-the-art methods.</div>


Sensors ◽  
2020 ◽  
Vol 20 (14) ◽  
pp. 4021 ◽  
Author(s):  
Mustansar Fiaz ◽  
Arif Mahmood ◽  
Soon Ki Jung

We propose to improve the visual object tracking by introducing a soft mask based low-level feature fusion technique. The proposed technique is further strengthened by integrating channel and spatial attention mechanisms. The proposed approach is integrated within a Siamese framework to demonstrate its effectiveness for visual object tracking. The proposed soft mask is used to give more importance to the target regions as compared to the other regions to enable effective target feature representation and to increase discriminative power. The low-level feature fusion improves the tracker robustness against distractors. The channel attention is used to identify more discriminative channels for better target representation. The spatial attention complements the soft mask based approach to better localize the target objects in challenging tracking scenarios. We evaluated our proposed approach over five publicly available benchmark datasets and performed extensive comparisons with 39 state-of-the-art tracking algorithms. The proposed tracker demonstrates excellent performance compared to the existing state-of-the-art trackers.


2018 ◽  
Vol 2018 ◽  
pp. 1-13 ◽  
Author(s):  
Yunlong Yu ◽  
Fuxian Liu

One of the challenging problems in understanding high-resolution remote sensing images is aerial scene classification. A well-designed feature representation method and classifier can improve classification accuracy. In this paper, we construct a new two-stream deep architecture for aerial scene classification. First, we use two pretrained convolutional neural networks (CNNs) as feature extractor to learn deep features from the original aerial image and the processed aerial image through saliency detection, respectively. Second, two feature fusion strategies are adopted to fuse the two different types of deep convolutional features extracted by the original RGB stream and the saliency stream. Finally, we use the extreme learning machine (ELM) classifier for final classification with the fused features. The effectiveness of the proposed architecture is tested on four challenging datasets: UC-Merced dataset with 21 scene categories, WHU-RS dataset with 19 scene categories, AID dataset with 30 scene categories, and NWPU-RESISC45 dataset with 45 challenging scene categories. The experimental results demonstrate that our architecture gets a significant classification accuracy improvement over all state-of-the-art references.


Author(s):  
Guoqing Zhang ◽  
Yuhao Chen ◽  
Weisi Lin ◽  
Arun Chandran ◽  
Xuan Jing

As a prevailing task in video surveillance and forensics field, person re-identification (re-ID) aims to match person images captured from non-overlapped cameras. In unconstrained scenarios, person images often suffer from the resolution mismatch problem, i.e., Cross-Resolution Person Re-ID. To overcome this problem, most existing methods restore low resolution (LR) images to high resolution (HR) by super-resolution (SR). However, they only focus on the HR feature extraction and ignore the valid information from original LR images. In this work, we explore the influence of resolutions on feature extraction and develop a novel method for cross-resolution person re-ID called Multi-Resolution Representations Joint Learning (MRJL). Our method consists of a Resolution Reconstruction Network (RRN) and a Dual Feature Fusion Network (DFFN). The RRN uses an input image to construct a HR version and a LR version with an encoder and two decoders, while the DFFN adopts a dual-branch structure to generate person representations from multi-resolution images. Comprehensive experiments on five benchmarks verify the superiority of the proposed MRJL over the relevent state-of-the-art methods.


Author(s):  
Zheng Wang ◽  
Mang Ye ◽  
Fan Yang ◽  
Xiang Bai ◽  
Shin'ichi Satoh

Person re-identification (REID) is an important task in video surveillance and forensics applications. Most of previous approaches are based on a key assumption that all person images have uniform and sufficiently high resolutions. Actually, various low-resolutions and scale mismatching always exist in open world REID. We name this kind of problem as Scale-Adaptive Low Resolution Person Re-identification (SALR-REID). The most intuitive way to address this problem is to increase various low-resolutions (not only low, but also with different scales) to a uniform high-resolution. SR-GAN is one of the most competitive image super-resolution deep networks, designed with a fixed upscaling factor. However, it is still not suitable for SALR-REID task, which requires a network not only synthesizing high-resolution images with different upscaling factors, but also extracting discriminative image feature for judging person’s identity. (1) To promote the ability of scale-adaptive upscaling, we cascade multiple SRGANs in series. (2) To supplement the ability of image feature representation, we plug-in a reidentification network. With a unified formulation, a Cascaded Super-Resolution GAN (CSR-GAN) framework is proposed. Extensive evaluations on two simulated datasets and one public dataset demonstrate the advantages of our method over related state-of-the-art methods.


2016 ◽  
Vol 2016 ◽  
pp. 1-12
Author(s):  
Jino Hans William ◽  
N. Venkateswaran ◽  
Srinath Narayanan ◽  
Sandeep Ramachandran

A selfie is typically a self-portrait captured using the front camera of a smartphone. Most state-of-the-art smartphones are equipped with a high-resolution (HR) rear camera and a low-resolution (LR) front camera. As selfies are captured by front camera with limited pixel resolution, the fine details in it are explicitly missed. This paper aims to improve the resolution of selfies by exploiting the fine details in HR images captured by rear camera using an example-based super-resolution (SR) algorithm. HR images captured by rear camera carry significant fine details and are used as an exemplar to train an optimal matrix-value regression (MVR) operator. The MVR operator serves as an image-pair priori which learns the correspondence between the LR-HR patch-pairs and is effectively used to super-resolve LR selfie images. The proposed MVR algorithm avoids vectorization of image patch-pairs and preserves image-level information during both learning and recovering process. The proposed algorithm is evaluated for its efficiency and effectiveness both qualitatively and quantitatively with other state-of-the-art SR algorithms. The results validate that the proposed algorithm is efficient as it requires less than 3 seconds to super-resolve LR selfie and is effective as it preserves sharp details without introducing any counterfeit fine details.


2021 ◽  
Author(s):  
Florian Fregien ◽  
Sebastian Pasewaldt ◽  
Jürgen Döllner ◽  
Matthias Trapp

With the ongoing improvement of digital cameras and smartphones, more and more people can acquire high- resolution digital images. Due to their size and high performance requirements, such Gigapixel Images (GPIs) are often challenging to process and explore compared to conventional low resolution images. To address this problem, this paper presents a service-based approach for GPI processing in a device-independent way using cloud-based processing. For it, the concept, design, and implementation of GPI processing functionality into service-based architectures is presented and evaluated with respect to advantages, limitations, and runtime performance.


2021 ◽  
Vol 13 (10) ◽  
pp. 1912
Author(s):  
Zhili Zhang ◽  
Meng Lu ◽  
Shunping Ji ◽  
Huafen Yu ◽  
Chenhui Nie

Extracting water-bodies accurately is a great challenge from very high resolution (VHR) remote sensing imagery. The boundaries of a water body are commonly hard to identify due to the complex spectral mixtures caused by aquatic vegetation, distinct lake/river colors, silts near the bank, shadows from the surrounding tall plants, and so on. The diversity and semantic information of features need to be increased for a better extraction of water-bodies from VHR remote sensing images. In this paper, we address these problems by designing a novel multi-feature extraction and combination module. This module consists of three feature extraction sub-modules based on spatial and channel correlations in feature maps at each scale, which extract the complete target information from the local space, larger space, and between-channel relationship to achieve a rich feature representation. Simultaneously, to better predict the fine contours of water-bodies, we adopt a multi-scale prediction fusion module. Besides, to solve the semantic inconsistency of feature fusion between the encoding stage and the decoding stage, we apply an encoder-decoder semantic feature fusion module to promote fusion effects. We carry out extensive experiments in VHR aerial and satellite imagery respectively. The result shows that our method achieves state-of-the-art segmentation performance, surpassing the classic and recent methods. Moreover, our proposed method is robust in challenging water-body extraction scenarios.


2021 ◽  
Author(s):  
Huan Zhang ◽  
Zhao Zhang ◽  
Haijun Zhang ◽  
Yi Yang ◽  
Shuicheng Yan ◽  
...  

<div>Deep learning based image inpainting methods have improved the performance greatly due to powerful representation ability of deep learning. However, current deep inpainting methods still tend to produce unreasonable structure and blurry texture, implying that image inpainting is still a challenging topic due to the ill-posed property of the task. To address these issues, we propose a novel deep multi-resolution learning-based progressive image inpainting method, termed MR-InpaintNet, which takes the damaged images of different resolutions as input and then fuses the multi-resolution features for repairing the damaged images. The idea is motivated by the fact that images of different resolutions can provide different levels of feature information. Specifically, the low-resolution image provides strong semantic information and the high-resolution image offers detailed texture information. The middle-resolution image can be used to reduce the gap between low-resolution and high-resolution images, which can further refine the inpainting result. To fuse and improve the multi-resolution features, a novel multi-resolution feature learning (MRFL) process is designed, which is consisted of a multi-resolution feature fusion (MRFF) module, an adaptive feature enhancement (AFE) module and a memory enhanced mechanism (MEM) module for information preservation. Then, the refined multi-resolution features contain both rich semantic information and detailed texture information from multiple resolutions. We further handle the refined multiresolution features by the decoder to obtain the recovered image. Extensive experiments on the Paris Street View, Places2 and CelebA-HQ datasets demonstrate that our proposed MRInpaintNet can effectively recover the textures and structures, and performs favorably against state-of-the-art methods.</div>


Author(s):  
Yangliu Kuai ◽  
Dongdong Li ◽  
Que Qian

Visual object tracking works as a key component for many instrumentation and measurement applications such as UAV systems, optical tracking and measuring systems. This paper investigates how to implement accurate RGB-T tracking by integrating the complementary information from RGB and thermal sources. Inspired by the success of Siamese networks in the RGB tracking field, we design a twofold Siamese network for RGB-T tracking, which is composed of an RGB branch and a thermal branch. Each branch is a Siamese network, which can be utilized to compute similarities between the search image and the exemplar image. The parameters in the RGB branch are kept the same as SiamFC. The thermal branch is initialized with parameter weights of the trained network in SiamFC and fine-tuned with constructed thermal image pairs to better capture the target characteristics in the thermal data. Two criteria are further proposed to measure the confidence degrees of response maps obtained by these two modalities. The final response map is computed by adaptively fusing them according to their confidence degrees. The maximum location on the final response map is identified as the target location, and the target scale is obtained through a simple multi-scale search. Experiments on the recently public benchmark RGB-T234 demonstrate the effectiveness of our proposed method when compared to other state-of-the-art trackers. The high performance of our proposed trakcer makes it easy to be implemented in embedded system.


Sign in / Sign up

Export Citation Format

Share Document