Unsupervised monocular depth estimation with aggregating image features and wavelet SSIM (Structural SIMilarity) loss

Monocular depth estimation is an essential task for scene understanding. The underlying structure of objects and stuff in a complex scene is critical to recovering accurate and visually pleasing depth maps. Global structure conveys scene layouts, while local structure reflects shape details. Recently developed approaches based on convolutional neural networks (CNNs) have significantly improved the performance of depth estimation. However, few of them take into account multi-scale structures in complex scenes. In this paper, we propose a Structure-Aware Residual Pyramid Network (SARPN) to exploit multi-scale structures for accurate depth prediction. We propose a Residual Pyramid Decoder (RPD) that expresses global scene structure in its upper levels to represent layouts and local structure in its lower levels to represent shape details. At each level, we propose Residual Refinement Modules (RRM) that predict residual maps to progressively add finer structures to the coarser structure predicted at the level above. To fully exploit multi-scale image features, we introduce an Adaptive Dense Feature Fusion (ADFF) module, which adaptively fuses effective features from all scales to infer the structures of each scale. Experimental results on the challenging NYU-Depth v2 dataset demonstrate that the proposed approach achieves state-of-the-art performance in both qualitative and quantitative evaluation. The code is available at https://github.com/Xt-Chen/SARPN.
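The coarse-to-fine idea in this abstract (each level upsamples the coarser depth map and adds a residual map) can be illustrated with a minimal sketch. This is not the SARPN implementation: the function names, the nearest-neighbour upsampling, and the hand-written residual values are all assumptions made for illustration, whereas in the paper the residuals are predicted by learned refinement modules.

```python
from typing import List

Grid = List[List[float]]


def upsample2x(depth: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling of a 2-D map (an assumed choice)."""
    out: Grid = []
    for row in depth:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def refine(coarse: Grid, residual: Grid) -> Grid:
    """Upsample the coarser map, then add the residual of the finer level."""
    up = upsample2x(coarse)
    return [[u + r for u, r in zip(u_row, r_row)]
            for u_row, r_row in zip(up, residual)]


def decode_pyramid(coarsest: Grid, residuals: List[Grid]) -> Grid:
    """Apply the refinement step level by level, coarsest to finest."""
    depth = coarsest
    for res in residuals:
        depth = refine(depth, res)
    return depth


# Hypothetical values: a 1x1 coarse layout refined to 2x2, then to 4x4.
coarsest = [[2.0]]
residuals = [
    [[0.5, -0.5], [0.0, 0.0]],        # adds detail at the 2x2 level
    [[0.0] * 4 for _ in range(4)],    # no further detail at the 4x4 level
]
full = decode_pyramid(coarsest, residuals)
```

The coarsest level fixes the overall layout, and every later level only has to express what the upsampled prediction is still missing.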


Unsupervised Monocular Depth Estimation for Autonomous Driving

Proceedings of the International Display Workshops ◽ 10.36463/idw.2019.3dsap2_3dp2-2 ◽ 2019 ◽ pp. 128

Author(s): Chih-Shuan Huang ◽ Wan-Nung Tsung ◽ Wei-Jong Yang ◽ Chin-Hsing Chen

Keyword(s): Depth Estimation ◽ Autonomous Driving ◽ Monocular Depth


On the Uncertainty of Self-Supervised Monocular Depth Estimation

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) ◽ 10.1109/cvpr42600.2020.00329 ◽ 2020 ◽ Cited By ~ 1

Author(s): Matteo Poggi ◽ Filippo Aleotti ◽ Fabio Tosi ◽ Stefano Mattoccia

Keyword(s): Depth Estimation ◽ Monocular Depth


Constant Velocity Constraints for Self-Supervised Monocular Depth Estimation

European Conference on Visual Media Production ◽ 10.1145/3429341.3429355 ◽ 2020

Author(s): Hang Zhou ◽ David Greenwood ◽ Sarah Taylor ◽ Han Gong

Keyword(s): Constant Velocity ◽ Depth Estimation ◽ Monocular Depth ◽ Velocity Constraints


Hierarchical Object Relationship Constrained Monocular Depth Estimation.

Pattern Recognition ◽ 10.1016/j.patcog.2021.108116 ◽ 2021 ◽ pp. 108116

Author(s): Shuai Li ◽ Jiaying Shi ◽ Wenfeng Song ◽ Aimin Hao ◽ Hong Qin

Keyword(s): Depth Estimation ◽ Monocular Depth ◽ Object Relationship


Monocular Depth Estimation with Joint Attention Feature Distillation and Wavelet-Based Loss Function

Sensors ◽ 10.3390/s21010054 ◽ 2020 ◽ Vol 21 (1) ◽ pp. 54

Author(s): Peng Liu ◽ Zonghua Zhang ◽ Zhaozong Meng ◽ Nan Gao

Keyword(s): Joint Attention ◽ Loss Function ◽ Depth Estimation ◽ Depth Information ◽ 3D Vision ◽ Network Training ◽ Crucial Component ◽ Benchmark Datasets ◽ Ill Posed ◽ Monocular Depth

Depth estimation is a crucial component in many 3D vision applications. Monocular depth estimation is attracting increasing interest because of its flexible use and extremely low system requirements, but its inherently ill-posed and ambiguous nature still leads to unsatisfactory estimation results. This paper proposes a new deep convolutional neural network for monocular depth estimation. The network applies joint attention feature distillation and a wavelet-based loss function to recover the depth information of a scene. It makes two improvements over previous methods. First, we combine feature distillation and joint attention mechanisms to boost the discrimination of feature modulation. The network extracts hierarchical features using a progressive feature distillation and refinement strategy and aggregates features using a joint attention operation. Second, we adopt a wavelet-based loss function for network training, which makes the loss more effective by capturing more structural detail. Experimental results on challenging indoor and outdoor benchmark datasets verify the proposed method's superiority over current state-of-the-art methods.
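The abstract does not spell out the form of the wavelet-based loss, so the following is only a sketch of the general idea under stated assumptions: a single-level 2-D Haar transform and an L1 penalty summed over the four sub-bands. The function names and the equal sub-band weighting are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Grid = List[List[float]]


def haar_level(img: Grid) -> Tuple[Grid, Grid, Grid, Grid]:
    """One level of an orthonormal 2-D Haar transform on an even-sized map.

    Returns (approx, horiz, vert, diag), each half the input resolution.
    """
    approx: Grid = []
    horiz: Grid = []
    vert: Grid = []
    diag: Grid = []
    for i in range(0, len(img), 2):
        r_a, r_h, r_v, r_d = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_a.append((a + b + c + d) / 2)  # local average (low-pass)
            r_h.append((a - b + c - d) / 2)  # difference between columns
            r_v.append((a + b - c - d) / 2)  # difference between rows
            r_d.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(r_a)
        horiz.append(r_h)
        vert.append(r_v)
        diag.append(r_d)
    return approx, horiz, vert, diag


def mean_abs_diff(x: Grid, y: Grid) -> float:
    """Mean absolute difference between two maps of equal size."""
    count = sum(len(row) for row in x)
    total = sum(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))
    return total / count


def wavelet_l1_loss(pred: Grid, target: Grid) -> float:
    """Sum of per-sub-band L1 errors (equal weights are an assumption)."""
    return sum(mean_abs_diff(p, t)
               for p, t in zip(haar_level(pred), haar_level(target)))


# Hypothetical 2x2 depth maps: the prediction has a spurious vertical edge.
target = [[1.0, 1.0], [1.0, 1.0]]
pred = [[1.0, 2.0], [1.0, 2.0]]
loss = wavelet_l1_loss(pred, target)
```

Because the detail sub-bands respond to edges, an error that adds or removes an edge is penalized separately from an error in the local average, which is one way a loss of this kind can emphasize structural detail.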


Time- and Resource-Efficient Time-to-Collision Forecasting for Indoor Pedestrian Obstacles Avoidance

Journal of Imaging ◽ 10.3390/jimaging7040061 ◽ 2021 ◽ Vol 7 (4) ◽ pp. 61

Author(s): David Urban ◽ Alice Caplier

Keyword(s): Neural Network ◽ Autonomous Vehicles ◽ Depth Estimation ◽ Video Camera ◽ Obstacle Detection ◽ Navigation Systems ◽ Time To Collision ◽ Static Data ◽ Monocular Depth ◽ Fully Connected

As difficult vision-based tasks such as object detection and monocular depth estimation make their way into real-time applications, and as more lightweight solutions for autonomous vehicle navigation systems emerge, obstacle detection and collision prediction remain two very challenging tasks for small embedded devices such as drones. We propose a novel lightweight and time-efficient vision-based solution that predicts Time-to-Collision from a monocular video camera embedded in a smartglasses device, as a module of a navigation system for visually impaired pedestrians. It consists of two modules: a static data extractor, a convolutional neural network that predicts the obstacle position and distance, and a dynamic data extractor that stacks the obstacle data from multiple frames and predicts the Time-to-Collision with a simple fully connected neural network. This paper focuses on the ability of the Time-to-Collision network to adapt, through supervised learning, to new scenes with different types of obstacles.
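For orientation, the quantity being predicted can be written in closed form under a constant-velocity assumption: Time-to-Collision is the current distance divided by the closing speed. The sketch below is such a kinematic baseline, not the fully connected network described in the abstract; the function name, the units, and the sample distances are assumptions.

```python
from typing import List, Optional


def time_to_collision(distances_m: List[float],
                      frame_dt_s: float) -> Optional[float]:
    """Constant-velocity TTC from obstacle distances in consecutive frames.

    Returns None when the obstacle is not getting closer.
    """
    if len(distances_m) < 2:
        raise ValueError("need distances from at least two frames")
    elapsed_s = frame_dt_s * (len(distances_m) - 1)
    closing_speed = (distances_m[0] - distances_m[-1]) / elapsed_s  # m/s
    if closing_speed <= 0:
        return None
    return distances_m[-1] / closing_speed


# Hypothetical per-frame distances (metres), sampled every 0.5 s.
ttc = time_to_collision([5.0, 4.5, 4.0], frame_dt_s=0.5)
```

A learned predictor can replace this formula when per-frame distances are noisy or the motion is not uniform, which is the setting the stacked multi-frame input is meant to handle.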


Monocular Depth Estimation Based on Multi-Scale Depth Map Fusion

IEEE Access ◽ 10.1109/access.2021.3076346 ◽ 2021 ◽ pp. 1-1

Author(s): Xin Yang ◽ Qingling Chang ◽ Xinglin Liu ◽ Siyuan He ◽ Yan Cui

Keyword(s): Depth Map ◽ Depth Estimation ◽ Multi Scale ◽ Monocular Depth


MonoER - A Edge Refined Self-Supervised Monocular Depth Estimation Method

2020 Chinese Automation Congress (CAC) ◽ 10.1109/cac51589.2020.9326510 ◽ 2020

Author(s): Tianyu Xiang ◽ Lingzhe Zhao ◽ Hao Zhang ◽ Zhuping Wang

Keyword(s): Estimation Method ◽ Depth Estimation ◽ Monocular Depth


PDANet: Self-Supervised Monocular Depth Estimation Using Perceptual and Data Augmentation Consistency

Applied Sciences ◽ 10.3390/app11125383 ◽ 2021 ◽ Vol 11 (12) ◽ pp. 5383

Author(s): Huachen Gao ◽ Xiaoyu Liu ◽ Meixia Qu ◽ Shijie Huang

Keyword(s): Data Augmentation ◽ State Of The Art ◽ Depth Estimation ◽ Input Image ◽ Depth Information ◽ Disparity Map ◽ Estimation Model ◽ Absolute Relative Error ◽ Texture Region ◽ Monocular Depth

In recent studies, self-supervised learning methods have been explored for monocular depth estimation. They use the image reconstruction loss, rather than ground-truth depth, as the supervisory signal. However, existing methods usually assume that corresponding points in different views have the same color, which leads to unreliable unsupervised signals and ultimately degrades the reconstruction loss during training. Meanwhile, in low-texture regions, the model cannot predict pixel disparity values correctly because too few features are extracted. To solve these issues, we propose a network, PDANet, that integrates perceptual consistency and data augmentation consistency, which are more reliable unsupervised signals, into a regular unsupervised depth estimation model. Specifically, we apply a reliable data augmentation mechanism and minimize the loss between the disparity maps generated from the original image and from the augmented image, which makes the prediction more robust to color fluctuation. At the same time, we aggregate the features of different layers extracted by a pre-trained VGG16 network to explore the higher-level perceptual differences between the input image and the generated one. Ablation studies demonstrate the effectiveness of each component, and PDANet produces high-quality depth estimation results on the KITTI benchmark, reducing the absolute relative error of the state-of-the-art method from 0.114 to 0.084.
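The absolute relative error quoted in this abstract is the mean of |predicted depth - true depth| / true depth over pixels with valid ground truth. A minimal sketch follows; the function name and sample values are hypothetical, and treating non-positive ground truth as "no measurement" is an assumed convention.

```python
from typing import Sequence


def abs_rel_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative depth error over valid ground-truth pixels."""
    # Assumption: ground-truth values <= 0 mark pixels without a measurement.
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


# Hypothetical depths in metres; the last pixel has no ground truth.
pred_depth = [10.0, 22.0, 27.0, 5.0]
gt_depth = [10.0, 20.0, 30.0, 0.0]
error = abs_rel_error(pred_depth, gt_depth)
```

Because each error is divided by the true depth, a 1 m mistake on a nearby object counts for more than the same mistake on a distant one.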
