Texture Synthesis Repair of RealSense D435i Depth Images with Object-Oriented RGB Image Segmentation

Longyu Zhang; Hao Xia; Yanyou Qiao

doi:10.3390/s20236725

RobotP: A Benchmark Dataset for 6D Object Pose Estimation

Sensors ◽

10.3390/s21041299 ◽

2021 ◽

Vol 21 (4) ◽

pp. 1299

Author(s):

Honglin Yuan ◽

Tim Hoogenkamp ◽

Remco C. Veltkamp

Keyword(s):

Pose Estimation ◽

Ground Truth ◽

3D Models ◽

Depth Image ◽

Great Success ◽

Estimation Algorithms ◽

Depth Images ◽

Object Pose Estimation ◽

Image Pairs ◽

Bounding Boxes

Deep learning has achieved great success on robotic vision tasks. However, when compared with other vision-based tasks, it is difficult to collect a representative and sufficiently large training set for six-dimensional (6D) object pose estimation, due to the inherent difficulty of data collection. In this paper, we propose the RobotP dataset consisting of commonly used objects for benchmarking in 6D object pose estimation. To create the dataset, we apply a 3D reconstruction pipeline to produce high-quality depth images, ground truth poses, and 3D models for well-selected objects. Subsequently, based on the generated data, we produce object segmentation masks and two-dimensional (2D) bounding boxes automatically. To further enrich the data, we synthesize a large number of photo-realistic color-and-depth image pairs with ground truth 6D poses. Our dataset is freely distributed to research groups by the Shape Retrieval Challenge benchmark on 6D pose estimation. Based on our benchmark, different learning-based approaches are trained and tested by the unified dataset. The evaluation results indicate that there is considerable room for improvement in 6D object pose estimation, particularly for objects with dark colors, and photo-realistic images are helpful in increasing the performance of pose estimation algorithms.

Download Full-text

Multiple Classifiers-Based Feature Fusion for RGB-D Object Recognition

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001417500148 ◽

2017 ◽

Vol 31 (05) ◽

pp. 1750014 ◽

Cited By ~ 3

Author(s):

Yan Wu ◽

Jiqian Li ◽

Jing Bai

Keyword(s):

Object Recognition ◽

Feature Fusion ◽

Classification Performance ◽

Depth Image ◽

Depth Information ◽

The Past ◽

Depth Images ◽

Comparable Performance ◽

Accuracy Difference ◽

Rgb Image

RGB-D-based object recognition has been enthusiastically investigated in the past few years. RGB and depth images provide useful and complementary information. Fusing RGB and depth features can significantly increase the accuracy of object recognition. However, previous works just simply take the depth image as the fourth channel of the RGB image and concatenate the RGB and depth features, ignoring the different power of RGB and depth information for different objects. In this paper, a new method which contains three different classifiers is proposed to fuse features extracted from RGB image and depth image for RGB-D-based object recognition. Firstly, a RGB classifier and a depth classifier are trained by cross-validation to get the accuracy difference between RGB and depth features for each object. Then a variant RGB-D classifier is trained with different initialization parameters for each class according to the accuracy difference. The variant RGB-D-classifier can result in a more robust classification performance. The proposed method is evaluated on two benchmark RGB-D datasets. Compared with previous methods, ours achieves comparable performance with the state-of-the-art method.

Download Full-text

Three-Filters-to-Normal: An Accurate and Ultrafast Surface Normal Estimator

10.36227/techrxiv.12362249.v2 ◽

2020 ◽

Author(s):

Rui Fan ◽

Hengli Wang ◽

Bohuan Xue ◽

Huaiyang Huang ◽

Yuan Wang ◽

...

Keyword(s):

Large Scale ◽

Median Filter ◽

Ground Truth ◽

Depth Image ◽

Range Data ◽

3 Dimensional ◽

Depth Images ◽

Mesh Model ◽

Surface Normal ◽

Synthetic Datasets

Over the past decade, significant efforts have been made to improve the trade-off between speed and accuracy of surface normal estimators (SNEs). This paper introduces an accurate and ultrafast SNE for structured range data. The proposed approach computes surface normals by simply performing three filtering operations, namely, two image gradient filters (in horizontal and vertical directions, respectively) and a mean/median filter, on an inverse depth image or a disparity image. Despite the simplicity of the method, no similar method already exists in the literature. In our experiments, we created three large-scale synthetic datasets (easy, medium and hard) using 24 3-dimensional (3D) mesh models. Each mesh model is used to generate 1800--2500 pairs of 480x640 pixel depth images and the corresponding surface normal ground truth from different views. The average angular errors with respect to the easy, medium and hard datasets are 1.6 degrees, 5.6 degrees and 15.3 degrees, respectively. Our C++ and CUDA implementations achieve a processing speed of over 260 Hz and 21 kHz, respectively. Our proposed SNE achieves a better overall performance than all other existing computer vision-based SNEs. Our datasets and source code are publicly available at: sites.google.com/view/3f2n.

Download Full-text

Three-Filters-to-Normal: An Accurate and Ultrafast Surface Normal Estimator

10.36227/techrxiv.12362249.v1 ◽

2020 ◽

Author(s):

Rui Fan ◽

Hengli Wang ◽

Bohuan Xue ◽

Huaiyang Huang ◽

Yuan Wang, ◽

...

Keyword(s):

Large Scale ◽

Median Filter ◽

Ground Truth ◽

Depth Image ◽

Range Data ◽

3 Dimensional ◽

Depth Images ◽

Mesh Model ◽

Surface Normal ◽

Synthetic Datasets

Over the past decade, significant efforts have been made to improve the trade-off between speed and accuracy of surface normal estimators (SNEs). This paper introduces an accurate and ultrafast SNE for structured range data. The proposed approach computes surface normals by simply performing three filtering operations, namely, two image gradient filters (in horizontal and vertical directions, respectively) and a mean/median filter, on an inverse depth image or a disparity image. Despite the simplicity of the method, no similar method already exists in the literature. In our experiments, we created three large-scale synthetic datasets (easy, medium and hard) using 24 3-dimensional (3D) mesh models. Each mesh model is used to generate 1800--2500 pairs of 480x640 pixel depth images and the corresponding surface normal ground truth from different views. The average angular errors with respect to the easy, medium and hard datasets are 1.6 degrees, 5.6 degrees and 15.3 degrees, respectively. Our C++ and CUDA implementations achieve a processing speed of over 260 Hz and 21 kHz, respectively. Our proposed SNE achieves a better overall performance than all other existing computer vision-based SNEs. Our datasets and source code are publicly available at: sites.google.com/view/3f2n.

Download Full-text

Automatic 3D Landmark Extraction System Based on an Encoder–Decoder Using Fusion of Vision and LiDAR

Remote Sensing ◽

10.3390/rs12071142 ◽

2020 ◽

Vol 12 (7) ◽

pp. 1142

Author(s):

Jeonghoon Kwak ◽

Yunsick Sung

Keyword(s):

Point Cloud ◽

Point Clouds ◽

Depth Image ◽

3D Point Cloud ◽

Digital World ◽

Depth Images ◽

3D Point Clouds ◽

Rgb Images ◽

Rgb Image ◽

3D Landmarks

To provide a realistic environment for remote sensing applications, point clouds are used to realize a three-dimensional (3D) digital world for the user. Motion recognition of objects, e.g., humans, is required to provide realistic experiences in the 3D digital world. To recognize a user’s motions, 3D landmarks are provided by analyzing a 3D point cloud collected through a light detection and ranging (LiDAR) system or a red green blue (RGB) image collected visually. However, manual supervision is required to extract 3D landmarks as to whether they originate from the RGB image or the 3D point cloud. Thus, there is a need for a method for extracting 3D landmarks without manual supervision. Herein, an RGB image and a 3D point cloud are used to extract 3D landmarks. The 3D point cloud is utilized as the relative distance between a LiDAR and a user. Because it cannot contain all information the user’s entire body due to disparities, it cannot generate a dense depth image that provides the boundary of user’s body. Therefore, up-sampling is performed to increase the density of the depth image generated based on the 3D point cloud; the density depends on the 3D point cloud. This paper proposes a system for extracting 3D landmarks using 3D point clouds and RGB images without manual supervision. A depth image provides the boundary of a user’s motion and is generated by using 3D point cloud and RGB image collected by a LiDAR and an RGB camera, respectively. To extract 3D landmarks automatically, an encoder–decoder model is trained with the generated depth images, and the RGB images and 3D landmarks are extracted from these images with the trained encoder model. The method of extracting 3D landmarks using RGB depth (RGBD) images was verified experimentally, and 3D landmarks were extracted to evaluate the user’s motions with RGBD images. In this manner, landmarks could be extracted according to the user’s motions, rather than by extracting them using the RGB images. The depth images generated by the proposed method were 1.832 times denser than the up-sampling-based depth images generated with bilateral filtering.

Download Full-text

Three-Filters-to-Normal: An Accurate and Ultrafast Surface Normal Estimator

10.36227/techrxiv.12362249.v3 ◽

2021 ◽

Author(s):

Rui Fan ◽

Hengli Wang ◽

Bohuan Xue ◽

Huaiyang Huang ◽

Yuan Wang ◽

...

Keyword(s):

Large Scale ◽

Median Filter ◽

Ground Truth ◽

Sensor Data ◽

Depth Image ◽

Depth Images ◽

Surface Normal ◽

Synthetic Datasets ◽

Normal Maps ◽

Mesh Models

This paper proposes three-filters-to-normal (3F2N), an accurate and ultrafast surface normal estimator (SNE), which is designed for structured range sensor data, e.g., depth/disparity images. 3F2N SNE computes surface normals by simply performing three filtering operations (two image gradient filters in horizontal and vertical directions, respectively, and a mean/median filter) on an inverse depth image or a disparity image. Despite the simplicity of 3F2N SNE, no similar method already exists in the literature. To evaluate the performance of our proposed SNE, we created three large-scale synthetic datasets (easy, medium and hard) using 24 3D mesh models, each of which is used to generate 1800--2500 pairs of depth images (resolution: 480X640 pixels) and the corresponding ground-truth surface normal maps from different views. 3F2N SNE demonstrates the state-of-the-art performance, outperforming all other existing geometry-based SNEs, where the average angular errors with respect to the easy, medium and hard datasets are 1.66 degrees, 5.69 degrees and 15.31 degrees, respectively. Furthermore, our C++ and CUDA implementations achieve a processing speed of over 260 Hz and 21 kHz, respectively. Our datasets and source code are publicly available at sites.google.com/view/3f2n.

Download Full-text

Synthetic Depth Transfer for Monocular 3D Object Pose Estimation in the Wild

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6781 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11221-11228

Author(s):

Yueying Kao ◽

Weiming Li ◽

Qiang Wang ◽

Zhouchen Lin ◽

Wooshik Kim ◽

...

Keyword(s):

Pose Estimation ◽

Large Scale ◽

Synthetic Data ◽

Real Data ◽

Depth Image ◽

Depth Images ◽

In The Wild ◽

Object Pose Estimation ◽

Image Pairs ◽

Rgb Image

Monocular object pose estimation is an important yet challenging computer vision problem. Depth features can provide useful information for pose estimation. However, existing methods rely on real depth images to extract depth features, leading to its difficulty on various applications. In this paper, we aim at extracting RGB and depth features from a single RGB image with the help of synthetic RGB-depth image pairs for object pose estimation. Specifically, a deep convolutional neural network is proposed with an RGB-to-Depth Embedding module and a Synthetic-Real Adaptation module. The embedding module is trained with synthetic pair data to learn a depth-oriented embedding space between RGB and depth images optimized for object pose estimation. The adaptation module is to further align distributions from synthetic to real data. Compared to existing methods, our method does not need any real depth images and can be trained easily with large-scale synthetic data. Extensive experiments and comparisons show that our method achieves best performance on a challenging public PASCAL 3D+ dataset in all the metrics, which substantiates the superiority of our method and the above modules.

Download Full-text

Real-Time Height Measurement for Moving Pedestrians

Complexity ◽

10.1155/2020/5708593 ◽

2020 ◽

Vol 2020 ◽

pp. 1-15

Author(s):

Wenju Zhou ◽

Fulong Yao ◽

Wei Feng ◽

Haikuan Wang

Keyword(s):

Real Time ◽

Ground Truth ◽

Pso Algorithm ◽

Depth Image ◽

Head Region ◽

Height Measurement ◽

The Real ◽

Depth Images ◽

Criminal Suspect ◽

Tof Camera

Height measurement for moving pedestrians is quite significant in many scenarios, such as pedestrian positioning, criminal suspect tracking, and virtual reality. Although some existing height measurement methods can detect the height of the static people, it is hard to measure height accurately for moving pedestrians. Considering the height fluctuations in dynamic situation, this paper proposes a real-time height measurement based on the Time-of-Flight (TOF) camera. Depth images in a continuous sequence are addressed to obtain the real-time height of the pedestrian with moving. Firstly, a normalization equation is presented to convert the depth image into the grey image for a lower time cost and better performance. Secondly, a difference-particle swarm optimization (D-PSO) algorithm is proposed to remove the complex background and reduce the noises. Thirdly, a segmentation algorithm based on the maximally stable extremal regions (MSERs) is introduced to extract the pedestrian head region. Then, a novel multilayer iterative average algorithm (MLIA) is developed for obtaining the height of dynamic pedestrians. Finally, Kalman filtering is used to improve the measurement accuracy by combining the current measurement and the height at the last moment. In addition, the VICON system is adopted as the ground truth to verify the proposed method, and the result shows that our method can accurately measure the real-time height of moving pedestrians.

Download Full-text

Locating Mechanical Switches Using RGB-D Sensor Mounted on a Disaster Response Robot

Electronic Imaging ◽

10.2352/issn.2470-1173.2020.6.iriacv-016 ◽

2020 ◽

Vol 2020 (6) ◽

pp. 16-1-16-7

Author(s):

Takuya Kanda ◽

Kazuya Miyakawa ◽

Jeonghwang Hayashi ◽

Jun Ohya ◽

Hiroyuki Ogata ◽

...

Keyword(s):

Disaster Response ◽

Point Cloud ◽

Real Space ◽

Depth Image ◽

3D Point Cloud ◽

Point Cloud Data ◽

Cloud Data ◽

Depth Images ◽

Bounding Box ◽

Rgb Image

To achieve one of the tasks required for disaster response robots, this paper proposes a method for locating 3D structured switches’ points to be pressed by the robot in disaster sites using RGBD images acquired by Kinect sensor attached to our disaster response robot. Our method consists of the following five steps: 1)Obtain RGB and depth images using an RGB-D sensor. 2) Detect the bounding box of switch area from the RGB image using YOLOv3. 3)Generate 3D point cloud data of the target switch by combining the bounding box and the depth image.4)Detect the center position of the switch button from the RGB image in the bounding box using Convolutional Neural Network (CNN). 5)Estimate the center of the button’s face in real space from the detection result in step 4) and the 3D point cloud data generated in step3) In the experiment, the proposed method is applied to two types of 3D structured switch boxes to evaluate the effectiveness. The results show that our proposed method can locate the switch button accurately enough for the robot operation.

Download Full-text

Three-Filters-to-Normal: An Accurate and Ultrafast Surface Normal Estimator

10.36227/techrxiv.12362249 ◽

2020 ◽

Author(s):

Rui Fan ◽

Hengli Wang ◽

Bohuan Xue ◽

Huaiyang Huang ◽

Yuan Wang ◽

...

Keyword(s):

Large Scale ◽

Median Filter ◽

Ground Truth ◽

Depth Image ◽

Range Data ◽

3 Dimensional ◽

Depth Images ◽

Mesh Model ◽

Surface Normal ◽

Synthetic Datasets

Over the past decade, significant efforts have been made to improve the trade-off between speed and accuracy of surface normal estimators (SNEs). This paper introduces an accurate and ultrafast SNE for structured range data. The proposed approach computes surface normals by simply performing three filtering operations, namely, two image gradient filters (in horizontal and vertical directions, respectively) and a mean/median filter, on an inverse depth image or a disparity image. Despite the simplicity of the method, no similar method already exists in the literature. In our experiments, we created three large-scale synthetic datasets (easy, medium and hard) using 24 3-dimensional (3D) mesh models. Each mesh model is used to generate 1800--2500 pairs of 480x640 pixel depth images and the corresponding surface normal ground truth from different views. The average angular errors with respect to the easy, medium and hard datasets are 1.6 degrees, 5.6 degrees and 15.3 degrees, respectively. Our C++ and CUDA implementations achieve a processing speed of over 260 Hz and 21 kHz, respectively. Our proposed SNE achieves a better overall performance than all other existing computer vision-based SNEs. Our datasets and source code are publicly available at: sites.google.com/view/3f2n.

Download Full-text