Bird's Eye View Look-Up Table Estimation with Semantic Segmentation

2021
Vol 11 (17)
pp. 8047
Author(s):
Dongkyu Lee
Wee Peng Tay
Seok-Cheol Kee

In this work, a study was carried out to estimate a look-up table (LUT) that converts a camera image plane to a bird's eye view (BEV) plane using a single camera. Traditional camera pose estimation approaches impose high research and manufacturing costs on future autonomous vehicles and may require pre-configured infrastructure. This paper proposes a low-cost camera calibration system for autonomous driving that requires minimal infrastructure. We studied a network that estimates the camera pose under urban road driving conditions from a single camera and outputs an image in the form of a LUT that converts the camera image into a BEV. We collected synthetic data using a simulator, generated BEV images and LUTs as ground truth, and used the proposed network and ground truth to train the pose estimation function. In the process, the network predicts the pose by interpreting semantic segmentation features, and its performance is increased by attaching a layer that handles the overall direction of the network. The network outputs the camera angles (roll/pitch/yaw) in a 3D coordinate system so that the user can monitor learning. Since the network's output is a LUT, no additional calculation is needed, and real-time performance is improved.
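
The core idea of a BEV look-up table is that each output pixel stores which source pixel to sample, so the warp reduces to a single indexing operation. A minimal sketch (names and LUT format hypothetical; the paper's LUT layout and interpolation may differ):

```python
import numpy as np

def apply_bev_lut(image, lut):
    """Warp a camera image to a BEV image using a per-pixel look-up table.

    lut[v, u] = (src_y, src_x): for each BEV pixel, the source pixel to
    sample (nearest-neighbor here; a real system might interpolate).
    """
    src_y = np.clip(lut[..., 0].astype(int), 0, image.shape[0] - 1)
    src_x = np.clip(lut[..., 1].astype(int), 0, image.shape[1] - 1)
    return image[src_y, src_x]

# Toy example: a 2x2 LUT that flips a 2x2 image vertically.
img = np.array([[1, 2],
                [3, 4]])
lut = np.array([[[1, 0], [1, 1]],
                [[0, 0], [0, 1]]])
bev = apply_bev_lut(img, lut)  # [[3, 4], [1, 2]]
```

Because the warp is a single gather per frame, no per-frame geometric computation is needed, which is what makes a LUT output attractive for real-time use.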

2019
Vol 39 (9)
pp. 0915004
Author(s):
Xiongfeng Zhang
Haibo Liu
Yang Shang

Sensors
2019
Vol 19 (17)
pp. 3784
Author(s):
Jameel Malik
Ahmed Elhayek
Didier Stricker

Hand shape and pose recovery is essential for many computer vision applications, such as animating a personalized hand mesh in a virtual environment. Although there are many hand pose estimation methods, only a few deep-learning-based algorithms target 3D hand shape and pose from a single RGB or depth image. Jointly estimating hand shape and pose is very challenging because none of the existing real benchmarks provides ground truth hand shape. For this reason, we propose a novel weakly supervised approach for 3D hand shape and pose recovery (named WHSP-Net) from a single depth image that learns shapes from unlabeled real data and labeled synthetic data. To this end, we propose a novel framework consisting of three components. The first is a Convolutional Neural Network (CNN)-based deep network that produces 3D joint positions from learned 3D bone vectors using a new layer. The second is a novel shape decoder that recovers a dense 3D hand mesh from sparse joints. The third is a novel depth synthesizer that reconstructs a 2D depth image from the 3D hand mesh. The whole pipeline is fine-tuned in an end-to-end manner. We demonstrate that our approach recovers reasonable hand shapes from real-world datasets as well as from a live depth camera stream in real time. Our algorithm outperforms state-of-the-art methods that output more than the joint positions and shows competitive performance on the 3D pose estimation task.
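
The first component maps learned 3D bone vectors to joint positions; one plausible way to realize such a layer is to accumulate bone vectors along a kinematic chain from a root joint. A sketch under that assumption (topology and names hypothetical, not WHSP-Net's actual layer):

```python
import numpy as np

def joints_from_bones(root, bone_vectors, parents):
    """Accumulate bone vectors along a kinematic chain to recover joints:
    joint[i + 1] = joint[parents[i]] + bone_vectors[i]."""
    joints = [np.asarray(root, dtype=float)]
    for bone, p in zip(bone_vectors, parents):
        joints.append(joints[p] + np.asarray(bone, dtype=float))
    return np.stack(joints)

# A two-bone chain: root -> joint 1 -> joint 2.
joints = joints_from_bones([0, 0, 0],
                           bone_vectors=[[1, 0, 0], [0, 1, 0]],
                           parents=[0, 1])  # [[0,0,0], [1,0,0], [1,1,0]]
```

Predicting bone vectors rather than raw joint coordinates bakes the skeleton's connectivity into the network output, which is why such layers are used as a structural prior.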


2020
Vol 10 (24)
pp. 8866
Author(s):
Sangyoon Lee
Hyunki Hong
Changkyoung Eem

Deep learning has been utilized in end-to-end camera pose estimation. To improve performance, we introduce a camera pose estimation method based on a 2D–3D matching scheme with two convolutional neural networks (CNNs). The scene is divided into voxels whose size and number are computed according to the scene volume and the number of 3D points. We extract inlier points from the 3D point set in a voxel using random sample consensus (RANSAC)-based plane fitting to obtain a set of interest points lying on a major plane. These points are then reprojected onto the image using the ground truth camera pose, after which a polygonal region is identified in each voxel using the convex hull. We designed a training dataset for 2D–3D matching, consisting of the inlier 3D points, correspondences across image pairs, and the voxel regions in the image. We trained a hierarchical learning structure with two CNNs on this dataset to detect the voxel regions and obtain the locations and descriptions of the interest points. Following successful 2D–3D matching, the camera pose was estimated using an n-point pose solver within RANSAC. The experimental results show that our method estimates the camera pose more precisely than previous end-to-end estimators.
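
The RANSAC-based plane-fitting step described above can be sketched generically as follows (an illustrative implementation, not the authors' code; threshold and iteration count are arbitrary):

```python
import numpy as np

def ransac_plane(points, n_iters=200, thresh=0.05, seed=0):
    """Fit the dominant plane n.x + d = 0 to a 3D point set with RANSAC.

    Returns (normal, d, inlier_mask) of the model with the most inliers.
    """
    rng = np.random.default_rng(seed)
    best_mask, best_model = None, None
    for _ in range(n_iters):
        # Hypothesize a plane from a random minimal sample of 3 points.
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:            # degenerate (collinear) sample
            continue
        n = n / norm
        d = -n.dot(sample[0])
        # Score by counting points within the distance threshold.
        mask = np.abs(points @ n + d) < thresh
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (n, d)
    return best_model[0], best_model[1], best_mask

# Four points on the z = 0 ground plane plus one outlier.
pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [0.5, 0.5, 5.0]],
               dtype=float)
n, d, inliers = ransac_plane(pts)  # inliers marks the four coplanar points
```

The returned inlier mask is what yields the interest points "consisting of a major plane" before they are reprojected into the image.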


2021
Vol 0 (0)
Author(s):
Frank Bieder
Sascha Wirges
Sven Richter
Christoph Stiller

Abstract In this work, we improve the semantic segmentation of multi-layer top-view grid maps in the context of LiDAR-based perception for autonomous vehicles. To achieve this goal, we fuse sequential information from multiple consecutive LiDAR measurements with respect to the driven trajectory of an autonomous vehicle. By doing so, we enrich the multi-layer grid maps that are subsequently used as the input of a neural network. Our approach can be used for LiDAR-only 360° surround-view semantic scene segmentation while remaining suitable for real-time critical systems. We evaluate the benefit of fusing sequential information using a dense semantic ground truth and discuss the effect on different classes.
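
Fusing consecutive measurements with respect to the driven trajectory amounts to aligning the previous grid map with the current ego pose before combining the two. A toy sketch with a whole-cell shift and an element-wise maximum as a stand-in for the paper's fusion (all details assumed):

```python
import numpy as np

def fuse_grid_maps(prev_map, curr_map, shift_cells):
    """Fuse the previous top-view grid map into the current one: shift it by
    the ego displacement (in whole cells) and take the element-wise maximum."""
    dy, dx = shift_cells
    h, w = prev_map.shape
    shifted = np.zeros_like(prev_map)
    shifted[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = \
        prev_map[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    return np.maximum(shifted, curr_map)

# An obstacle observed one cell back reappears, aligned, in the fused map.
prev = np.array([[1, 0],
                 [0, 0]])
curr = np.array([[0, 0],
                 [0, 2]])
fused = fuse_grid_maps(prev, curr, (1, 0))  # [[0, 0], [1, 2]]
```

The enriched map then serves as a denser network input than any single sweep.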


Sensors
2019
Vol 19 (17)
pp. 3787
Author(s):
Łukasz Chechliński
Barbara Siemiątkowska
Michał Majewski

Automated weeding is an important research area in agrorobotics. Weeds can be removed mechanically or with the precise application of herbicides. Deep learning techniques have achieved state-of-the-art results in many computer vision tasks; however, their deployment on low-cost mobile computers is still challenging. The described system contains several novelties compared both with its previous version and with related work. It is part of a project on an automatic weeding machine, developed by the Warsaw University of Technology and MCMS Warka Ltd. The obtained models reach satisfying accuracy (detecting 47–67% of weed area while misclassifying as weed only 0.1–0.9% of crop area) at over 10 FPS on the Raspberry Pi 3B+ computer. The system was tested on four different plant species at different growth stages and under different lighting conditions. The semantic segmentation system is based on Convolutional Neural Networks; its custom architecture combines U-Net, MobileNets, DenseNet and ResNet concepts. The amount of manual ground truth labels needed was significantly decreased by knowledge distillation: the final model is trained to mimic an ensemble of complex models on a large database of unlabeled data. A further decrease of the inference time was obtained by two custom modifications: the use of separable convolutions in the DenseNet block and a change in the number of channels in each layer. In the authors' opinion, the described novelties can be easily transferred to other agrorobotics tasks.
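
The separable-convolution modification saves inference time because a depthwise convolution followed by a 1×1 pointwise convolution has far fewer parameters (and multiply-adds) than a full convolution. A quick parameter-count comparison using the standard formulas (illustrative channel sizes, not the authors' exact configuration):

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

# e.g. a 64 -> 128 channel layer with 3x3 kernels
standard = conv_params(64, 128, 3)             # 73728
separable = separable_conv_params(64, 128, 3)  # 8768, roughly 8x fewer
```

Savings of this order are what make over 10 FPS feasible on a Raspberry Pi class device.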


Author(s):  
Robert D. Leary ◽  
Sean Brennan

Currently, there is a lack of low-cost, real-time solutions for accurate autonomous vehicle localization. The fusion of a precise a priori map and a forward-facing camera can provide an alternative low-cost method for achieving centimeter-level localization. This paper analyzes the position and orientation bounds, or region of attraction, with which a real-time vehicle pose estimator can localize using monocular vision and a lane marker map. A pose estimation algorithm minimizes the residual pixel-level error between the estimated and detected lane marker features via Gauss-Newton nonlinear least-squares. Simulations of typical road scenes were used as ground truth to ensure the pose estimator will converge to the true vehicle pose. A successful convergence was defined as a pose estimate that fell within 5 cm and 0.25 degrees of the true vehicle pose. The results show that the longitudinal vehicle state is weakly observable with the smallest region of attraction. Estimating the remaining five vehicle states gives repeatable convergence within the prescribed convergence bounds over a relatively large region of attraction, even for the simple lane detection methods used herein. A main contribution of this paper is to demonstrate a repeatable and verifiable method to assess and compare lane-based vehicle localization strategies.
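The Gauss-Newton minimization of residual pixel error described above has a simple generic form: repeatedly linearize the residual and solve the normal equations. A sketch on a toy one-parameter problem (not the paper's six-state vehicle model; the residual here is linear, so one step converges exactly):

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, n_iters=10):
    """Generic Gauss-Newton: x <- x - (J^T J)^{-1} J^T r(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        r = residual(x)        # residual vector at the current estimate
        J = jacobian(x)        # Jacobian of the residual
        x = x - np.linalg.solve(J.T @ J, J.T @ r)
    return x

# Toy 1D "pose": shift x so features at x + [0, 1, 2] match detections
# at [5, 6, 7].
detections = np.array([5.0, 6.0, 7.0])
residual = lambda x: (x[0] + np.array([0.0, 1.0, 2.0])) - detections
jacobian = lambda x: np.ones((3, 1))
x_hat = gauss_newton(residual, jacobian, [0.0])  # converges to 5.0
```

In the real estimator the residual is the pixel-level mismatch between detected and reprojected lane marker features, and the region of attraction studied in the paper is the set of initial guesses from which this iteration reaches the true pose.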


Author(s):  
E. V. Shalnov ◽  
A. S. Konushin

Known scene geometry and camera calibration parameters provide important information to video content analysis systems. In this paper, we propose a novel method for camera pose estimation based on observations of people in input video captured by a static camera. As opposed to previous techniques, our method can deal with false positive detections and inaccurate localization results. Specifically, the proposed method does not make any assumption about the object detector used and takes it as a parameter. Moreover, we do not require a huge labeled dataset of real data and train on synthetic data only. We apply the proposed technique to camera pose estimation based on head observations. Our experiments show that the algorithm trained on the synthetic dataset generalizes to real data and is robust to false positive detections.


Author(s):  
Jonas Hein ◽  
Matthias Seibold ◽  
Federica Bogo ◽  
Mazda Farshad ◽  
Marc Pollefeys ◽  
...  

Abstract Purpose:  Tracking of tools and surgical activity is becoming increasingly important in the context of computer-assisted surgery. In this work, we present a data generation framework, a dataset and baseline methods to facilitate further research in the direction of markerless hand and instrument pose estimation in realistic surgical scenarios. Methods:  We developed a rendering pipeline to create inexpensive and realistic synthetic data for model pretraining. Subsequently, we propose a pipeline to capture and label real data with hand and object pose ground truth in an experimental setup to gather high-quality real data. We furthermore present three state-of-the-art RGB-based pose estimation baselines. Results:  We evaluate the three baseline models on the proposed datasets. The best performing baseline achieves an average tool 3D vertex error of 16.7 mm on synthetic data and 13.8 mm on real data, which is comparable to the state of the art in RGB-based hand/object pose estimation. Conclusion:  To the best of our knowledge, we propose the first synthetic and real data generation pipelines to generate hand and object pose labels for open surgery. We present three baseline models for RGB-based object and object/hand pose estimation from RGB frames. Our realistic synthetic data generation pipeline may help overcome the data bottleneck in the surgical domain and can easily be transferred to other medical applications.
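
The average 3D vertex error reported for the baselines is simply the mean Euclidean distance between corresponding predicted and ground-truth mesh vertices; a minimal sketch (units are whatever the meshes use, e.g. mm):

```python
import numpy as np

def mean_vertex_error(pred, gt):
    """Average Euclidean distance between predicted and ground-truth
    3D mesh vertices (vertex correspondence assumed)."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# Two vertices, off by 3 and 4 units respectively.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 3.0], [1.0, 4.0, 0.0]])
err = mean_vertex_error(pred, gt)  # (3 + 4) / 2 = 3.5
```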


2021
Vol 102 (4)
Author(s):
Chenhao Yang
Yuyi Liu
Andreas Zell

Abstract Learning-based visual localization has become a promising direction over the past decades. Since ground truth pose labels are difficult to obtain, recent methods try to learn pose estimation networks using pixel-perfect synthetic data. However, this also introduces the problem of domain bias. In this paper, we first build a Tuebingen Buildings dataset of RGB images collected by a drone in urban scenes and create a 3D model for each scene. A large number of synthetic images are generated based on these 3D models. We take advantage of image style transfer and cycle-consistent adversarial training to predict the relative camera poses of image pairs based on training over synthetic environment data. We propose a relative camera pose estimation approach to solve the continuous localization problem for autonomous navigation of unmanned systems. Unlike existing learning-based camera pose estimation methods that train and test in a single scene, our approach successfully estimates the relative camera poses of multiple city locations with a single trained model. We use the Tuebingen Buildings and the Cambridge Landmarks datasets to evaluate the performance of our approach in a single scene and across scenes. For each dataset, we compare the performance between models trained on real images and on synthetic images. We also test our model on the indoor 7Scenes dataset to demonstrate its generalization ability.
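
Given two absolute camera-to-world poses, the relative pose such a network is supervised with can be computed as T_rel = T_a^{-1} T_b; a minimal sketch assuming 4x4 homogeneous matrices:

```python
import numpy as np

def relative_pose(T_a, T_b):
    """Relative transform from camera frame a to camera frame b,
    given 4x4 camera-to-world poses: T_rel = inv(T_a) @ T_b."""
    return np.linalg.inv(T_a) @ T_b

# Two cameras translated along x: the relative pose is a 2-unit x-translation.
T_a = np.eye(4); T_a[0, 3] = 1.0
T_b = np.eye(4); T_b[0, 3] = 3.0
T_rel = relative_pose(T_a, T_b)  # translation component (2, 0, 0)
```

Supervising on relative rather than absolute poses is what allows one trained model to cover multiple city locations: the target no longer depends on a scene-specific world frame.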


Author(s):  
Łukasz Chechliński ◽  
Barbara Siemiątkowska ◽  
Michał Majewski

Automated weeding is an important research area in agrorobotics. Weeds can be removed mechanically or with the precise application of herbicides. Deep learning techniques have achieved state-of-the-art results in many computer vision tasks; however, their deployment on low-cost mobile computers is still challenging. This paper presents an advanced version of the system presented in [1]. The described system contains several novelties compared both with its previous version and with related work. It is part of a project on an automatic weeding machine, developed by the Warsaw University of Technology and MCMS Warka Ltd. The obtained model reaches satisfying accuracy at over 10 FPS on the Raspberry Pi 3B+ computer. It was tested on four different plant species at different growth stages and under different lighting conditions. The semantic segmentation system is based on Convolutional Neural Networks; its custom architecture combines U-Net, MobileNets, DenseNet and ResNet concepts. The amount of manual ground truth labels needed was significantly decreased by the knowledge distillation process, training the final model to mimic an ensemble of complex models on a large database of unlabeled data. A further decrease of the inference time was obtained by two custom modifications: the use of separable convolutions in the DenseNet block and a change in the number of channels in each layer. In the authors' opinion, the described novelties can be easily transferred to other agrorobotics tasks.

