MeshLifter: Weakly Supervised Approach for 3D Human Mesh Reconstruction from a Single 2D Pose Based on Loop Structure

Sunwon Jeong; Ju Yong Chang

doi:10.3390/s20154257

Sensors ◽

10.3390/s20154257 ◽

2020 ◽

Vol 20 (15) ◽

pp. 4257

Author(s):

Sunwon Jeong ◽

Ju Yong Chang

Keyword(s):

Ground Truth ◽

Reconstruction Error ◽

Loop Structure ◽

Ground Truth Data ◽

3D Data ◽

Reconstruction Performance ◽

Human Pose ◽

Mesh Reconstruction ◽

Weakly Supervised ◽

2D And 3D

In this paper, we address the problem of 3D human mesh reconstruction from a single 2D human pose based on deep learning. We propose MeshLifter, a network that estimates a 3D human mesh from an input 2D human pose. Unlike most existing 3D human mesh reconstruction studies that train models using paired 2D and 3D data, we propose a weakly supervised learning method based on a loop structure to train the MeshLifter. The proposed method alleviates the difficulty of obtaining ground-truth 3D data to ensure that the MeshLifter can be trained successfully from a 2D human pose dataset and an unpaired 3D motion capture dataset. We compare the proposed method with recent state-of-the-art studies through various experiments and show that the proposed method achieves effective 3D human mesh reconstruction performance. Notably, our proposed method achieves a reconstruction error of 59.1 mm without using the 3D ground-truth data of Human3.6M, the standard dataset for 3D human mesh reconstruction.

3D Human Pose Estimation Using Spatio-Temporal Networks with Explicit Occlusion Training

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6689 ◽

2020 ◽

Vol 34 (07) ◽

pp. 10631-10638

Author(s):

Yu Cheng ◽

Bo Yang ◽

Bo Wang ◽

Robby T. Tan

Keyword(s):

Pose Estimation ◽

Ground Truth ◽

Video Data ◽

Training Data ◽

Human Pose Estimation ◽

Ground Truth Data ◽

Public Data ◽

Spatio Temporal ◽

Human Pose ◽

3D Human Pose Estimation

Estimating 3D poses from a monocular video is still a challenging task, despite the significant progress that has been made in recent years. Generally, the performance of existing methods drops when the target person is too small/large, or the motion is too fast/slow relative to the scale and speed of the training data. Moreover, to our knowledge, many of these methods are not explicitly designed or trained for severe occlusion, which compromises their performance when occlusion occurs. To address these problems, we introduce a spatio-temporal network for robust 3D human pose estimation. As humans in videos may appear at different scales and move at various speeds, we apply multi-scale spatial features for 2D joints or keypoints prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints. Furthermore, we design a spatio-temporal discriminator based on body structures as well as limb motions to assess whether the predicted pose forms a valid pose and a valid movement. During training, we explicitly mask out some keypoints to simulate various occlusion cases, from minor to severe occlusion, so that our network can learn better and become robust to various degrees of occlusion. As there are limited 3D ground truth data, we further utilize 2D video data to inject a semi-supervised learning capability into our network. Experiments on public data sets validate the effectiveness of our method, and our ablation studies show the strengths of our network's individual submodules.
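
The keypoint masking used to simulate occlusion during training can be illustrated with a minimal sketch. The abstract does not say how keypoints are chosen or how masked joints are encoded, so the uniform random choice, the zero placeholder, and the function name `mask_keypoints` are assumptions made for illustration only.

```python
import random


def mask_keypoints(keypoints, num_masked, rng):
    """Hide `num_masked` randomly chosen 2D keypoints.

    Returns (masked_keypoints, visibility). Hidden joints are replaced by
    (0.0, 0.0); visibility holds 1 for kept joints and 0 for hidden ones.
    The zero placeholder is an assumption, not the paper's encoding.
    """
    if not 0 <= num_masked <= len(keypoints):
        raise ValueError("num_masked must be between 0 and the number of keypoints")
    hidden = set(rng.sample(range(len(keypoints)), num_masked))
    masked = [(0.0, 0.0) if i in hidden else kp for i, kp in enumerate(keypoints)]
    visibility = [0 if i in hidden else 1 for i in range(len(keypoints))]
    return masked, visibility


# Hypothetical 5-joint pose in pixel coordinates.
pose = [(10.0, 20.0), (12.0, 35.0), (8.0, 50.0), (15.0, 50.0), (11.0, 70.0)]
rng = random.Random(0)
minor, minor_vis = mask_keypoints(pose, 1, rng)    # minor occlusion
severe, severe_vis = mask_keypoints(pose, 4, rng)  # severe occlusion
```

Varying `num_masked` per training sample is one simple way to cover the range from minor to severe occlusion mentioned in the abstract.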

Fuzzy-Clustering-Based Discriminant Method of Multiple Quadric Surfaces for Noisy and Sparse Range Data

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2010.p0160 ◽

2010 ◽

Vol 14 (2) ◽

pp. 160-166

Author(s):

Hideaki Kawano ◽

Hiroshi Maeda ◽

Norikazu Ikoma

Keyword(s):

Fuzzy Clustering ◽

Stereo Matching ◽

Principal Component ◽

Ground Truth ◽

Stereo Image ◽

Range Data ◽

Multiple Objects ◽

Ground Truth Data ◽

Quadric Surfaces ◽

3D Data

In this paper, a fuzzy-clustering-based discriminant method of multiple quadric surfaces in a scene is proposed. This method is intended for scenes involving multiple objects, where each object is approximated by a primitive model. The proposed method is composed of three steps. In the first step, 3D data is reconstructed using a stereo matching technique from a stereo image whose scene involves multiple objects. Next, the 3D data is divided into single objects by employing Fuzzy c-Means accompanied by Principal Component Analysis (PCA) and a criterion with respect to the number of clusters. Finally, the shape of each object is extracted by Fuzzy c-Varieties with noise clustering. The proposed method was evaluated on some pilot scenes whose ground truth data is known, and it was shown to specify the location and shape of each of the multiple objects very well.
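
The Fuzzy c-Means step in the second stage rests on a soft membership rule, which the following minimal sketch illustrates. It shows only the textbook membership update for a single point; the PCA step, the cluster-number criterion, and Fuzzy c-Varieties with noise clustering are not reproduced. The function name `fcm_memberships` and the sample coordinates are hypothetical.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the Euclidean
    distance from the point to center i and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]


# Hypothetical 3D point lying between two cluster centers.
centers = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
u = fcm_memberships((1.0, 0.0, 0.0), centers)
```

The memberships of a point always sum to one, so a point near a cluster boundary is shared between clusters instead of being forced into one of them.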

An Efficient 3D Human Pose Retrieval and Reconstruction from 2D Image-Based Landmarks

Sensors ◽

10.3390/s21072415 ◽

2021 ◽

Vol 21 (7) ◽

pp. 2415

Author(s):

Hashim Yasin ◽

Björn Krüger

Keyword(s):

Feature Space ◽

Ground Truth ◽

Reconstruction Error ◽

Synthetic Image ◽

Camera Model ◽

Retrieval Task ◽

In The Wild ◽

Internet Images ◽

Human Pose ◽

Image Planes

We propose an efficient and novel architecture for 3D articulated human pose retrieval and reconstruction from 2D landmarks extracted from a 2D synthetic image, an annotated 2D image, an in-the-wild real RGB image or even a hand-drawn sketch. Given 2D joint positions in a single image, we devise a data-driven framework to infer the corresponding 3D human pose. To this end, we first normalize 3D human poses from a Motion Capture (MoCap) dataset by eliminating translation, orientation, and skeleton size discrepancies from the poses and then build a knowledge base by projecting a subset of joints of the normalized 3D poses onto 2D image planes using a variety of virtual cameras. With this approach, we not only transform the 3D pose space to the normalized 2D pose space but also resolve the 2D-3D cross-domain retrieval task efficiently. The proposed architecture searches for poses from a MoCap dataset that are close to a given 2D query pose in a definite feature space made up of specific joint sets. These retrieved poses are then used to construct a weak perspective camera and a final 3D posture under the camera model that minimizes the reconstruction error. To estimate unknown camera parameters, we introduce a nonlinear, two-fold method. We exploit the retrieved similar poses and the viewing directions at which the MoCap dataset was sampled to minimize the projection error. Finally, we evaluate our approach thoroughly on a large number of heterogeneous 2D examples generated synthetically, 2D images with ground truth, a variety of real in-the-wild internet images, and a proof of concept using 2D hand-drawn sketches of human poses. We conduct a pool of experiments to perform a quantitative study on the PARSE dataset. We also show that the proposed system yields competitive, convincing results in comparison to other state-of-the-art methods.
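
The weak perspective camera and the projection error mentioned above can be made concrete with a minimal sketch. It assumes a camera with a single rotation about the vertical axis, a uniform scale, and a 2D translation; the coarse grid search over the viewing angle is a stand-in chosen for brevity and is not the paper's nonlinear two-fold method. All function names and values are hypothetical.

```python
import math


def weak_perspective_project(points3d, scale, yaw, tx=0.0, ty=0.0):
    """Project 3D points with a weak perspective camera.

    Each point is rotated by `yaw` radians about the vertical (y) axis, its
    depth is dropped, and the result is scaled uniformly and translated.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    return [(scale * (c * x + s * z) + tx, scale * y + ty) for x, y, z in points3d]


def mean_projection_error(points3d, points2d, scale, yaw, tx=0.0, ty=0.0):
    """Mean Euclidean distance between projected and observed 2D joints."""
    projected = weak_perspective_project(points3d, scale, yaw, tx, ty)
    return sum(math.dist(p, q) for p, q in zip(projected, points2d)) / len(points2d)


# Hypothetical three-joint pose and a 2D query rendered from a known camera.
pose3d = [(0.0, 1.0, 0.0), (0.5, 0.5, 0.2), (-0.4, 0.5, 0.3)]
query2d = weak_perspective_project(pose3d, scale=100.0, yaw=0.3, tx=320.0, ty=240.0)


def error_at(yaw):
    """Projection error of the query for a candidate viewing angle."""
    return mean_projection_error(pose3d, query2d, 100.0, yaw, 320.0, 240.0)


# Coarse grid over viewing angles; picks the angle with the lowest error.
candidates = [i * 0.1 for i in range(-15, 16)]
best_yaw = min(candidates, key=error_at)
```

In this toy setup the search recovers the angle used to render the query, which mirrors the idea of choosing camera parameters that minimize the projection error of retrieved poses.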

Towards Single 2D Image-Level Self-Supervision for 3D Human Pose and Shape Estimation

Applied Sciences ◽

10.3390/app11209724 ◽

2021 ◽

Vol 11 (20) ◽

pp. 9724

Author(s):

Junuk Cha ◽

Muhammad Saqlain ◽

Changhwa Lee ◽

Seongyeong Lee ◽

Seungeun Lee ◽

...

Keyword(s):

Large Scale ◽

Three Dimensional ◽

Ground Truth ◽

Small Subset ◽

Learning Approaches ◽

Shape Estimation ◽

Camera System ◽

Ground Truth Data ◽

Human Pose ◽

2D Images

Three-dimensional human pose and shape estimation is an important problem in the computer vision community, with numerous applications such as augmented reality, virtual reality, and human-computer interaction. However, training accurate 3D human pose and shape estimators based on deep learning approaches requires a large number of images paired with corresponding 3D ground-truth poses, which are costly to collect. To relieve this constraint, various types of weakly or self-supervised pose estimation approaches have been proposed. Nevertheless, these methods still involve supervision signals that require effort to collect, such as unpaired large-scale 3D ground truth data, a small subset of 3D labeled data, and video priors. Often, they require installing equipment such as a calibrated multi-camera system to acquire strong multi-view priors. In this paper, we propose a self-supervised learning framework for 3D human pose and shape estimation that does not require other forms of supervision signals while using only single 2D images. Our framework inputs single 2D images, estimates human 3D meshes in the intermediate layers, and is trained to solve four types of self-supervision tasks (i.e., three image manipulation tasks and one neural rendering task) whose ground truths are all based on the single 2D images themselves. Through experiments, we demonstrate the effectiveness of our approach on 3D human pose benchmark datasets (i.e., Human3.6M, 3DPW, and LSP), where we present the new state of the art among weakly/self-supervised methods.

StormGraph: A graph-based algorithm for quantitative clustering analysis of diverse single-molecule localization microscopy data

10.1101/515627 ◽

2019 ◽

Author(s):

Joshua M. Scurll ◽

Libin Abraham ◽

Da Wei Zheng ◽

Reza Tafteh ◽

Keng C. Chou ◽

...

Keyword(s):

Single Molecule ◽

Input Parameter ◽

Surface Protein ◽

Ground Truth ◽

Antigen Receptors ◽

Localization Microscopy ◽

3D Data ◽

Microscopy Data ◽

2D And 3D ◽

Single Molecule Localization Microscopy

Clustering of proteins is crucial for many cellular processes and can be imaged at nanoscale resolution using single-molecule localization microscopy (SMLM). Ideally, molecular clustering in regions of interest (ROIs) from SMLM images would be assessed using computational methods that are robust to sample and experimental heterogeneity, account for uncertainties in localization data, can analyze both 2D and 3D data, and have practical computational requirements in terms of time and hardware. While analyzing surface protein clustering on B lymphocytes using SMLM, we encountered limitations with existing cluster analysis methods. This inspired us to develop StormGraph, an algorithm using graph theory and community detection to identify clusters in heterogeneous sets of 2D and 3D SMLM data while accounting for localization uncertainties. StormGraph generates both multi-level and single-level clusterings and can quantify cluster overlap for two-color SMLM data. Importantly, StormGraph automatically determines scale-dependent thresholds from the data using scale-independent input parameters. This makes identical choices of input parameter values suitable for disparate ROIs, eliminating the need to tune parameters for different ROIs in heterogeneous SMLM datasets. We show that StormGraph outperforms existing algorithms at analyzing heterogeneous sets of simulated SMLM ROIs where ground-truth clusters are known. Applying StormGraph to real SMLM data in 2D, we reveal that B-cell antigen receptors (BCRs) reside in a heterogeneous combination of small and large clusters following stimulation, which suggests for the first time that two conflicting models of BCR activation are not mutually exclusive. We also demonstrate application of StormGraph to real two-color and 3D SMLM data.

Unsupervised Object Segmentation Based on Bi-Partitioning Image Model Integrated with Classification

Electronics ◽

10.3390/electronics10182296 ◽

2021 ◽

Vol 10 (18) ◽

pp. 2296

Author(s):

Hyun-Tae Choi ◽

Byung-Woo Hong

Keyword(s):

Image Segmentation ◽

High Performance ◽

Ground Truth ◽

Input Image ◽

Image Model ◽

Ground Truth Data ◽

Weakly Supervised ◽

Unsupervised Image Segmentation ◽

Segmentation Models ◽

Image Mask

The development of convolutional neural networks for deep learning has significantly contributed to image classification and segmentation. High performance in supervised image segmentation requires a large amount of ground-truth data. However, producing these data is costly, so unsupervised approaches are actively being studied. The Mumford–Shah and Chan–Vese models are well-known unsupervised image segmentation models. However, the Mumford–Shah model and the Chan–Vese model cannot separate the foreground and background of the image because they are based on pixel intensities. In this paper, we propose a weakly supervised model for image segmentation based on the segmentation models (Mumford–Shah model and Chan–Vese model) and classification. The segmentation model (i.e., Mumford–Shah model or Chan–Vese model) finds a base image mask for classification, and the classification network uses the mask from the segmentation models. With the classification network, the output mask of the segmentation model changes in the direction that increases the performance of the classification network. In addition, the mask can distinguish the foreground and background of images naturally. Our experiment shows that our segmentation model, integrated with a classifier, can segment the input image into the foreground and the background using only the image’s class label, which is the image-level label.

Geometry-Driven Self-Supervised Method for 3D Human Pose Estimation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6808 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11442-11449 ◽

Cited By ~ 1

Author(s):

Yang Li ◽

Kan Li ◽

Shuai Jiang ◽

Ziyue Zhang ◽

Congzhentao Huang ◽

...

Keyword(s):

Pose Estimation ◽

Ground Truth ◽

Human Pose Estimation ◽

Ground Truth Data ◽

The Neural Network ◽

Geometric Knowledge ◽

Supervised Methods ◽

Human Pose ◽

3D Human Pose Estimation ◽

2D To 3D

The neural-network-based approach for 3D human pose estimation from monocular images has attracted growing interest. However, annotating 3D poses is a labor-intensive and expensive process. In this paper, we propose a novel self-supervised approach to avoid the need for manual annotations. Unlike existing weakly/self-supervised methods that require extra unpaired 3D ground-truth data to alleviate the depth ambiguity problem, our method trains the network relying only on geometric knowledge, without any additional 3D pose annotations. The proposed method follows a two-stage pipeline: 2D pose estimation and 2D-to-3D pose lifting. We design a transform re-projection loss that is an effective way to explore multi-view consistency for training the 2D-to-3D lifting network. In addition, we use the confidences of 2D joints to integrate losses from different views and alleviate the influence of noise caused by the self-occlusion problem. Finally, we design a two-branch training architecture, which helps to preserve the scale information of re-projected 2D poses during training, resulting in accurate 3D pose predictions. We demonstrate the effectiveness of our method on two popular 3D human pose datasets, Human3.6M and MPI-INF-3DHP. The results show that our method significantly outperforms recent weakly/self-supervised approaches.

Assessing Wildfire Burn Severity and Its Relationship with Environmental Factors: A Case Study in Interior Alaska Boreal Forest

Remote Sensing ◽

10.3390/rs13101966 ◽

2021 ◽

Vol 13 (10) ◽

pp. 1966

Author(s):

Christopher W Smith ◽

Santosh K Panda ◽

Uma S Bhatt ◽

Franz J Meyer ◽

Anushree Badola ◽

...

Keyword(s):

Boreal Forest ◽

Ground Truth ◽

Burn Severity ◽

Classification Methods ◽

Spectral Indices ◽

Ground Truth Data ◽

Burn Scar ◽

Interior Alaska ◽

Remote Sensing Methods ◽

The Relationship

In recent years, there have been rapid improvements in both remote sensing methods and satellite image availability that have the potential to massively improve burn severity assessments of the Alaskan boreal forest. In this study, we utilized recent pre- and post-fire Sentinel-2 satellite imagery of the 2019 Nugget Creek and Shovel Creek burn scars located in Interior Alaska to both assess burn severity across the burn scars and test the effectiveness of several remote sensing methods for generating accurate map products: Normalized Difference Vegetation Index (NDVI), Normalized Burn Ratio (NBR), and Random Forest (RF) and Support Vector Machine (SVM) supervised classification. We used 52 Composite Burn Index (CBI) plots from the Shovel Creek burn scar and 28 from the Nugget Creek burn scar for training classifiers and product validation. For the Shovel Creek burn scar, the RF and SVM machine learning (ML) classification methods outperformed the traditional spectral indices that use linear regression to separate burn severity classes (RF and SVM accuracy, 83.33%, versus NBR accuracy, 73.08%). However, for the Nugget Creek burn scar, the NDVI product (accuracy: 96%) outperformed the other indices and ML classifiers. In this study, we demonstrated that when sufficient ground truth data is available, the ML classifiers can be very effective for reliable mapping of burn severity in the Alaskan boreal forest. Since the performance of ML classifiers is dependent on the quantity of ground truth data, the ML classification methods would be better at assessing burn severity when sufficient ground truth data is available, whereas the traditional spectral indices would be better suited when ground truth data is limited. We also looked at the relationship between burn severity, fuel type, and topography (aspect and slope) and found that the relationship is site-dependent.
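
The two spectral indices named above have standard definitions, sketched here for a single pixel. The abstract does not state which Sentinel-2 bands were used or whether a differenced NBR was computed, so the band roles (near-infrared, red, shortwave-infrared), the `dnbr` helper, and the reflectance values are assumptions for illustration.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def nbr(nir, swir):
    """Normalized Burn Ratio: (NIR - SWIR) / (NIR + SWIR)."""
    return normalized_difference(nir, swir)


def dnbr(pre_nir, pre_swir, post_nir, post_swir):
    """Differenced NBR: pre-fire NBR minus post-fire NBR."""
    return nbr(pre_nir, pre_swir) - nbr(post_nir, post_swir)


# Hypothetical surface reflectances for one pixel before and after a fire.
pre = {"nir": 0.5, "red": 0.1, "swir": 0.1}
post = {"nir": 0.2, "red": 0.15, "swir": 0.3}

pre_ndvi = ndvi(pre["nir"], pre["red"])
post_nbr = nbr(post["nir"], post["swir"])
severity_signal = dnbr(pre["nir"], pre["swir"], post["nir"], post["swir"])
```

Healthy vegetation reflects strongly in the near-infrared and weakly in the shortwave-infrared, so a drop in NBR after a fire yields a positive differenced value for burned pixels.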

Automatic Evaluation of Wheat Resistance to Fusarium Head Blight Using Dual Mask-RCNN Deep Learning Frameworks in Computer Vision

Remote Sensing ◽

10.3390/rs13010026 ◽

2020 ◽

Vol 13 (1) ◽

pp. 26

Author(s):

Wen-Hao Su ◽

Jiajing Zhang ◽

Ce Yang ◽

Rae Page ◽

Tamas Szinyei ◽

...

Keyword(s):

Fusarium Head Blight ◽

Ground Truth ◽

Wheat Breeding ◽

Head Blight ◽

Detection Rates ◽

Ground Truth Data ◽

Resistant Cultivars ◽

Feature Pyramid ◽

Rater Error ◽

Wheat Lines

In many regions of the world, wheat is vulnerable to severe yield and quality losses from the fungus disease of Fusarium head blight (FHB). The development of resistant cultivars is one means of ameliorating the devastating effects of this disease, but the breeding process requires the evaluation of hundreds of lines each year for reaction to the disease. These field evaluations are laborious, expensive, time-consuming, and are prone to rater error. A phenotyping cart that can quickly capture images of the spikes of wheat lines and their level of FHB infection would greatly benefit wheat breeding programs. In this study, mask region convolutional neural network (Mask-RCNN) allowed for reliable identification of the symptom location and the disease severity of wheat spikes. Within a wheat line planted in the field, color images of individual wheat spikes and their corresponding diseased areas were labeled and segmented into sub-images. Images with annotated spikes and sub-images of individual spikes with labeled diseased areas were used as ground truth data to train Mask-RCNN models for automatic image segmentation of wheat spikes and FHB diseased areas, respectively. The feature pyramid network (FPN) based on ResNet-101 network was used as the backbone of Mask-RCNN for constructing the feature pyramid and extracting features. After generating mask images of wheat spikes from full-size images, Mask-RCNN was performed to predict diseased areas on each individual spike. This protocol enabled the rapid recognition of wheat spikes and diseased areas with the detection rates of 77.76% and 98.81%, respectively. The prediction accuracy of 77.19% was achieved by calculating the ratio of the wheat FHB severity value of prediction over ground truth. This study demonstrates the feasibility of rapidly determining levels of FHB in wheat spikes, which will greatly facilitate the breeding of resistant cultivars.

Classification of Cattle Behaviours Using Neck-Mounted Accelerometer-Equipped Collars and Convolutional Neural Networks

Sensors ◽

10.3390/s21124050 ◽

2021 ◽

Vol 21 (12) ◽

pp. 4050

Author(s):

Dejan Pavlovic ◽

Christopher Davison ◽

Andrew Hamilton ◽

Oskar Marko ◽

Robert Atkinson ◽

...

Keyword(s):

Neural Network ◽

Model Performance ◽

Ground Truth ◽

Practical Implementation ◽

Ground Truth Data ◽

Battery Lifetime ◽

Implementation Challenges ◽

Memory Footprint ◽

Commercial Farms ◽

Using Data

Monitoring cattle behaviour is core to the early detection of health and welfare issues and to optimising the fertility of large herds. Accelerometer-based sensor systems that provide activity profiles are now used extensively on commercial farms and have evolved to identify behaviours such as the time spent ruminating and eating at an individual animal level. Acquiring this information at scale is central to informing on-farm management decisions. The paper presents the development of a Convolutional Neural Network (CNN) that classifies cattle behavioural states (‘rumination’, ‘eating’ and ‘other’) using data generated from neck-mounted accelerometer collars. During three farm trials in the United Kingdom (Easter Howgate Farm, Edinburgh, UK), 18 steers were monitored to provide raw acceleration measurements, with ground truth data provided by muzzle-mounted pressure sensor halters. A range of neural network architectures are explored and rigorous hyper-parameter searches are performed to optimise the network. The computational complexity and memory footprint of CNN models are not readily compatible with deployment on low-power processors, which are both memory and energy constrained. Thus, progressive reductions of the CNN were executed with minimal loss of performance in order to address the practical implementation challenges, defining the trade-off between model performance and computational complexity and memory footprint to permit deployment on micro-controller architectures. The proposed methodology achieves a compression of 14.30 compared to the unpruned architecture but is nevertheless able to accurately classify cattle behaviours with an overall F1 score of 0.82 for both FP32 and FP16 precision while achieving a reasonable battery lifetime in excess of 5.7 years.
