scholarly journals A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera

Sensors ◽  
2020 ◽  
Vol 20 (7) ◽  
pp. 1825 ◽  
Author(s):  
Huy Hieu Pham ◽  
Houssam Salmane ◽  
Louahdi Khoudour ◽  
Alain Crouzil ◽  
Sergio A. Velastin ◽  
...  

We present a deep learning-based multitask framework for joint 3D human pose estimation and action recognition from RGB sensors using simple cameras. The approach proceeds along two stages. In the first, a real-time 2D pose detector is run to determine the precise pixel location of important keypoints of the human body. A two-stream deep neural network is then designed and trained to map detected 2D keypoints into 3D poses. In the second stage, the Efficient Neural Architecture Search (ENAS) algorithm is deployed to find an optimal network architecture that is used for modeling the spatio-temporal evolution of the estimated 3D poses via an image-based intermediate representation and performing action recognition. Experiments on Human3.6M, MSR Action3D and SBU Kinect Interaction datasets verify the effectiveness of the proposed method on the targeted tasks. Moreover, we show that the method requires a low computational budget for training and inference. In particular, the experimental results show that by using a monocular RGB sensor, we can develop a 3D pose estimation and human action recognition approach that reaches the performance of RGB-depth sensors. This opens up many opportunities for leveraging RGB cameras (which are much cheaper than depth cameras and extensively deployed in private and public places) to build intelligent recognition systems.

Inventions ◽  
2019 ◽  
Vol 4 (1) ◽  
pp. 9 ◽  
Author(s):  
Panagiotis Barmpoutis ◽  
Tania Stathaki ◽  
Stephanos Camarinopoulos

Given the broad range of applications from video surveillance to human–computer interaction, human action learning and recognition analysis based on 3D skeleton data are currently a popular area of research. In this paper, we propose a method for action recognition using depth sensors and representing the skeleton time series sequences as higher-order sparse structure tensors to exploit the dependencies among skeleton joints and to overcome the limitations of methods that use joint coordinates as input signals. To this end, we estimate their decompositions based on randomized subspace iteration that enables the computation of singular values and vectors of large sparse matrices with high accuracy. Specifically, we attempt to extract different feature representations containing spatio-temporal complementary information and extracting the mode-n singular values with regards to the correlations of skeleton joints. Then, the extracted features are combined using discriminant correlation analysis, and a neural network is used to recognize the action patterns. The experimental results presented use three widely used action datasets and confirm the great potential of the proposed action learning and recognition method.


Author(s):  
Mohammad Farhad Bulbul ◽  
Yunsheng Jiang ◽  
Jinwen Ma

The emerging cost-effective depth sensors have facilitated the action recognition task significantly. In this paper, the authors address the action recognition problem using depth video sequences combining three discriminative features. More specifically, the authors generate three Depth Motion Maps (DMMs) over the entire video sequence corresponding to the front, side, and top projection views. Contourlet-based Histogram of Oriented Gradients (CT-HOG), Local Binary Patterns (LBP), and Edge Oriented Histograms (EOH) are then computed from the DMMs. To merge these features, the authors consider decision-level fusion, where a soft decision-fusion rule, Logarithmic Opinion Pool (LOGP), is used to combine the classification outcomes from multiple classifiers each with an individual set of features. Experimental results on two datasets reveal that the fusion scheme achieves superior action recognition performance over the situations when using each feature individually.


2020 ◽  
Vol 79 (17-18) ◽  
pp. 12349-12371
Author(s):  
Qingshan She ◽  
Gaoyuan Mu ◽  
Haitao Gan ◽  
Yingle Fan

2020 ◽  
Vol 10 (12) ◽  
pp. 4412
Author(s):  
Ammar Mohsin Butt ◽  
Muhammad Haroon Yousaf ◽  
Fiza Murtaza ◽  
Saima Nazir ◽  
Serestina Viriri ◽  
...  

Human action recognition has gathered significant attention in recent years due to its high demand in various application domains. In this work, we propose a novel codebook generation and hybrid encoding scheme for classification of action videos. The proposed scheme develops a discriminative codebook and a hybrid feature vector by encoding the features extracted from CNNs (convolutional neural networks). We explore different CNN architectures for extracting spatio-temporal features. We employ an agglomerative clustering approach for codebook generation, which intends to combine the advantages of global and class-specific codebooks. We propose a Residual Vector of Locally Aggregated Descriptors (R-VLAD) and fuse it with locality-based coding to form a hybrid feature vector. It provides a compact representation along with high order statistics. We evaluated our work on two publicly available standard benchmark datasets HMDB-51 and UCF-101. The proposed method achieves 72.6% and 96.2% on HMDB51 and UCF101, respectively. We conclude that the proposed scheme is able to boost recognition accuracy for human action recognition.


Sign in / Sign up

Export Citation Format

Share Document