A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera

Huy Hieu Pham; Houssam Salmane; Louahdi Khoudour; Alain Crouzil; Sergio A. Velastin; Pablo Zegers

doi:10.3390/s20071825

Sensors; 10.3390/s20071825; 2020; Vol 20 (7); pp. 1825; Cited By ~ 5

Author(s): Huy Hieu Pham; Houssam Salmane; Louahdi Khoudour; Alain Crouzil; Sergio A. Velastin; ...

Keyword(s): Action Recognition; Pose Estimation; Network Architecture; Human Action Recognition; Human Action; 3D Pose Estimation; Depth Cameras; Depth Sensors; Private And Public; Spatio Temporal

We present a deep learning-based multitask framework for joint 3D human pose estimation and action recognition from a single RGB camera. The approach proceeds in two stages. In the first, a real-time 2D pose detector determines the pixel locations of the key joints of the human body, and a two-stream deep neural network is then trained to map the detected 2D keypoints to 3D poses. In the second stage, the Efficient Neural Architecture Search (ENAS) algorithm is used to find an optimal network architecture that models the spatio-temporal evolution of the estimated 3D poses via an image-based intermediate representation and performs action recognition. Experiments on the Human3.6M, MSR Action3D, and SBU Kinect Interaction datasets verify the effectiveness of the proposed method on both tasks. Moreover, we show that the method requires a low computational budget for training and inference. In particular, the experimental results show that, with a monocular RGB sensor, 3D pose estimation and human action recognition can reach the performance of approaches based on RGB-depth sensors. This opens up many opportunities to build intelligent recognition systems with RGB cameras, which are much cheaper than depth cameras and are extensively deployed in private and public places.
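The abstract does not spell out the image-based intermediate representation. As a rough illustration only, the sketch below encodes a sequence of 3D poses as a small RGB image: each joint becomes a row, each frame a column, and the (x, y, z) coordinates are min-max normalized into the (R, G, B) channels. The function name `poses_to_image` and the normalization scheme are assumptions made for this example, not the encoding used in the paper.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) in arbitrary units
Frame = List[Joint]                  # one pose: a list of joints
Pixel = Tuple[int, ...]              # (R, G, B), each in 0..255


def poses_to_image(sequence: List[Frame]) -> List[List[Pixel]]:
    """Encode a pose sequence as an image: rows = joints, columns = frames.

    Hypothetical helper for illustration; not the paper's representation.
    """
    if not sequence or not sequence[0]:
        raise ValueError("sequence must contain at least one non-empty frame")
    n_joints = len(sequence[0])
    if any(len(frame) != n_joints for frame in sequence):
        raise ValueError("every frame must have the same number of joints")

    # Global min and max over all coordinates, used for min-max normalization.
    values = [c for frame in sequence for joint in frame for c in joint]
    lo, hi = min(values), max(values)
    span = hi - lo

    def to_byte(v: float) -> int:
        # Degenerate case: all coordinates are equal, so map everything to 0.
        return 0 if span == 0 else round(255 * (v - lo) / span)

    return [
        [tuple(to_byte(c) for c in frame[j]) for frame in sequence]
        for j in range(n_joints)
    ]


if __name__ == "__main__":
    # Two frames, two joints each.
    seq = [
        [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
        [(0.5, 0.0, 1.0), (2.0, 2.0, 2.0)],
    ]
    for row in poses_to_image(seq):
        print(row)
```

An image of this kind can be fed to an ordinary convolutional classifier, which is one plausible reason for choosing an image-like encoding of skeleton sequences.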

Skeleton-Based Human Action Recognition through Third-Order Tensor Representation and Spatio-Temporal Analysis

Inventions; 10.3390/inventions4010009; 2019; Vol 4 (1); pp. 9; Cited By ~ 2

Author(s): Panagiotis Barmpoutis; Tania Stathaki; Stephanos Camarinopoulos

Keyword(s): Action Recognition; Sparse Matrices; Action Learning; Human Action Recognition; Singular Values; Temporal Analysis; Human Action; Depth Sensors; Subspace Iteration; Spatio Temporal

Given its broad range of applications, from video surveillance to human–computer interaction, human action learning and recognition based on 3D skeleton data is currently a popular area of research. In this paper, we propose a method for action recognition using depth sensors that represents skeleton time series as higher-order sparse structure tensors, in order to exploit the dependencies among skeleton joints and to overcome the limitations of methods that use joint coordinates as input signals. To this end, we estimate the tensor decompositions using randomized subspace iteration, which enables the computation of singular values and vectors of large sparse matrices with high accuracy. Specifically, we extract different feature representations containing complementary spatio-temporal information, together with the mode-n singular values with regard to the correlations of skeleton joints. The extracted features are then combined using discriminant correlation analysis, and a neural network is used to recognize the action patterns. Experimental results on three widely used action datasets confirm the potential of the proposed action learning and recognition method.
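To make the two ingredients named in this abstract more concrete, here is a minimal NumPy sketch of a mode-n unfolding followed by randomized subspace iteration to approximate the leading singular values. It assumes a dense tensor of shape (frames, joints, 3); the function names, the oversampling value, and the iteration count are choices made for this example and are not taken from the paper.

```python
import numpy as np


def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n unfolding: rows index the chosen mode, columns all other modes.

    The column ordering may differ from other conventions, but the singular
    values of the unfolding do not depend on it.
    """
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def randomized_singular_values(
    A: np.ndarray, rank: int, n_oversample: int = 5, n_iter: int = 4, seed: int = 0
) -> np.ndarray:
    """Approximate the top `rank` singular values of a 2D array A."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + n_oversample, m, n)
    # Random start: sample the range of A with a Gaussian test matrix.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k)))
    # Subspace iteration, re-orthonormalizing after every product for stability.
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)
    # Project A onto the captured subspace and take an exact SVD of the small matrix.
    B = Q.T @ A
    return np.linalg.svd(B, compute_uv=False)[:rank]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    skeleton = rng.standard_normal((30, 20, 3))  # synthetic (frames, joints, xyz)
    for mode in range(3):
        M = unfold(skeleton, mode)
        r = min(2, min(M.shape))
        print(mode, M.shape, randomized_singular_values(M, r))
```

The projection step is what keeps the cost low: the exact SVD is only computed on a matrix with `k` rows, however large the unfolding is.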

Improvement of Human Action Recognition Using 3D Pose Estimation

Smart Innovation, Systems and Technologies - Activity and Behavior Computing; 10.1007/978-981-15-8944-7_2; 2020; pp. 21-37

Author(s): Kohei Adachi; Paula Lago; Tsuyoshi Okita; Sozo Inoue

Keyword(s): Action Recognition; Pose Estimation; Human Action Recognition; Human Action; 3D Pose Estimation

Human action recognition based on spatio-temporal three-dimensional scattering transform descriptor and an improved VLAD feature encoding algorithm

Neurocomputing; 10.1016/j.neucom.2018.05.121; 2019; Vol 348; pp. 145-157; Cited By ~ 1

Author(s): Bo Lin; Bin Fang; Weibin Yang; Jiye Qian

Keyword(s): Action Recognition; Three Dimensional; Human Action Recognition; Human Action; Scattering Transform; Feature Encoding; Spatio Temporal

View-Robust Human Action Recognition Based on Spatio-Temporal Self Similarities

Journal of Mechanics of Continua and Mathematical Sciences; 10.26782/jmcms.2020.01.00010; 2020; Vol 15 (1)

Author(s): K. Pradeep Reddy

Keyword(s): Action Recognition; Human Action Recognition; Human Action; Spatio Temporal

Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos

MultiMedia Modeling - Lecture Notes in Computer Science; 10.1007/978-3-319-51811-4_30; 2016; pp. 365-378; Cited By ~ 13

Author(s): Ionut C. Duta; Bogdan Ionescu; Kiyoharu Aizawa; Nicu Sebe

Keyword(s): Action Recognition; Human Action Recognition; Human Action; Spatio Temporal

DMMs-Based Multiple Features Fusion for Human Action Recognition

International Journal of Multimedia Data Engineering and Management; 10.4018/ijmdem.2015100102; 2015; Vol 6 (4); pp. 23-39; Cited By ~ 18

Author(s): Mohammad Farhad Bulbul; Yunsheng Jiang; Jinwen Ma

Keyword(s): Action Recognition; Recognition Performance; Recognition Task; Human Action Recognition; Fusion Rule; Local Binary Patterns; Human Action; Decision Fusion; Soft Decision; Depth Sensors

Emerging cost-effective depth sensors have significantly facilitated the action recognition task. In this paper, the authors address the action recognition problem on depth video sequences by combining three discriminative features. More specifically, the authors generate three Depth Motion Maps (DMMs) over the entire video sequence, corresponding to the front, side, and top projection views. Contourlet-based Histogram of Oriented Gradients (CT-HOG), Local Binary Patterns (LBP), and Edge Oriented Histograms (EOH) are then computed from the DMMs. To merge these features, the authors use decision-level fusion, in which a soft decision-fusion rule, the Logarithmic Opinion Pool (LOGP), combines the classification outcomes of multiple classifiers, each using an individual set of features. Experimental results on two datasets show that the fusion scheme achieves better action recognition performance than any of the features used individually.
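The decision-level fusion step can be sketched as follows. In a logarithmic opinion pool, the fused score of a class is the weighted sum of the log-posteriors reported by the individual classifiers, which amounts to a weighted geometric mean of their probabilities. The sketch assumes uniform weights by default and renormalizes the result; the function name `logp_fuse` and the clamping constant `eps` are illustrative choices, and the abstract does not state which weights the authors use.

```python
import math
from typing import List, Optional, Sequence


def logp_fuse(
    posteriors: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
    eps: float = 1e-12,
) -> List[float]:
    """Fuse per-classifier class posteriors with a logarithmic opinion pool.

    posteriors[q][c] is the probability classifier q assigns to class c.
    """
    n_classifiers = len(posteriors)
    if n_classifiers == 0:
        raise ValueError("need at least one classifier")
    if weights is None:
        # Assumption: uniform weights, one per classifier.
        weights = [1.0 / n_classifiers] * n_classifiers
    n_classes = len(posteriors[0])

    # Weighted sum of log-probabilities per class (clamped to avoid log(0)).
    log_scores = [
        sum(w * math.log(max(p[c], eps)) for w, p in zip(weights, posteriors))
        for c in range(n_classes)
    ]

    # Exponentiate and renormalize so the fused scores sum to one.
    peak = max(log_scores)
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Three classifiers (for example one per feature type), two classes.
    fused = logp_fuse([[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]])
    print(fused, "-> predicted class", fused.index(max(fused)))
```

Because the pool multiplies probabilities, a single classifier that assigns a class near-zero probability can veto it, which is a known property of this fusion rule and differs from simple averaging.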

Spatio-temporal SRU with global context-aware attention for 3D human action recognition

Multimedia Tools and Applications; 10.1007/s11042-019-08587-w; 2020; Vol 79 (17-18); pp. 12349-12371

Author(s): Qingshan She; Gaoyuan Mu; Haitao Gan; Yingle Fan

Keyword(s): Action Recognition; Human Action Recognition; Human Action; Context Aware; Global Context; Spatio Temporal

Agglomerative Clustering and Residual-VLAD Encoding for Human Action Recognition

Applied Sciences; 10.3390/app10124412; 2020; Vol 10 (12); pp. 4412

Author(s): Ammar Mohsin Butt; Muhammad Haroon Yousaf; Fiza Murtaza; Saima Nazir; Serestina Viriri; ...

Keyword(s): Action Recognition; Feature Vector; Human Action Recognition; Human Action; Compact Representation; Agglomerative Clustering; Residual Vector; Benchmark Datasets; Codebook Generation; Spatio Temporal

Human action recognition has attracted significant attention in recent years because of high demand in various application domains. In this work, we propose a novel codebook generation and hybrid encoding scheme for the classification of action videos. The proposed scheme builds a discriminative codebook and a hybrid feature vector by encoding features extracted from convolutional neural networks (CNNs). We explore different CNN architectures for extracting spatio-temporal features. We employ an agglomerative clustering approach for codebook generation, which aims to combine the advantages of global and class-specific codebooks. We propose a Residual Vector of Locally Aggregated Descriptors (R-VLAD) and fuse it with locality-based coding to form a hybrid feature vector, which provides a compact representation along with high-order statistics. We evaluated our work on two publicly available standard benchmark datasets, HMDB-51 and UCF-101. The proposed method achieves 72.6% and 96.2% on HMDB-51 and UCF-101, respectively. We conclude that the proposed scheme is able to boost accuracy for human action recognition.
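For readers unfamiliar with the encoding family this abstract builds on, the sketch below shows plain VLAD: each local descriptor is assigned to its nearest codeword, the residuals (descriptor minus codeword) are summed per codeword, and the concatenated sums are L2-normalized. This is the standard baseline, not the R-VLAD variant or the hybrid scheme proposed in the paper, whose details the abstract does not give; the function name and the toy codebook are assumptions for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def vlad_encode(descriptors: Sequence[Vector], codebook: Sequence[Vector]) -> List[float]:
    """Encode a set of local descriptors as one VLAD vector of length K * D."""
    if not codebook:
        raise ValueError("codebook must not be empty")
    dim = len(codebook[0])
    residual_sums = [[0.0] * dim for _ in codebook]

    for d in descriptors:
        # Hard assignment: index of the codeword closest in squared Euclidean distance.
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((x - c) ** 2 for x, c in zip(d, codebook[k])),
        )
        # Accumulate the residual between the descriptor and its codeword.
        for i in range(dim):
            residual_sums[nearest][i] += d[i] - codebook[nearest][i]

    # Concatenate the per-codeword sums and L2-normalize the result.
    flat = [v for row in residual_sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return flat if norm == 0 else [v / norm for v in flat]


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (10.0, 10.0)]           # toy codebook with K = 2
    descriptors = [(1.0, 0.0), (0.0, 1.0), (10.0, 12.0)]
    print(vlad_encode(descriptors, codebook))
```

In practice the codebook would come from clustering training descriptors; the abstract mentions agglomerative clustering for that step, whereas k-means is the more common default elsewhere.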

Spatio-temporal feature extraction and representation for RGB-D human action recognition

Pattern Recognition Letters; 10.1016/j.patrec.2014.03.024; 2014; Vol 50; pp. 139-148; Cited By ~ 36

Author(s): Jiajia Luo; Wei Wang; Hairong Qi

Keyword(s): Feature Extraction; Action Recognition; Human Action Recognition; Human Action; Spatio Temporal; Temporal Feature

Study of Human Action Recognition Based on Improved Spatio-temporal Features

International Journal of Automation and Computing; 10.1007/s11633-014-0831-4; 2014; Vol 11 (5); pp. 500-509; Cited By ~ 12

Author(s): Xiao-Fei Ji; Qian-Qian Wu; Zhao-Jie Ju; Yang-Yang Wang

Keyword(s): Action Recognition; Human Action Recognition; Human Action; Temporal Features; Spatio Temporal
