Dual-Stream Structured Graph Convolution Network for Skeleton-Based Action Recognition

Chunyan Xu; Rong Liu; Tong Zhang; Zhen Cui; Jian Yang; Chunlong Hu

doi:10.1145/3450410

ACM Transactions on Multimedia Computing, Communications, and Applications ◽

10.1145/3450410 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-22

Author(s):

Chunyan Xu ◽

Rong Liu ◽

Tong Zhang ◽

Zhen Cui ◽

Jian Yang ◽

...

Keyword(s):

Action Recognition ◽

Temporal Dynamics ◽

Recognition Task ◽

Human Movement ◽

Body Part ◽

Human Action ◽

Human Motion ◽

Body Parts ◽

Dynamic Interactions ◽

Spatio Temporal

In this work, we propose a dual-stream structured graph convolution network (DS-SGCN) to solve the skeleton-based action recognition problem. The spatio-temporal coordinates and appearance contexts of the skeletal joints are jointly integrated into the graph convolution learning process on both the video and skeleton modalities. To effectively represent the skeletal graph of discrete joints, we create a structured graph convolution module specifically designed to encode partitioned body parts along with their dynamic interactions in the spatio-temporal sequence. In more detail, we build a set of structured intra-part graphs, each of which represents a distinct body part (e.g., left arm, right leg, head). An inter-part graph is then constructed to model the dynamic interactions across different body parts; each node corresponds to one of the intra-part graphs, while an edge between two nodes expresses the internal relationships of human movement. We apply graph convolution learning to both the intra- and inter-part graphs in order to obtain the inherent characteristics and dynamic interactions, respectively, of human action. After integrating the intra- and inter-part levels of spatial context/coordinate cues, a convolution filtering process is conducted on time slices to capture the temporal dynamics of human motion. Finally, we fuse the two streams of graph convolution responses to predict the action category in an end-to-end fashion. Comprehensive experiments on five single/multi-modal benchmark datasets (NTU RGB+D 60, NTU RGB+D 120, MSR-Daily 3D, N-UCLA, and HDM05) demonstrate that the proposed DS-SGCN framework achieves encouraging performance on the skeleton-based action recognition task.
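The two-level structure described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the layer definition, normalization, or pooling, so the row-normalized convolution, the mean pooling of each part into one node, the toy joints, and the identity weights below are all assumptions chosen for illustration. The temporal filtering and the second (appearance) stream are omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(adj, feats, weight):
    """One graph convolution layer: ReLU(D^-1 (A + I) X W) (assumed form)."""
    n = len(adj)
    with_self = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    norm = [[v / sum(row) for v in row] for row in with_self]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def mean_pool(rows):
    """Average node features into a single vector (assumed pooling)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def structured_graph_conv(joints, parts, part_adj, w_intra, w_inter):
    """Intra-part convolution per body part, then inter-part convolution.

    parts maps a part name to (joint indices, adjacency among those joints).
    """
    part_nodes = []
    for idx, adj in parts.values():
        hidden = graph_conv(adj, [joints[i] for i in idx], w_intra)
        part_nodes.append(mean_pool(hidden))  # one node per body part
    return part_nodes, graph_conv(part_adj, part_nodes, w_inter)


# Toy single-frame skeleton: 5 joints with 2-D coordinates (hypothetical data).
joints = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
parts = {
    "arm": ([0, 1, 2], [[0, 1, 0], [1, 0, 1], [0, 1, 0]]),
    "leg": ([3, 4], [[0, 1], [1, 0]]),
}
part_adj = [[0, 1], [1, 0]]          # the two parts interact
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights

part_nodes, inter_out = structured_graph_conv(joints, parts, part_adj, identity, identity)
```

Here `part_nodes` holds one pooled feature vector per body part, and `inter_out` holds those vectors after they have been mixed along the inter-part edges.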

A Hierarchical Model for Human Action Recognition From Body-Parts

IEEE Transactions on Circuits and Systems for Video Technology ◽

10.1109/tcsvt.2018.2871660 ◽

2019 ◽

Vol 29 (10) ◽

pp. 2986-3000 ◽

Cited By ~ 1

Author(s):

Zhanpeng Shao ◽

Youfu Li ◽

Yao Guo ◽

Xiaolong Zhou ◽

Shengyong Chen

Keyword(s):

Action Recognition ◽

Hierarchical Model ◽

Human Action Recognition ◽

Human Action ◽

Body Parts

Human action recognition based on spatio-temporal three-dimensional scattering transform descriptor and an improved VLAD feature encoding algorithm

Neurocomputing ◽

10.1016/j.neucom.2018.05.121 ◽

2019 ◽

Vol 348 ◽

pp. 145-157 ◽

Cited By ~ 1

Author(s):

Bo Lin ◽

Bin Fang ◽

Weibin Yang ◽

Jiye Qian

Keyword(s):

Action Recognition ◽

Three Dimensional ◽

Human Action Recognition ◽

Human Action ◽

Scattering Transform ◽

Feature Encoding ◽

Spatio Temporal

View-Robust Human Action Recognition Based on Spatio-Temporal Self Similarities

Journal of Mechanics of Continua and Mathematical Sciences ◽

10.26782/jmcms.2020.01.00010 ◽

2020 ◽

Vol 15 (1) ◽

Author(s):

K. Pradeep Reddy

Keyword(s):

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Spatio Temporal

Spatio-Temporal VLAD Encoding for Human Action Recognition in Videos

MultiMedia Modeling - Lecture Notes in Computer Science ◽

10.1007/978-3-319-51811-4_30 ◽

2016 ◽

pp. 365-378 ◽

Cited By ~ 13

Author(s):

Ionut C. Duta ◽

Bogdan Ionescu ◽

Kiyoharu Aizawa ◽

Nicu Sebe

Keyword(s):

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Spatio Temporal

DMMs-Based Multiple Features Fusion for Human Action Recognition

International Journal of Multimedia Data Engineering and Management ◽

10.4018/ijmdem.2015100102 ◽

2015 ◽

Vol 6 (4) ◽

pp. 23-39 ◽

Cited By ~ 18

Author(s):

Mohammad Farhad Bulbul ◽

Yunsheng Jiang ◽

Jinwen Ma

Keyword(s):

Action Recognition ◽

Recognition Performance ◽

Recognition Task ◽

Human Action Recognition ◽

Fusion Rule ◽

Local Binary Patterns ◽

Human Action ◽

Decision Fusion ◽

Soft Decision ◽

Depth Sensors

Emerging cost-effective depth sensors have significantly facilitated the action recognition task. In this paper, the authors address the action recognition problem using depth video sequences, combining three discriminative features. More specifically, the authors generate three Depth Motion Maps (DMMs) over the entire video sequence, corresponding to the front, side, and top projection views. Contourlet-based Histogram of Oriented Gradients (CT-HOG), Local Binary Patterns (LBP), and Edge Oriented Histograms (EOH) are then computed from the DMMs. To merge these features, the authors consider decision-level fusion, where a soft decision-fusion rule, the Logarithmic Opinion Pool (LOGP), is used to combine the classification outcomes from multiple classifiers, each trained on an individual set of features. Experimental results on two datasets reveal that the fusion scheme achieves better action recognition performance than using each feature individually.
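A logarithmic opinion pool combines classifier posteriors multiplicatively: the fused probability of a class is proportional to the weighted product of the individual posteriors. The sketch below shows the rule in isolation. The uniform weights, the `eps` clipping, and the example posteriors are assumptions for illustration, not values from the paper.

```python
import math


def logp_fuse(posteriors, weights=None, eps=1e-12):
    """Fuse per-classifier class posteriors with a logarithmic opinion pool.

    posteriors: one probability list per classifier, all over the same classes.
    weights: per-classifier exponents; uniform (1/K) when omitted (assumption).
    """
    k = len(posteriors)
    if weights is None:
        weights = [1.0 / k] * k
    n_classes = len(posteriors[0])
    # Weighted sum of log-probabilities per class; eps guards against log(0).
    log_scores = [
        sum(w * math.log(max(p[c], eps)) for w, p in zip(weights, posteriors))
        for c in range(n_classes)
    ]
    shift = max(log_scores)  # subtract the maximum for numerical stability
    unnorm = [math.exp(s - shift) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical posteriors from three classifiers over two classes.
posteriors = [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]]
fused = logp_fuse(posteriors)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case two of the three classifiers favor class 0, yet the pool selects class 1, because the product 0.4 × 0.3 × 0.8 = 0.096 exceeds 0.6 × 0.7 × 0.2 = 0.084. A single confident classifier can therefore outweigh a lukewarm majority, which is the main behavioral difference from hard voting.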

Using a Multilearner to Fuse Multimodal Features for Human Action Recognition

Mathematical Problems in Engineering ◽

10.1155/2020/4358728 ◽

2020 ◽

Vol 2020 ◽

pp. 1-18

Author(s):

Chao Tang ◽

Huosheng Hu ◽

Wenjian Wang ◽

Wei Li ◽

Hua Peng ◽

...

Keyword(s):

Action Recognition ◽

Feature Fusion ◽

Human Action Recognition ◽

Human Action ◽

Human Motion ◽

Image Features ◽

Depth Image ◽

Good Ability ◽

Multimodal Features ◽

Feature Based

The representation and selection of action features directly affect the performance of human action recognition methods. A single feature is often affected by human appearance, environment, camera settings, and other factors. To address the problem that existing multimodal feature fusion methods cannot effectively measure the contribution of different features, this paper proposes a human action recognition method based on RGB-D image features, which makes full use of the multimodal information provided by RGB-D sensors to extract effective human action features. Three kinds of human action features with different modal information are proposed: an RGB-HOG feature based on RGB image information, which has good geometric scale invariance; a D-STIP feature based on the depth image, which maintains the dynamic characteristics of human motion and has local invariance; and an S-JRPF feature based on skeleton information, which describes the spatial structure of motion well. In addition, multiple K-nearest neighbor classifiers with better generalization ability are combined to make the final classification decision. The experimental results show that the algorithm achieves ideal recognition results on the public G3D and CAD60 datasets.
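One simple way to combine several K-nearest neighbor classifiers is to run one per modality and add up their neighbor votes. The abstract does not say how the paper weights or integrates the classifiers, so the unweighted vote sum below is an assumption. The modality names, the one-dimensional features, and the labels are hypothetical.

```python
from collections import Counter


def knn_votes(train, query, k):
    """Return label counts among the k training samples nearest to query.

    train: list of (feature_vector, label) pairs; distance is squared Euclidean.
    """
    ranked = sorted(train, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], query)))
    return Counter(label for _, label in ranked[:k])


def fuse_knn(train_by_modality, query_by_modality, k=3):
    """Sum the k-NN votes of one classifier per modality; return the top label."""
    total = Counter()
    for name, train in train_by_modality.items():
        total += knn_votes(train, query_by_modality[name], k)
    return total.most_common(1)[0][0], total


# Hypothetical one-dimensional features for three modalities.
train_by_modality = {
    "rgb": [([0.0], "wave"), ([0.1], "wave"), ([1.0], "sit"), ([1.1], "sit")],
    "depth": [([0.0], "wave"), ([1.0], "sit"), ([1.2], "sit"), ([0.9], "sit")],
    "skeleton": [([0.0], "wave"), ([0.2], "wave"), ([0.3], "wave"), ([1.0], "sit")],
}
query_by_modality = {"rgb": [0.2], "depth": [0.8], "skeleton": [0.1]}

label, votes = fuse_knn(train_by_modality, query_by_modality, k=3)
```

Here the depth classifier alone would predict "sit", but the summed votes across all three modalities favor "wave".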

Spatio-temporal SRU with global context-aware attention for 3D human action recognition

Multimedia Tools and Applications ◽

10.1007/s11042-019-08587-w ◽

2020 ◽

Vol 79 (17-18) ◽

pp. 12349-12371

Author(s):

Qingshan She ◽

Gaoyuan Mu ◽

Haitao Gan ◽

Yingle Fan

Keyword(s):

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Context Aware ◽

Global Context ◽

Spatio Temporal

Agglomerative Clustering and Residual-VLAD Encoding for Human Action Recognition

Applied Sciences ◽

10.3390/app10124412 ◽

2020 ◽

Vol 10 (12) ◽

pp. 4412

Author(s):

Ammar Mohsin Butt ◽

Muhammad Haroon Yousaf ◽

Fiza Murtaza ◽

Saima Nazir ◽

Serestina Viriri ◽

...

Keyword(s):

Action Recognition ◽

Feature Vector ◽

Human Action Recognition ◽

Human Action ◽

Compact Representation ◽

Agglomerative Clustering ◽

Residual Vector ◽

Benchmark Datasets ◽

Codebook Generation ◽

Spatio Temporal

Human action recognition has gathered significant attention in recent years due to its high demand in various application domains. In this work, we propose a novel codebook generation and hybrid encoding scheme for classification of action videos. The proposed scheme develops a discriminative codebook and a hybrid feature vector by encoding the features extracted from convolutional neural networks (CNNs). We explore different CNN architectures for extracting spatio-temporal features. We employ an agglomerative clustering approach for codebook generation, which aims to combine the advantages of global and class-specific codebooks. We propose a Residual Vector of Locally Aggregated Descriptors (R-VLAD) and fuse it with locality-based coding to form a hybrid feature vector, which provides a compact representation along with high-order statistics. We evaluated our work on two publicly available standard benchmark datasets, HMDB-51 and UCF-101. The proposed method achieves 72.6% and 96.2% on HMDB-51 and UCF-101, respectively. We conclude that the proposed scheme is able to boost recognition accuracy for human action recognition.
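For context, the standard VLAD encoding that R-VLAD builds on can be sketched as follows. This is plain VLAD, not the paper's R-VLAD or its hybrid vector, whose details the abstract does not give. The two-word codebook and the descriptors are hypothetical.

```python
import math


def vlad_encode(descriptors, codebook):
    """Standard VLAD: sum residuals to the nearest codeword, then L2-normalize."""
    dim = len(codebook[0])
    residuals = [[0.0] * dim for _ in codebook]
    for d in descriptors:
        # Hard-assign the descriptor to its nearest codeword.
        nearest = min(
            range(len(codebook)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(d, codebook[j])),
        )
        for i in range(dim):
            residuals[nearest][i] += d[i] - codebook[nearest][i]
    # Concatenate the per-codeword residual sums into one vector.
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Hypothetical codebook of two 2-D codewords and three local descriptors.
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
encoding = vlad_encode(descriptors, codebook)
```

The output length is the number of codewords times the descriptor dimension, which is why the representation stays compact and fixed in size no matter how many descriptors a video yields.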

View-Invariant Deep Architecture for Human Action Recognition Using Two-Stream Motion and Shape Temporal Dynamics

IEEE Transactions on Image Processing ◽

10.1109/tip.2020.2965299 ◽

2020 ◽

Vol 29 ◽

pp. 3835-3844 ◽

Cited By ~ 2

Author(s):

Chhavi Dhiman ◽

Dinesh Kumar Vishwakarma

Keyword(s):

Action Recognition ◽

Temporal Dynamics ◽

Human Action Recognition ◽

Human Action ◽

Deep Architecture

Multi-Instance Multi-Label Action Recognition and Localization Based on Spatio-Temporal Pre-Trimming for Untrimmed Videos

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6986 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12886-12893

Author(s):

Xiao-Yu Zhang ◽

Haichao Shi ◽

Changsheng Li ◽

Peng Li

Keyword(s):

Action Recognition ◽

Human Action ◽

Learning Problem ◽

The Arts ◽

Key Factor ◽

Benchmark Datasets ◽

Spatio Temporal ◽

Weakly Supervised ◽

Temporal Localization ◽

Coarse To Fine

Weakly supervised action recognition and localization for untrimmed videos is a challenging problem with extensive applications. The overwhelming amount of irrelevant background content in untrimmed videos severely hampers effective identification of actions of interest. In this paper, we propose a novel multi-instance multi-label modeling network based on spatio-temporal pre-trimming to recognize actions and locate the corresponding frames in untrimmed videos. Motivated by the fact that the person is the key factor in a human action, we spatially and temporally segment each untrimmed video into person-centric clips with pose estimation and tracking techniques. Given the bag-of-instances structure associated with video-level labels, action recognition is naturally formulated as a multi-instance multi-label learning problem. The network is optimized iteratively with selective coarse-to-fine pre-trimming based on instance-label activation. After convergence, temporal localization is further achieved with a local-global temporal class activation map. Extensive experiments are conducted on two benchmark datasets, THUMOS14 and ActivityNet1.3, and the experimental results clearly corroborate the efficacy of our method when compared with the state of the art.
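The bag-of-instances formulation can be illustrated with a generic multi-instance multi-label aggregation: each clip (instance) gets a score per class, the video (bag) score for a class is pooled over its clips, and clips with high scores localize the action. The max pooling, the 0.5 threshold, and the scores below are assumptions for illustration. The abstract does not describe the paper's network, its iterative pre-trimming, or its class activation maps in enough detail to reproduce them.

```python
def bag_scores(instance_scores):
    """Max-pool per-class instance scores into bag (video-level) scores."""
    return [max(col) for col in zip(*instance_scores)]


def predict_labels(instance_scores, threshold=0.5):
    """Multi-label prediction: every class whose bag score passes the threshold."""
    return [c for c, s in enumerate(bag_scores(instance_scores)) if s >= threshold]


def localize(instance_scores, label, threshold=0.5):
    """Indices of the clips whose score for the given class passes the threshold."""
    return [i for i, row in enumerate(instance_scores) if row[label] >= threshold]


# Hypothetical scores: 4 person-centric clips (rows) x 3 action classes (columns).
clip_scores = [
    [0.1, 0.2, 0.0],
    [0.9, 0.1, 0.1],
    [0.8, 0.7, 0.2],
    [0.1, 0.6, 0.3],
]
video_labels = predict_labels(clip_scores)
segments = {label: localize(clip_scores, label) for label in video_labels}
```

With these scores the video is assigned classes 0 and 1. Class 0 is localized to clips 1 and 2, and class 1 to clips 2 and 3.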
