An Attention Enhanced Spatial–Temporal Graph Convolutional LSTM Network for Action Recognition in Karate

2021 ◽  
Vol 11 (18) ◽  
pp. 8641
Author(s):  
Jianping Guo ◽  
Hong Liu ◽  
Xi Li ◽  
Dahong Xu ◽  
Yihan Zhang

With the increasing popularity of artificial intelligence applications, artificial intelligence technology has begun to be applied in competitive sports. These applications have promoted the improvement of athletes’ competitive ability, as well as the fitness of the general public. Human action recognition technology based on deep learning has gradually been applied to the analysis of the technical actions of competitive athletes, as well as the analysis of tactics. In this paper, a new graph convolution model is proposed. Delaunay’s partitioning algorithm is used to construct a new spatiotemporal topology that effectively captures the structural information and spatiotemporal features of athletes’ technical actions. At the same time, an attention mechanism is integrated into the model, assigning different weight coefficients to the joints, which significantly improves the accuracy of technical action recognition. First, the model was compared with current state-of-the-art methods on the general Kinetics and NTU-RGB+D datasets, where its performance was slightly improved. Then, our algorithm was compared with spatial temporal graph convolutional networks (ST-GCN) on the karate technique action dataset, where we found that its accuracy was significantly improved.
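The joint-weighting idea in the abstract can be sketched as one spatial graph-convolution step in which each joint's contribution is scaled by an attention coefficient. This is a minimal pure-Python illustration, not the paper's model: the adjacency list stands in for the Delaunay-derived topology, and the attention scores here are derived from feature norms rather than learned.

```python
import math

def joint_attention(features):
    """Score each joint by the norm of its feature vector, softmax-normalised.
    (Illustrative stand-in for the paper's learned attention weights.)"""
    scores = [math.sqrt(sum(c * c for c in f)) for f in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attended_graph_conv(features, adjacency):
    """One spatial graph-convolution step: each joint averages its own and its
    neighbours' features, each weighted by the per-joint attention coefficient."""
    alpha = joint_attention(features)
    out = []
    for i, nbrs in enumerate(adjacency):
        agg = [0.0] * len(features[0])
        for j in nbrs + [i]:                      # include a self-loop
            for c in range(len(agg)):
                agg[c] += alpha[j] * features[j][c]
        out.append([v / (len(nbrs) + 1) for v in agg])
    return out
```

In the paper the adjacency would come from the Delaunay partition of the skeleton and the weights from training; the point of the sketch is only how per-joint coefficients enter the aggregation.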

Author(s):  
Bin Li ◽  
Xi Li ◽  
Zhongfei Zhang ◽  
Fei Wu

Owing to its representation effectiveness, skeleton-based human action recognition has received considerable research attention and has a wide range of real applications. Many existing methods in this area rely on a fixed physical-connectivity skeleton structure for recognition, which cannot adequately capture the intrinsic high-order correlations among skeleton joints. In this paper, we propose a novel spatio-temporal graph routing (STGR) scheme for skeleton-based action recognition, which adaptively learns the intrinsic high-order connectivity relationships for physically-apart skeleton joints. Specifically, the scheme is composed of two components: a spatial graph router (SGR) and a temporal graph router (TGR). The SGR discovers connectivity relationships among the joints based on sub-group clustering along the spatial dimension, while the TGR explores structural information by measuring the correlation degrees between temporal joint-node trajectories. The proposed scheme is naturally and seamlessly incorporated into the framework of graph convolutional networks (GCNs) to produce a set of skeleton-joint-connectivity graphs, which are then fed into the classification networks. Moreover, an insightful analysis of the receptive field of a graph node is provided to explain the necessity of our method. Experimental results on two benchmark datasets (NTU-RGB+D and Kinetics) demonstrate the effectiveness of the method against the state of the art.
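The temporal graph router's core idea, connecting physically-apart joints whose trajectories move together, can be sketched with a plain correlation test. This is a simplified stand-in, assuming Pearson correlation as the similarity measure and a fixed threshold; the paper's routers are learned modules.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length 1-D trajectories."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def temporal_graph_router(trajectories, threshold=0.8):
    """Add an edge between every pair of joints whose motion trajectories are
    strongly correlated over time (a toy version of TGR's routing decision)."""
    n = len(trajectories)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(trajectories[i], trajectories[j])) >= threshold:
                edges.append((i, j))
    return edges
```

The resulting edge list would augment the physical skeleton graph before the GCN layers, which is the role the skeleton-joint-connectivity graphs play in the STGR framework.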


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Min Zhang ◽  
Haijie Yang ◽  
Pengfei Li ◽  
Ming Jiang

Skeleton-based human action recognition has attracted much attention in the field of computer vision. Most previous studies are based on fixed skeleton graphs, so only the local physical dependencies among joints can be captured, and implicit joint correlations are omitted. In addition, the appearance of the same action differs greatly across views; in some views, keypoints are occluded, which causes recognition errors. In this paper, an action recognition method based on a distance vector and a multihigh view adaptive network (DV-MHNet) is proposed to address this challenging task. The multihigh (MH) view adaptive networks are constructed to automatically determine the best observation view at different heights, obtain complete keypoint information for the current frame, and enhance the robustness and generalization of the model when recognizing actions at different heights. The distance vector (DV) mechanism is then introduced on this basis to establish the relative distance and relative orientation between different keypoints in the same frame, and between the same keypoint in different frames, capturing the global potential relationship of each keypoint. Finally, a spatial temporal graph convolutional network is constructed to account for information in both space and time and to learn the characteristics of the action. We conducted ablation studies against traditional spatial temporal graph convolutional networks, with and without the multihigh view adaptive networks, which demonstrate the effectiveness of the model. The model is evaluated on two widely used action recognition benchmarks (NTU-RGB+D and PKU-MMD) and achieves better performance on both datasets.
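The same-frame part of the distance vector mechanism, relative distance and relative orientation between keypoint pairs, reduces to elementary geometry. The sketch below is an illustration of that computation for 2-D keypoints, not the paper's exact feature construction (which also covers cross-frame pairs and feeds the result to the ST-GCN).

```python
import math

def distance_vectors(frame):
    """For each ordered pair of 2-D keypoints in one frame, return the
    relative distance and relative orientation (angle of the displacement),
    keyed by the (from, to) joint indices."""
    dv = {}
    for i, (xi, yi) in enumerate(frame):
        for j, (xj, yj) in enumerate(frame):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            dv[(i, j)] = (math.hypot(dx, dy), math.atan2(dy, dx))
    return dv
```

The cross-frame term of the DV mechanism would apply the same computation to one keypoint's positions in consecutive frames, yielding its displacement magnitude and direction over time.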


Data ◽  
2020 ◽  
Vol 5 (4) ◽  
pp. 104
Author(s):  
Ashok Sarabu ◽  
Ajit Kumar Santra

The two-stream convolutional neural network (CNN) has proven a great success in action recognition in videos. The main idea is to train two CNNs to learn spatial and temporal features separately, and to combine their scores into a final prediction. In the literature, we observed that most methods use similar CNNs for the two streams. In this paper, we design a two-stream architecture with different CNNs for the two streams. Temporal Segment Networks (TSN) are applied to capture long-range temporal structure and to differentiate similar sub-actions in videos. Data augmentation techniques are employed to prevent over-fitting. Advanced cross-modal pre-training is discussed and introduced into the proposed architecture to enhance recognition accuracy. The proposed two-stream model is evaluated on two challenging action recognition datasets, HMDB-51 and UCF-101, where it shows a significant performance increase and outperforms existing methods.
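The score-combination step of a two-stream network can be sketched as weighted late fusion of the two streams' class probabilities. This is a generic illustration, assuming softmax scores and illustrative stream weights; the paper's actual fusion weights and network outputs are not specified here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_two_stream(spatial_logits, temporal_logits,
                    w_spatial=1.0, w_temporal=1.5):
    """Late fusion: softmax each stream's class scores, then take a weighted
    sum. The weights here are illustrative placeholders, not tuned values."""
    p_s = softmax(spatial_logits)
    p_t = softmax(temporal_logits)
    return [w_spatial * a + w_temporal * b for a, b in zip(p_s, p_t)]
```

The predicted class is the argmax of the fused scores; weighting the temporal stream more heavily is a common choice in two-stream work, but any ratio can be substituted.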


Author(s):  
Xueping Liu ◽  
Xingzuo Yue

The kernel function has been successfully utilized in the extreme learning machine (ELM), where it provides stabilized, generalized performance and greatly reduces computational complexity. However, selecting and optimizing the parameters of the most common kernel functions is tedious and time-consuming. In this study, a set of new Hermite kernel functions derived from the generalized Hermite polynomials is proposed. The key contribution of the proposed kernel is that it has only one parameter, selected from a small set of natural numbers; parameter optimization is thus greatly simplified, while the structural information of the sample data is retained. Consequently, the new kernel functions can be used as optimal alternatives to other common kernel functions for ELM at a rapid learning speed. Experimental results showed that the proposed kernel ELM method tends to have similar or better robustness and generalized performance at a faster learning speed than other common kernel ELM and support vector machine methods. When applied to human action recognition from depth video sequences, the method also achieves excellent performance, demonstrating its speed advantage on video data.
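To make the "only one parameter from a small set of natural numbers" point concrete, the sketch below evaluates physicists' Hermite polynomials by their standard recurrence and forms a toy kernel from them. The kernel form here is an assumption for illustration only; the paper's generalized Hermite kernel is constructed differently, but shares the property that the polynomial order n is the sole parameter to choose.

```python
def hermite(n, x):
    """Physicists' Hermite polynomial H_n(x) via the recurrence
    H_{n+1}(x) = 2x*H_n(x) - 2n*H_{n-1}(x), with H_0 = 1, H_1 = 2x."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, 2.0 * x
    for k in range(1, n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h

def hermite_kernel(x, y, n=2):
    """Toy kernel: apply H_n per dimension to both inputs and sum the
    products. Hypothetical form chosen to show that n is the only knob."""
    return sum(hermite(n, a) * hermite(n, b) for a, b in zip(x, y))
```

Because n ranges over a handful of natural numbers, "tuning" this kernel is an exhaustive search over a few integers, in contrast to grid-searching a continuous bandwidth as with an RBF kernel.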


Sensors ◽  
2020 ◽  
Vol 20 (18) ◽  
pp. 5260 ◽  
Author(s):  
Fanjia Li ◽  
Juanjuan Li ◽  
Aichun Zhu ◽  
Yonggang Xu ◽  
Hongsheng Yin ◽  
...  

In the skeleton-based human action recognition domain, spatial-temporal graph convolution networks (ST-GCNs) have made great progress recently. However, they use only one fixed temporal convolution kernel, which is not enough to extract the temporal cues comprehensively. Moreover, simply connecting the spatial graph convolution layer (GCL) and the temporal GCL in series is not the optimal solution. To this end, we propose a novel enhanced spatial and extended temporal graph convolutional network (EE-GCN) in this paper. Three convolution kernels with different sizes are chosen to extract discriminative temporal features from shorter to longer terms. The corresponding GCLs are then concatenated by a powerful yet efficient one-shot aggregation (OSA) + effective squeeze-excitation (eSE) structure. The OSA module aggregates the features from each layer once into the output, and the eSE module explores the interdependency between the channels of the output. In addition, we propose a new connection paradigm to enhance the spatial features, which expands the serial connection into a combination of serial and parallel connections by adding a spatial GCL in parallel with the temporal GCLs. The proposed method is evaluated on three large-scale datasets, and the experimental results show that its performance exceeds previous state-of-the-art methods.
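The multi-kernel temporal branch plus one-shot aggregation can be sketched with plain 1-D convolutions. This is a minimal illustration under stated assumptions: the kernel sizes and values are placeholders (averaging filters), and "aggregation" is a single concatenation of all branch outputs, mirroring OSA's aggregate-once design rather than the paper's learned filters.

```python
def temporal_conv(seq, kernel):
    """Valid 1-D convolution of a per-joint feature sequence with a kernel."""
    k = len(kernel)
    return [sum(kernel[j] * seq[i + j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def multi_scale_temporal(seq):
    """Run three temporal kernels of different sizes over the same sequence,
    then aggregate every branch's output once into a single feature list
    (OSA-style). Kernel sizes/values are illustrative placeholders."""
    kernels = [[1.0],                      # short-term (size 1)
               [0.5, 0.5],                 # medium-term (size 2)
               [1 / 3, 1 / 3, 1 / 3]]      # long-term (size 3)
    branches = [temporal_conv(seq, k) for k in kernels]
    return [v for branch in branches for v in branch]  # one-shot aggregation
```

In EE-GCN the aggregated features would then pass through the eSE channel gate; here the sketch stops at the aggregation step, which is the part the abstract describes structurally.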


Symmetry ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 1589
Author(s):  
Zeyuan Hu ◽  
Eung-Joo Lee

Traditional convolution neural networks have achieved great success in human action recognition. However, it is challenging to establish effective associations between different human bone nodes to capture detailed information. In this paper, we propose a dual attention-guided multiscale dynamic aggregate graph convolution neural network (DAG-GCN) for skeleton-based human action recognition. Our goal is to explore the best correlations and determine high-level semantic features. First, a multiscale dynamic aggregate GCN module is used to capture important semantic information and to establish dependence relationships between different bone nodes. Second, the higher-level semantic features are further refined, and their semantic relevance is emphasized, through a dual attention guidance module. In addition, we exploit the hierarchical relationships of joints and the spatial-temporal correlations through these two modules. Experiments with the DAG-GCN method yield good performance on the NTU-60-RGB+D and NTU-120-RGB+D datasets. The accuracy is 95.76% and 90.01%, respectively, for the cross-view (X-View) and cross-subject (X-Sub) benchmarks on the NTU-60 dataset.
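The "dual attention guidance" idea, gating a joints-by-channels feature map along both axes, can be sketched with two sigmoid gates. This is a minimal stand-in under stated assumptions: the gates here are computed from simple means rather than the learned projections a real dual attention module would use.

```python
import math

def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dual_attention(features):
    """Re-weight a joints x channels feature map with two gates: a channel
    gate from per-channel means and a joint (spatial) gate from per-joint
    means. A toy sketch of dual attention guidance, not the paper's module."""
    n, c = len(features), len(features[0])
    chan_gate = [_sigmoid(sum(row[k] for row in features) / n)
                 for k in range(c)]
    joint_gate = [_sigmoid(sum(row) / c) for row in features]
    return [[features[i][k] * chan_gate[k] * joint_gate[i]
             for k in range(c)] for i in range(n)]
```

Applying both gates multiplicatively is what lets the module emphasize informative joints and informative channels at the same time, which is the intuition the abstract attributes to the dual attention guidance module.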

