Action Recognition Based on the Fusion of Graph Convolutional Networks with High Order Features

Jiuqing Dong; Yongbin Gao; Hyo Jong Lee; Heng Zhou; Yifan Yao; Zhijun Fang; Bo Huang

doi:10.3390/app10041482

Applied Sciences, doi:10.3390/app10041482, 2020, Vol 10 (4), pp. 1482, Cited By ~ 2

Author(s): Jiuqing Dong, Yongbin Gao, Hyo Jong Lee, Heng Zhou, Yifan Yao, ...

Keyword(s): Neural Network, Action Recognition, Feature Fusion, State Of The Art, Recognition Task, High Order, Recognition Method, Related Research, Convolutional Networks, Temporal Features

Skeleton-based action recognition is widely studied in action-related research because skeleton data have clear features and are invariant to human appearance and illumination, which also improves the robustness of action recognition. Graph convolutional networks have been applied to skeleton data to recognize actions, and recent studies have shown that they work well on this task using the spatial and temporal features of the skeleton. The prevalent methods for extracting spatial and temporal features rely purely on a deep network to learn from primitive 3D joint positions. In this paper, we propose a novel action recognition method that applies high-order spatial and temporal features from skeleton data, such as velocity features, acceleration features, and the relative distance between 3D joints. A multi-stream feature fusion method is adopted to fuse these high-order features. Extensive experiments on two large and challenging datasets, NTU-RGBD and NTU-RGBD-120, indicate that our model achieves state-of-the-art performance.
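
As a rough illustration of the high-order features named in this abstract, the sketch below derives velocity, acceleration, and pairwise joint distances from raw 3D joint positions. The abstract does not give the paper's exact definitions, so the finite-difference formulas, the function names, and the toy sequence are all assumptions made for illustration.

```python
import math

def velocity(seq):
    """First-order temporal difference: v[t] = x[t+1] - x[t] for every joint.

    Assumed definition; the abstract only names "velocity features".
    """
    return [
        [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]

def acceleration(seq):
    """Second-order temporal difference, taken as the velocity of the velocity."""
    return velocity(velocity(seq))

def relative_distances(frame):
    """Euclidean distance between every pair of joints in one frame."""
    return [[math.dist(p, q) for q in frame] for p in frame]

# Hypothetical toy sequence: 3 frames, 2 joints, (x, y, z) coordinates.
seq = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 2.0, 0.0)],
    [(0.0, 3.0, 0.0), (1.0, 5.0, 0.0)],
]
vel = velocity(seq)                 # 2 frames of per-joint displacement
acc = acceleration(seq)             # 1 frame of per-joint change in displacement
dist = relative_distances(seq[0])   # 2 x 2 distance matrix for the first frame
```

In a multi-stream setup, each of these derived sequences could be fed to its own network stream, with the stream outputs fused afterwards; how the paper performs that fusion is not specified in the abstract.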

Multi-Term Attention Networks for Skeleton-Based Action Recognition

Applied Sciences, doi:10.3390/app10155326, 2020, Vol 10 (15), pp. 5326

Author(s): Xiaolei Diao, Xiaoqiang Li, Chen Huang

Keyword(s): Neural Network, Time Scales, Action Recognition, State Of The Art, Attention Networks, Weighted Fusion, Temporal Features, Benchmark Datasets, Spatio Temporal, Different Time Scales

The same action can take a different amount of time in different cases, and this difference affects the accuracy of action recognition to some extent. We propose an end-to-end deep neural network called “Multi-Term Attention Networks” (MTANs), which addresses this problem by extracting temporal features at different time scales. The network consists of a Multi-Term Attention Recurrent Neural Network (MTA-RNN) and a Spatio-Temporal Convolutional Neural Network (ST-CNN). In MTA-RNN, a method for fusing multi-term temporal features is proposed to extract the temporal dependence at different time scales, and the weighted-fusion temporal feature is recalibrated by an attention mechanism. Ablation studies show that this network has powerful spatio-temporal dynamic modeling capabilities for actions with different time scales. We perform extensive experiments on four challenging benchmark datasets: NTU RGB+D, UT-Kinect, Northwestern-UCLA, and UWA3DII. Our method achieves better results than the state-of-the-art benchmarks, which demonstrates the effectiveness of MTANs.
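
The idea of fusing temporal features from several time scales with learned weights can be pictured with a very small stand-in. The sketch below is not the MTA-RNN: it replaces the recurrent features with plain means over short, medium, and long windows, and uses fixed hypothetical scores where a trained attention mechanism would supply them.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def term_summary(sequence, window):
    """Mean of the last `window` steps: a crude stand-in for one temporal term."""
    recent = sequence[-window:]
    return sum(recent) / len(recent)

def fuse_terms(sequence, windows, scores):
    """Softmax-weighted fusion of the per-window summaries."""
    weights = softmax(scores)
    return sum(w * term_summary(sequence, win) for w, win in zip(weights, windows))

# Hypothetical 1D feature over 6 time steps, summarized at three time scales.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
windows = [1, 3, 6]  # short, medium, long term (assumed values)

fused_equal = fuse_terms(signal, windows, scores=[0.0, 0.0, 0.0])  # equal weights
fused_short = fuse_terms(signal, windows, scores=[5.0, 0.0, 0.0])  # favors short term
```

With equal scores the three summaries (6.0, 5.0, and 3.5) are simply averaged; raising the first score pulls the fused value toward the short-term summary, which is the kind of recalibration the abstract attributes to the attention mechanism.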

Shallow Graph Convolutional Network for Skeleton-Based Action Recognition

Sensors, doi:10.3390/s21020452, 2021, Vol 21 (2), pp. 452

Author(s): Wenjie Yang, Jianlin Zhang, Jingju Cai, Zhiyong Xu

Keyword(s): Action Recognition, State Of The Art, Computational Cost, Receptive Fields, Recognition Task, Convolutional Network, Convolutional Networks, Spatial Graph, Graph Size, Skeleton Graph

Graph convolutional networks (GCNs) have brought considerable improvement to the skeleton-based action recognition task. Existing GCN-based methods usually use a fixed spatial graph size across all layers. This severely limits the model’s ability to exploit global and semantically discriminative information because the receptive fields are restricted. Furthermore, a fixed graph size causes many redundancies in the representation of actions, which is inefficient and can hinder the model from focusing on beneficial features. To address these issues, we propose a plug-and-play channel adaptive merging module (CAMM) specific to the human skeleton graph, which merges the vertices from the same part of the skeleton graph adaptively and efficiently. The merge weights differ across channels, so every channel has the flexibility to integrate the joints in its own way. We then build a novel shallow graph convolutional network (SGCN) based on the module, which achieves state-of-the-art performance with less computational cost. Experimental results on NTU-RGB+D and Kinetics-Skeleton illustrate the superiority of our methods.
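
The merging step described here, where joints of the same body part are combined with weights that differ per channel, can be sketched as a per-channel weighted sum. This is a minimal illustration under assumed shapes and names (`merge_joints`, `parts`, `weights` are hypothetical); in the actual module the weights would be learned, and the abstract does not state how they are parameterized.

```python
def merge_joints(features, parts, weights):
    """Merge joints into body parts with channel-specific weights.

    features[c][v]   : feature of joint v in channel c
    parts[p]         : list of joint indices that belong to part p
    weights[c][p][k] : weight of the k-th joint of part p in channel c
    Returns merged[c][p], one value per channel and part.
    """
    return [
        [
            sum(w * features[c][v] for w, v in zip(weights[c][p], part))
            for p, part in enumerate(parts)
        ]
        for c in range(len(features))
    ]

# Hypothetical example: 2 channels, 4 joints grouped into 2 parts.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
]
parts = [[0, 1], [2, 3]]
weights = [
    [[0.5, 0.5], [0.5, 0.5]],  # channel 0 averages the joints of each part
    [[1.0, 0.0], [0.0, 1.0]],  # channel 1 keeps a single joint per part
]
merged = merge_joints(features, parts, weights)
```

The graph shrinks from four vertices to two, and the two channels summarize the same part differently, which is the flexibility the abstract refers to.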

Relation Selective Graph Convolutional Network for Skeleton-Based Action Recognition

Symmetry, doi:10.3390/sym13122275, 2021, Vol 13 (12), pp. 2275

Author(s): Wenjie Yang, Jianlin Zhang, Jingju Cai, Zhiyong Xu

Keyword(s): Action Recognition, State Of The Art, Recognition Task, Context Modeling, Selection Mechanism, Temporal Attention, Significant Progress, Convolutional Network, Convolutional Networks, Art Performance

Graph convolutional networks (GCNs) have made significant progress in the skeletal action recognition task. However, the graphs constructed by these methods are too densely connected, and the same graphs are used repeatedly across channels. Redundant connections blur the useful interdependencies of joints, and graphs that are repeated across channels cannot handle changes in joint relations between different actions. In this work, we propose a novel relation selective graph convolutional network (RS-GCN). We also design a trainable relation selection mechanism, which encourages the model to select reliable edges and build a stable, sparse topology of joints. Channel-wise graph convolution and multiscale temporal convolution are proposed to strengthen the model’s representative power. Furthermore, we introduce an asymmetrical module, named the spatial-temporal attention module, for more stable context modeling. Combining these changes, our model achieves state-of-the-art performance on three public benchmarks, namely NTU-RGB+D, NTU-RGB+D 120, and Northwestern-UCLA.

I3D-Shufflenet Based Human Action Recognition

Algorithms, doi:10.3390/a13110301, 2020, Vol 13 (11), pp. 301

Author(s): Guocheng Liu, Caixia Zhang, Qingyang Xu, Ruoshi Cheng, Yong Song, ...

Keyword(s): Neural Network, Action Recognition, Human Action Recognition, Human Action, Recognition Algorithm, Convolution Kernel, Histogram Of Oriented Gradients, Temporal Features, Convolution Kernels

Because optical-flow-based human action recognition is difficult to apply owing to its large computational cost, a human action recognition algorithm, the I3D-shufflenet model, is proposed that combines the advantages of the I3D neural network and the lightweight ShuffleNet model. The 5 × 5 convolution kernel of I3D is replaced by two 3 × 3 convolution kernels, which reduces the amount of computation. A shuffle layer is adopted to exchange features between channels. Human actions are recognized and classified with the trained I3D-shufflenet model. The experimental results show that the shuffle layer improves the composition of features in each channel, which promotes the use of useful information. Histogram of Oriented Gradients (HOG) spatial-temporal features of the object are extracted for training, which significantly improves the representation of human actions and reduces the computation needed for feature extraction. I3D-shufflenet is tested on the UCF101 dataset and compared with other models. The final results show that I3D-shufflenet reaches an accuracy of 96.4%, higher than that of the original I3D.
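
Two of the ideas in this abstract lend themselves to a quick worked example: the saving from replacing one 5 × 5 kernel with two 3 × 3 kernels, and the channel shuffle used to exchange features between groups. The sketch below counts 2D convolution weights under the assumptions that the two 3 × 3 layers are stacked, that the intermediate layer keeps the same channel count, and that biases are ignored; the shuffle follows the usual reshape-transpose-flatten pattern. Function names and the channel counts are illustrative only.

```python
def conv_weights(kernel, c_in, c_out, layers=1):
    """Number of weights in `layers` stacked 2D convolutions (biases ignored)."""
    return layers * kernel * kernel * c_in * c_out

def channel_shuffle(channels, groups):
    """Interleave channels across groups: view as (groups, n), transpose, flatten."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]

# Assumed layer width of 64 input and 64 output channels.
single_5x5 = conv_weights(5, 64, 64)             # 25 * 64 * 64
double_3x3 = conv_weights(3, 64, 64, layers=2)   # 2 * 9 * 64 * 64
ratio = double_3x3 / single_5x5                  # 18 / 25 = 0.72

# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
```

Two stacked 3 × 3 convolutions with stride 1 cover the same 5 × 5 receptive field with 18 instead of 25 weights per channel pair, and after the shuffle each group of consecutive channels contains channels from both original groups.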

Distinct Two-Stream Convolutional Networks for Human Action Recognition in Videos Using Segment-Based Temporal Modeling

Data, doi:10.3390/data5040104, 2020, Vol 5 (4), pp. 104

Author(s): Ashok Sarabu, Ajit Kumar Santra

Keyword(s): Action Recognition, Data Augmentation, Main Idea, Human Action Recognition, Human Action, Great Success, Temporal Modeling, Convolutional Networks, Temporal Features, Augmentation Techniques

The two-stream convolutional neural network (CNN) has proven very successful for action recognition in videos. The main idea is to train two CNNs to learn spatial and temporal features separately and to combine the two scores to obtain the final scores. In the literature, we observed that most methods use similar CNNs for the two streams. In this paper, we design a two-stream CNN architecture with different CNNs for the two streams to learn spatial and temporal features. Temporal Segment Networks (TSN) are applied to retrieve long-range temporal features and to differentiate similar types of sub-action in videos. Data augmentation techniques are employed to prevent over-fitting. Advanced cross-modal pre-training is discussed and introduced into the proposed architecture to enhance the accuracy of action recognition. The proposed two-stream model is evaluated on two challenging action recognition datasets: HMDB-51 and UCF-101. The results show a significant performance increase, and the proposed architecture outperforms existing methods.
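
The segment-based sampling and the score combination described above can be sketched without any network code. The example below splits a video into equal segments, takes the center frame of each (TSN samples a random frame per segment during training; the center is used here to keep the example deterministic), averages per-segment class scores, and fuses the two streams with a weighted sum. The class scores and the fusion weights are made-up numbers, not values from the paper.

```python
def segment_centers(num_frames, num_segments):
    """Index of the center frame of each of `num_segments` equal segments."""
    seg_len = num_frames / num_segments
    return [int(seg_len * (k + 0.5)) for k in range(num_segments)]

def average_scores(score_lists):
    """Segment consensus: average the class scores over all segments."""
    n = len(score_lists)
    return [sum(col) / n for col in zip(*score_lists)]

def fuse_streams(spatial, temporal, w_spatial=0.5, w_temporal=0.5):
    """Weighted sum of the spatial and temporal class scores (assumed weights)."""
    return [w_spatial * s + w_temporal * t for s, t in zip(spatial, temporal)]

frames = segment_centers(num_frames=30, num_segments=3)

# Hypothetical per-segment class scores for a 2-class problem.
spatial = average_scores([[0.25, 0.75], [0.5, 0.5]])
temporal = average_scores([[0.0, 1.0], [0.5, 0.5]])
final = fuse_streams(spatial, temporal)
predicted = max(range(len(final)), key=final.__getitem__)
```

Sampling one frame per segment lets the model see the whole video at a fixed cost, which is how segment-based modeling captures long-range temporal structure.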

A Hybrid Network for Large-Scale Action Recognition from RGB and Depth Modalities

Sensors, doi:10.3390/s20113305, 2020, Vol 20 (11), pp. 3305, Cited By ~ 1

Author(s): Huogen Wang, Zhanjie Song, Wanqing Li, Pichao Wang

Keyword(s): Neural Network, Action Recognition, Canonical Correlation, Large Scale, State Of The Art, Hybrid Network, Support Vector, Multiple Modalities, Large Margin, Percentage Points

The paper presents a novel hybrid network for large-scale action recognition from multiple modalities. The network is built upon the proposed weighted dynamic images. It effectively leverages the strengths of the emerging Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based approaches to specifically address the challenges that occur in large-scale action recognition and are not fully dealt with by the state-of-the-art methods. Specifically, the proposed hybrid network consists of a CNN-based component and an RNN-based component. Features extracted by the two components are fused through canonical correlation analysis and then fed to a linear Support Vector Machine (SVM) for classification. The proposed network achieved state-of-the-art results on the ChaLearn LAP IsoGD, NTU RGB+D and Multi-modal & Multi-view & Interactive (M²I) datasets and outperformed existing methods by a large margin (over 10 percentage points in some cases).

Temporal Residual Feature Learning for Efficient 3D Convolutional Neural Network on Action Recognition Task

2020 IEEE Workshop on Signal Processing Systems (SiPS), doi:10.1109/sips50750.2020.9195240, 2020

Author(s): Haonan Wang, Yuchen Mei, Jun Lin, Zhongfeng Wang

Keyword(s): Neural Network, Convolutional Neural Network, Action Recognition, Recognition Task, Feature Learning

Task-driven hierarchical deep neural network models of the proprioceptive pathway

doi:10.1101/2020.05.06.081372, 2020

Author(s): Kai J. Sandbrink, Pranav Mamidanna, Claudio Michaelis, Mackenzie Weygandt Mathis, Matthias Bethge, ...

Keyword(s): Neural Network, Neural Networks, Somatosensory Cortex, Action Recognition, Large Scale, Recognition Task, List Type, Control Mechanisms, Neural Network Models, 3D Space

Biological motor control is versatile and efficient. Muscles are flexible and undergo continuous changes, requiring distributed adaptive control mechanisms. How proprioception solves this problem in the brain is unknown. Here we pursue a task-driven modeling approach that has provided important insights into other sensory systems. However, unlike for vision and audition, where large annotated datasets of raw images or sound are readily available, data of relevant proprioceptive stimuli are not. We generated a large-scale dataset of human arm trajectories as the hand traces the alphabet in 3D space, then used a musculoskeletal model to derive the spindle firing rates during these movements. We propose an action recognition task that allows training of hierarchical models to classify the character identity from the spindle firing patterns. Artificial neural networks could robustly solve this task, and the networks’ units show directional movement tuning akin to neurons in the primate somatosensory cortex. The same architectures with random weights also show similar kinematic feature tuning but do not reproduce the diversity of preferred directional tuning, nor do they have invariant tuning across 3D space. Taken together, our model is the first to link tuning properties in the proprioceptive system to the behavioral level.

Highlights:

We provide a normative approach to derive neural tuning of proprioceptive features from behaviorally-defined objectives.

We propose a method for creating a scalable muscle spindles dataset based on kinematic data and define an action recognition task as a benchmark.

Hierarchical neural networks solve the recognition task from muscle spindle inputs.

Individual neural network units in middle layers resemble neurons in primate somatosensory cortex and make predictions for neurons along the proprioceptive pathway.

A Deep Learning-Based Satellite Target Recognition Method Using Radar Data

Sensors, doi:10.3390/s19092008, 2019, Vol 19 (9), pp. 2008

Author(s): Lu, Zhang, Xu, Lin, Huo

Keyword(s): Neural Network, Deep Learning, Target Recognition, Recognition Performance, Recognition Task, Performance Testing, Radar Data, Data Partition, Distance Metric, Recognition Method

A novel satellite target recognition method based on radar data partition and deep learning techniques is proposed in this paper. For the radar satellite recognition task, orbital altitude is introduced as a distinct and accessible feature for dividing radar data. On this basis, we design a new distance metric for high-resolution range profiles (HRRPs), called normalized angular distance divided by correlation coefficient (NADDCC), and apply a hierarchical clustering method based on this metric to segment the radar observation angular domain. With these techniques, the radar data partition is completed and multiple HRRP data clusters are obtained. To further mine the essential features in HRRPs, a GRU-SVM model is designed and applied to radar HRRP target recognition for the first time. It consists of a multi-layer GRU neural network as a deep feature extractor and a linear SVM as a classifier. After training, the GRU neural network extracts effective and highly distinguishable features of HRRPs, and feature visualization shows its advantages. Furthermore, performance testing and comparison experiments demonstrate that the GRU neural network achieves better overall performance for HRRP target recognition than the LSTM neural network and the conventional RNN, and that the recognition performance of our method is better in almost all cases than that of several other common feature extraction methods or of using no data partition.

Pedestrian Attributes Recognition in Surveillance Scenarios Using Multi-Task Lightweight Convolutional Neural Network

Applied Sciences, doi:10.3390/app9194182, 2019, Vol 9 (19), pp. 4182, Cited By ~ 2

Author(s): Pu Yan, Li Zhuo, Jiafeng Li, Hui Zhang, Jing Zhang

Keyword(s): Neural Network, Convolutional Neural Network, State Of The Art, Cross Entropy, Semantic Features, Recognition Method, Relationship Model, High Level, Fully Connected, Video Structuring

Pedestrian attributes (such as gender, age, hairstyle, and clothing) can effectively represent the appearance of pedestrians. These are high-level semantic features that are robust to illumination, deformation, etc. Therefore, they can be widely used in person re-identification, video structuring analysis, and other applications. In this paper, a pedestrian attribute recognition method for surveillance scenarios using a multi-task lightweight convolutional neural network is proposed. First, the attribute labels of each pedestrian image are integrated into a label vector. Then, a multi-task lightweight Convolutional Neural Network (CNN) is designed, which consists of five convolutional layers, three pooling layers, and two fully connected layers, to extract the deep features of pedestrian images. Because the data distribution of the datasets is unbalanced, the loss function is improved on the basis of the sigmoid cross-entropy, with a scale factor added to balance the amount of data across attributes. By training the network, a mapping model between the deep features of pedestrian images and the integrated label vector of their attributes is established, which can be used to predict each attribute of a pedestrian. The experiments were conducted on two public pedestrian attribute datasets in surveillance scenarios, namely PETA and RAP. The results show that, compared with state-of-the-art pedestrian attribute recognition methods, the proposed method achieves superior accuracy: 91.88% on PETA and 87.44% on RAP.
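
A sigmoid cross-entropy with a per-attribute scale factor, as mentioned in this abstract, might look like the sketch below. The abstract does not say how the scale factor is defined or where it enters the loss, so placing it on the positive term and the specific values in `pos_scale` are assumptions made for illustration.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def scaled_sigmoid_ce(logits, labels, pos_scale):
    """Mean multi-label sigmoid cross-entropy with a scale on positive labels.

    Uses -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z).
    pos_scale[j] > 1 makes a missed positive of attribute j cost more
    (assumed form of the balancing scale factor).
    """
    total = 0.0
    for z, y, s in zip(logits, labels, pos_scale):
        total += s * y * softplus(-z) + (1 - y) * softplus(z)
    return total / len(logits)

# Hypothetical case: two attributes, the first one rare and therefore up-weighted.
logits = [0.0, 0.0]
labels = [1, 0]
plain = scaled_sigmoid_ce(logits, labels, pos_scale=[1.0, 1.0])
scaled = scaled_sigmoid_ce(logits, labels, pos_scale=[3.0, 1.0])
```

With both logits at zero, each term equals ln 2, so the unscaled loss is ln 2 and tripling the weight of the rare positive raises it to 2 ln 2, which pushes training to pay more attention to under-represented attributes.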
