Shallow Graph Convolutional Network for Skeleton-Based Action Recognition

Wenjie Yang; Jianlin Zhang; Jingju Cai; Zhiyong Xu

doi:10.3390/s21020452

Shallow Graph Convolutional Network for Skeleton-Based Action Recognition

Sensors ◽

10.3390/s21020452 ◽

2021 ◽

Vol 21 (2) ◽

pp. 452

Author(s):

Wenjie Yang ◽

Jianlin Zhang ◽

Jingju Cai ◽

Zhiyong Xu

Keyword(s):

Action Recognition ◽

State Of The Art ◽

Computational Cost ◽

Receptive Fields ◽

Recognition Task ◽

Convolutional Network ◽

Convolutional Networks ◽

Spatial Graph ◽

Graph Size ◽

Skeleton Graph

Graph convolutional networks (GCNs) have brought considerable improvement to the skeleton-based action recognition task. Existing GCN-based methods usually use a fixed spatial graph size across all layers. This severely limits the model’s ability to exploit global and semantically discriminative information, because the receptive fields are restricted. Furthermore, a fixed graph size causes many redundancies in the representation of actions, which is inefficient and can also hinder the model from focusing on beneficial features. To address these issues, we propose a plug-and-play channel adaptive merging module (CAMM) specific to the human skeleton graph, which merges the vertices from the same part of the skeleton graph adaptively and efficiently. The merge weights differ across channels, so every channel has the flexibility to integrate the joints in its own way. We then build a novel shallow graph convolutional network (SGCN) based on the module, which achieves state-of-the-art performance at a lower computational cost. Experimental results on NTU-RGB+D and Kinetics-Skeleton illustrate the superiority of our methods.
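The idea of merging the joints of one body part with weights that differ per channel can be illustrated with a minimal sketch. This is not the authors' CAMM implementation: the softmax normalization, the function name `merge_part`, and the toy numbers are all assumptions made for illustration only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def merge_part(features, logits):
    """Merge the joints of one body part into a single vertex.

    features[j][c] is the feature of joint j in channel c.
    logits[j][c] is a (hypothetical) learnable merge score for joint j, channel c.
    Each channel gets its own softmax over the joints, so the merge
    weights differ across channels.
    """
    num_channels = len(features[0])
    merged = []
    for c in range(num_channels):
        weights = softmax([row[c] for row in logits])
        merged.append(sum(w * row[c] for w, row in zip(weights, features)))
    return merged


# Toy example (assumed values): three arm joints, two channels.
arm_features = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
arm_logits = [[0.0, 10.0], [0.0, 0.0], [0.0, 0.0]]
merged_arm = merge_part(arm_features, arm_logits)
```

In this toy case, channel 0 averages the three joints uniformly, while channel 1 is dominated by the first joint, which is what "every channel has the flexibility to integrate the joints" would look like under these assumptions.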


Relation Selective Graph Convolutional Network for Skeleton-Based Action Recognition

Symmetry ◽

10.3390/sym13122275 ◽

2021 ◽

Vol 13 (12) ◽

pp. 2275

Author(s):

Wenjie Yang ◽

Jianlin Zhang ◽

Jingju Cai ◽

Zhiyong Xu

Keyword(s):

Action Recognition ◽

State Of The Art ◽

Recognition Task ◽

Context Modeling ◽

Selection Mechanism ◽

Temporal Attention ◽

Significant Progress ◽

Convolutional Network ◽

Convolutional Networks ◽

Art Performance

Graph convolutional networks (GCNs) have made significant progress in the skeletal action recognition task. However, the graphs constructed by these methods are too densely connected, and the same graphs are used repeatedly among channels. Redundant connections blur the useful interdependencies of joints, and graphs that are repeated across channels cannot handle changes in joint relations between different actions. In this work, we propose a novel relation selective graph convolutional network (RS-GCN). We also design a trainable relation selection mechanism, which encourages the model to select solid edges and build a stable, sparse topology of joints. Channel-wise graph convolution and multiscale temporal convolution are proposed to strengthen the model’s representational power. Furthermore, we introduce an asymmetrical module, the spatial-temporal attention module, for more stable context modeling. Combining these changes, our model achieves state-of-the-art performance on three public benchmarks, namely NTU-RGB+D, NTU-RGB+D 120, and Northwestern-UCLA.


A Novel Graph Representation for Skeleton-based Action Recognition

Signal & Image Processing An International Journal ◽

10.5121/sipij.2020.11605 ◽

2020 ◽

Vol 11 (6) ◽

pp. 65-73

Author(s):

Tingwei Li ◽

Ruiwen Zhang ◽

Qing Li

Keyword(s):

Action Recognition ◽

State Of The Art ◽

Graph Representation ◽

Convolutional Networks ◽

Multi Scale ◽

Spatial Features ◽

Skeleton Graph ◽

Temporal And Spatial ◽

Generic Representation ◽

Novel Model

Graph convolutional networks (GCNs) have proven effective for processing structured data: they capture the features of related nodes and improve model performance. Increasing attention is being paid to employing GCNs in skeleton-based action recognition, but existing GCN-based methods face some challenges. First, the consistency of temporal and spatial features is ignored because features are extracted node by node and frame by frame. We design a generic representation of skeleton sequences for action recognition and propose a novel model called Temporal Graph Networks (TGN), which obtains spatiotemporal features simultaneously. Second, the adjacency matrix of the graph describing the relations between joints mostly depends on the physical connections between joints. We propose a multi-scale graph strategy to appropriately describe the relations between joints in the skeleton graph; it adopts a full-scale graph, a part-scale graph, and a core-scale graph to capture the local features of each joint and the contour features of important joints. Extensive experiments are conducted on two large datasets, NTU RGB+D and Kinetics Skeleton, and the results show that TGN with our graph strategy outperforms other state-of-the-art methods.


Action Recognition Based on the Fusion of Graph Convolutional Networks with High Order Features

Applied Sciences ◽

10.3390/app10041482 ◽

2020 ◽

Vol 10 (4) ◽

pp. 1482 ◽

Cited By ~ 2

Author(s):

Jiuqing Dong ◽

Yongbin Gao ◽

Hyo Jong Lee ◽

Heng Zhou ◽

Yifan Yao ◽

...

Keyword(s):

Neural Network ◽

Action Recognition ◽

Feature Fusion ◽

State Of The Art ◽

Recognition Task ◽

High Order ◽

Recognition Method ◽

Related Research ◽

Convolutional Networks ◽

Temporal Features

Skeleton-based action recognition is a widely studied task in action-related research because of its clear features and its invariance to human appearance and illumination. Furthermore, it can effectively improve the robustness of action recognition. Graph convolutional networks have been applied to skeletal data to recognize actions, and recent studies have shown that graph convolutional neural networks work well in the action recognition task when using spatial and temporal features of skeleton data. The prevalent methods for extracting spatial and temporal features rely purely on a deep network to learn from primitive 3D positions. In this paper, we propose a novel action recognition method that applies high-order spatial and temporal features from skeleton data, such as velocity features, acceleration features, and the relative distance between 3D joints. A multi-stream feature fusion method is adopted to fuse these high-order features. Extensive experiments on two large and challenging datasets, NTU-RGBD and NTU-RGBD-120, indicate that our model achieves state-of-the-art performance.
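The high-order features named above can be sketched from raw 3D joint positions. The abstract does not give the exact definitions, so this sketch assumes velocity is a first-order frame difference, acceleration is a second-order difference, and relative distance is Euclidean; the function names and toy data are hypothetical.

```python
import math


def velocity(frames):
    """First-order temporal difference.

    frames[t][j] is an (x, y, z) tuple for joint j at frame t.
    Returns a sequence one frame shorter than the input.
    """
    return [
        [tuple(b - a for a, b in zip(p0, p1)) for p0, p1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


def acceleration(frames):
    """Second-order temporal difference (the velocity of the velocity)."""
    return velocity(velocity(frames))


def relative_distance(frame, i, j):
    """Euclidean distance between joints i and j within one frame."""
    return math.dist(frame[i], frame[j])


# Toy sequence (assumed values): 3 frames, 2 joints.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
    [(3.0, 0.0, 0.0), (1.0, 6.0, 0.0)],
]
vel = velocity(frames)
acc = acceleration(frames)
dist_01 = relative_distance(frames[0], 0, 1)
```

In a multi-stream setup, each of these derived sequences would feed its own network branch before fusion; how the fusion is done is not specified in the abstract and is not shown here.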


Multi Scale Temporal Graph Networks for Skeleton-Based Action Recognition

10.5121/csit.2020.101605 ◽

2020 ◽

Author(s):

Tingwei Li ◽

Ruiwen Zhang ◽

Qing Li

Keyword(s):

Action Recognition ◽

State Of The Art ◽

Convolutional Networks ◽

Multi Scale ◽

Spatial Features ◽

Skeleton Graph ◽

Temporal Graph ◽

Temporal And Spatial ◽

Generic Representation ◽

Novel Model

Graph convolutional networks (GCNs) can effectively capture the features of related nodes and improve model performance. Increasing attention is being paid to employing GCNs in skeleton-based action recognition, but existing GCN-based methods have two problems. First, the consistency of temporal and spatial features is ignored because features are extracted node by node and frame by frame. To obtain spatiotemporal features simultaneously, we design a generic representation of skeleton sequences for action recognition and propose a novel model called Temporal Graph Networks (TGN). Second, the adjacency matrix of the graph describing the relations between joints mostly depends on the physical connections between joints. To appropriately describe the relations between joints in the skeleton graph, we propose a multi-scale graph strategy that adopts a full-scale graph, a part-scale graph, and a core-scale graph to capture the local features of each joint and the contour features of important joints. Experiments were carried out on two large datasets, and the results show that TGN with our graph strategy outperforms state-of-the-art methods.


Part-Level Graph Convolutional Network for Skeleton-Based Action Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6759 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11045-11052

Author(s):

Linjiang Huang ◽

Yan Huang ◽

Wanli Ouyang ◽

Liang Wang

Keyword(s):

Action Recognition ◽

State Of The Art ◽

The State ◽

Body Parts ◽

Convolutional Network ◽

Joint Level ◽

Convolutional Networks ◽

Level Information ◽

Benchmark Datasets ◽

High Level

Recently, graph convolutional networks have achieved remarkable performance for skeleton-based action recognition. In this work, we identify a problem posed by the GCNs for skeleton-based action recognition, namely part-level action modeling. To address this problem, a novel Part-Level Graph Convolutional Network (PL-GCN) is proposed to capture part-level information of skeletons. Different from previous methods, the partition of body parts is learnable rather than manually defined. We propose two part-level blocks, namely Part Relation block (PR block) and Part Attention block (PA block), which are achieved by two differentiable operations, namely graph pooling operation and graph unpooling operation. The PR block aims at learning high-level relations between body parts while the PA block aims at highlighting the important body parts in the action. Integrating the original GCN with the two blocks, the PL-GCN can learn both part-level and joint-level information of the action. Extensive experiments on two benchmark datasets show the state-of-the-art performance on skeleton-based action recognition and demonstrate the effectiveness of the proposed method.
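Graph pooling and unpooling between joints and parts can be sketched with an assignment matrix. This is a generic illustration, not the PL-GCN blocks themselves: the abstract says the partition is learnable, whereas the matrix below is fixed by hand, and the names `graph_pool` and `graph_unpool` are hypothetical.

```python
def graph_pool(features, assignment):
    """Aggregate joint features into part features.

    features[j][c]: feature of joint j, channel c.
    assignment[j][p]: weight with which joint j belongs to part p
    (learnable in practice; fixed here for illustration).
    Returns pooled[p][c] = sum_j assignment[j][p] * features[j][c].
    """
    num_parts = len(assignment[0])
    num_channels = len(features[0])
    return [
        [
            sum(assignment[j][p] * features[j][c] for j in range(len(features)))
            for c in range(num_channels)
        ]
        for p in range(num_parts)
    ]


def graph_unpool(part_features, assignment):
    """Broadcast part features back to the joints using the same assignment."""
    num_channels = len(part_features[0])
    return [
        [
            sum(row[p] * part_features[p][c] for p in range(len(part_features)))
            for c in range(num_channels)
        ]
        for row in assignment
    ]


# Toy example (assumed values): 4 joints, 2 parts, 1 channel.
joint_features = [[1.0], [2.0], [3.0], [4.0]]
assignment = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
parts = graph_pool(joint_features, assignment)
restored = graph_unpool(parts, assignment)
```

Because both operations are plain weighted sums, they stay differentiable with respect to the assignment weights, which is what allows a partition to be learned rather than manually defined.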


Multi-Stage Attention-Enhanced Sparse Graph Convolutional Network for Skeleton-Based Action Recognition

Electronics ◽

10.3390/electronics10182198 ◽

2021 ◽

Vol 10 (18) ◽

pp. 2198

Author(s):

Chaoyue Li ◽

Lian Zou ◽

Cien Fan ◽

Hao Jiang ◽

Yifeng Liu

Keyword(s):

Action Recognition ◽

Large Scale ◽

Feature Learning ◽

Superior Performance ◽

Sparse Graph ◽

Convolutional Network ◽

Convolutional Networks ◽

Spatial Graph ◽

Multi Stage ◽

The Time Domain

Graph convolutional networks (GCNs), which model human actions as a series of spatial-temporal graphs, have recently achieved superior performance in skeleton-based action recognition. However, the existing methods mostly use the physical connections of joints to construct a spatial graph, resulting in limited topological information of the human skeleton. In addition, the action features in the time domain have not been fully explored. To better extract spatial-temporal features, we propose a multi-stage attention-enhanced sparse graph convolutional network (MS-ASGCN) for skeleton-based action recognition. To capture more abundant joint dependencies, we propose a new strategy for constructing skeleton graphs. This simulates bidirectional information flows between neighboring joints and pays greater attention to the information transmission between sparse joints. In addition, a part attention mechanism is proposed to learn the weight of each part and enhance the part-level feature learning. We introduce multiple streams of different stages and merge them in specific layers of the network to further improve the performance of the model. Our model is finally verified on two large-scale datasets, namely NTU-RGB+D and Skeleton-Kinetics. Experiments demonstrate that the proposed MS-ASGCN outperformed the previous state-of-the-art methods on both datasets.


Efficient End-to-End Sentence-Level Lipreading with Temporal Convolutional Networks

Applied Sciences ◽

10.3390/app11156975 ◽

2021 ◽

Vol 11 (15) ◽

pp. 6975

Author(s):

Tao Zhang ◽

Lun He ◽

Xudong Li ◽

Guoqing Feng

Keyword(s):

Performance Improvement ◽

State Of The Art ◽

Error Rates ◽

Convolutional Network ◽

Convolutional Networks ◽

Sentence Level ◽

End To End ◽

High Level ◽

Improved Accuracy ◽

Talking Face

Lipreading aims to recognize sentences being spoken by a talking face. In recent years, lipreading methods have achieved a high level of accuracy on large datasets and made breakthrough progress. However, lipreading is still far from solved: existing methods tend to have high error rates on in-the-wild data and suffer from vanishing training gradients and slow convergence. To overcome these problems, we propose an efficient end-to-end sentence-level lipreading model, using an encoder based on a 3D convolutional network, ResNet50, and a Temporal Convolutional Network (TCN), with a CTC objective function as the decoder. More importantly, the proposed architecture incorporates the TCN as a feature learner to decode features. It can partly eliminate the vanishing gradients and insufficient performance of RNNs (LSTM, GRU), which yields a notable performance improvement as well as faster convergence. Experiments show that training and convergence are 50% faster than with the state-of-the-art method, and that accuracy improves by 2.4% on the GRID dataset.


MR-GCN: Multi-Relational Graph Convolutional Networks based on Generalized Tensor Product

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/175 ◽

2020 ◽

Author(s):

Zhichao Huang ◽

Xutao Li ◽

Yunming Ye ◽

Michael K. Ng

Keyword(s):

Tensor Product ◽

Convolution Operator ◽

State Of The Art ◽

Single Type ◽

Convolutional Network ◽

Convolutional Networks ◽

Node Classification ◽

Relational Graphs ◽

Eigen Decomposition ◽

Single Relation

Graph Convolutional Networks (GCNs) have been extensively studied in recent years. Most existing GCN approaches are designed for homogeneous graphs with a single type of relation. However, heterogeneous graphs with multiple types of relations are also ubiquitous, and there is a lack of methodologies to tackle such graphs. Some previous studies address the issue by performing a conventional GCN on each single relation and then blending the results. However, because the convolutional kernels neglect the correlations across relations, this strategy is sub-optimal. In this paper, we propose the Multi-Relational Graph Convolutional Network (MR-GCN) framework by developing a novel convolution operator on multi-relational graphs. In particular, our multi-dimension convolution operator extends graph spectral analysis into the eigen-decomposition of a Laplacian tensor. The eigen-decomposition is formulated with a generalized tensor product, which can correspond to any unitary transform instead of being limited to the Fourier transform. We conduct comprehensive experiments on four real-world multi-relational graphs to solve the semi-supervised node classification task, and the results show the superiority of MR-GCN over state-of-the-art competitors.


Y-Net: Dual-branch Joint Network for Semantic Segmentation

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3460940 ◽

2021 ◽

Vol 17 (4) ◽

pp. 1-22

Author(s):

Yizhen Chen ◽

Haifeng Hu

Keyword(s):

Feature Vector ◽

State Of The Art ◽

Computational Cost ◽

Receptive Fields ◽

Semantic Segmentation ◽

Global Context ◽

Multi Level ◽

The One ◽

Public Datasets ◽

High Level

Most existing segmentation networks are built upon a “U-shaped” encoder–decoder structure, where the multi-level features extracted by the encoder are gradually aggregated by the decoder. Although this structure has proven effective in improving segmentation performance, it has two main drawbacks. On the one hand, the introduction of low-level features brings a significant increase in computation without an obvious performance gain. On the other hand, general feature aggregation strategies such as addition and concatenation fuse features without considering the usefulness of each feature vector, which mixes useful information with substantial noise. In this article, we abandon the traditional “U-shaped” architecture and propose Y-Net, a dual-branch joint network for accurate semantic segmentation. Specifically, it aggregates only the high-level, low-resolution features and uses the global context guidance generated by the first branch to refine the second branch. The two branches are connected through a Semantic Enhancing Module, which can be regarded as a combination of spatial attention and channel attention. We also design a novel Channel-Selective Decoder (CSD) to adaptively integrate features from different receptive fields by assigning specific channel-wise weights, where the weights are input-dependent. Y-Net breaks through the limits of single-branch networks and attains higher performance at a lower computational cost than the “U-shaped” structure. The proposed CSD can better integrate useful information and suppress interfering noise. Comprehensive experiments are carried out on three public datasets to evaluate the effectiveness of our method. Y-Net achieves state-of-the-art performance on the PASCAL VOC 2012, PASCAL Person-Part, and ADE20K datasets without pre-training on extra datasets.


Visual-Semantic Graph Reasoning for Pedestrian Attribute Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018634 ◽

2019 ◽

Vol 33 ◽

pp. 8634-8641 ◽

Cited By ~ 4

Author(s):

Qiaozhe Li ◽

Xin Zhao ◽

Ran He ◽

Kaiqi Huang

Keyword(s):

Large Scale ◽

State Of The Art ◽

Relational Learning ◽

Spatial Relations ◽

Semantic Relations ◽

Prediction Problem ◽

Convolutional Network ◽

Semantic Graph ◽

Spatial Graph ◽

Attribute Recognition

Pedestrian attribute recognition in surveillance is a challenging task due to poor image quality, significant appearance variations, and the diverse spatial distribution of different attributes. This paper treats pedestrian attribute recognition as a sequential attribute prediction problem and proposes a novel visual-semantic graph reasoning framework to address it. Our framework contains a spatial graph and a directed semantic graph. By performing reasoning with a Graph Convolutional Network (GCN), one graph captures spatial relations between regions and the other learns potential semantic relations between attributes. An end-to-end architecture is presented to perform mutual embedding between the two graphs so that each guides the relational learning of the other. We verify the proposed framework on three large-scale pedestrian attribute datasets: PETA, RAP, and PA100k. Experiments show the superiority of the proposed method over state-of-the-art methods and the effectiveness of our joint GCN structures for sequential attribute prediction.
