Multi-Term Attention Networks for Skeleton-Based Action Recognition

2020 ◽  
Vol 10 (15) ◽  
pp. 5326
Author(s):  
Xiaolei Diao ◽  
Xiaoqiang Li ◽  
Chen Huang

The same action can take different amounts of time in different instances, and this variation degrades action recognition accuracy to a certain extent. We propose an end-to-end deep neural network called “Multi-Term Attention Networks” (MTANs), which addresses this problem by extracting temporal features at different time scales. The network consists of a Multi-Term Attention Recurrent Neural Network (MTA-RNN) and a Spatio-Temporal Convolutional Neural Network (ST-CNN). In the MTA-RNN, a method for fusing multi-term temporal features is proposed to capture temporal dependencies at different time scales, and the weighted fused temporal feature is recalibrated by an attention mechanism. Ablation studies show that the network has strong spatio-temporal dynamic modeling capability for actions of different durations. We perform extensive experiments on four challenging benchmark datasets: NTU RGB+D, UT-Kinect, Northwestern-UCLA, and UWA3DII. Our method outperforms state-of-the-art baselines, demonstrating the effectiveness of MTANs.
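The abstract does not include code, but the fusion idea can be illustrated concretely. Below is a minimal PyTorch sketch of multi-term temporal fusion, assuming one recurrent branch per time scale (obtained here by frame subsampling) and a learned softmax weighting over the branches; the module name, scale set, and layer sizes are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class MultiTermFusion(nn.Module):
    """Sketch: fuse temporal features extracted at several time scales,
    with an attention weight over the terms. Hypothetical design, not
    the published MTA-RNN."""
    def __init__(self, in_dim, hidden_dim, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # one GRU per time scale; each sees the sequence subsampled by its scale
        self.rnns = nn.ModuleList(
            [nn.GRU(in_dim, hidden_dim, batch_first=True) for _ in scales]
        )
        # attention weights over the terms (time scales)
        self.attn = nn.Sequential(
            nn.Linear(hidden_dim * len(scales), len(scales)),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):                       # x: (batch, frames, in_dim)
        terms = []
        for scale, rnn in zip(self.scales, self.rnns):
            out, _ = rnn(x[:, ::scale])         # subsample frames by the scale
            terms.append(out[:, -1])            # last hidden state per term
        stacked = torch.stack(terms, dim=1)     # (batch, n_scales, hidden)
        w = self.attn(stacked.flatten(1))       # (batch, n_scales)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)   # weighted fusion

feat = MultiTermFusion(in_dim=75, hidden_dim=128)(torch.randn(8, 60, 75))
print(feat.shape)  # torch.Size([8, 128])
```

Here 75 input channels stand for 25 joints × 3D coordinates, as in NTU RGB+D skeletons; the actual recalibration network in the paper may differ.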

Author(s):  
C. Indhumathi ◽  
V. Murugan ◽  
G. Muthulakshmii

Nowadays, action recognition has gained increasing attention from the computer vision community. Human actions are typically recognized by extracting spatial and temporal features, and two-stream convolutional neural networks are commonly used for human action recognition in videos. In this paper, an Adaptive motion Attentive Correlated Temporal Feature (ACTF) is used as the temporal feature extractor. Inter-frame temporal average pooling is used to extract the inter-frame regional correlation feature and the mean feature. The proposed method achieves accuracies of 96.9% on UCF101 and 74.6% on HMDB51, higher than those of other state-of-the-art methods.
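The exact ACTF formulation is not spelled out in the abstract, but the two pooled quantities it names can be sketched. The following is a minimal illustration, under the assumption that per-frame feature vectors are averaged over time (the mean feature) and that inter-frame correlation is approximated by cosine similarity of consecutive frames; both choices are this sketch's assumptions.

```python
import torch

def inter_frame_features(feats):
    """Sketch: from per-frame features (batch, frames, channels), compute
    a temporal mean feature and a pooled inter-frame correlation feature.
    Illustrative only; not the published ACTF definition."""
    mean_feat = feats.mean(dim=1)                  # temporal average pooling
    # correlation of consecutive frames via normalized dot product
    a, b = feats[:, :-1], feats[:, 1:]
    corr = torch.cosine_similarity(a, b, dim=-1)   # (batch, frames-1)
    corr_feat = corr.mean(dim=1, keepdim=True)     # pooled correlation
    return torch.cat([mean_feat, corr_feat], dim=-1)

out = inter_frame_features(torch.randn(4, 16, 512))
print(out.shape)  # torch.Size([4, 513])
```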


Author(s):  
Chunyu Xie ◽  
Ce Li ◽  
Baochang Zhang ◽  
Chen Chen ◽  
Jungong Han ◽  
...  

The skeleton-based action recognition task is entangled with complex spatio-temporal variations of skeleton joints and remains challenging for Recurrent Neural Networks (RNNs). In this work, we propose a temporal-then-spatial recalibration scheme to alleviate such complex variations, resulting in end-to-end Memory Attention Networks (MANs) that consist of a Temporal Attention Recalibration Module (TARM) and a Spatio-Temporal Convolution Module (STCM). Specifically, the TARM is deployed in a residual learning module that employs a novel attention learning network to recalibrate the temporal attention of frames in a skeleton sequence. The STCM treats the attention-recalibrated skeleton joint sequences as images and leverages Convolutional Neural Networks (CNNs) to further model the spatial and temporal information of skeleton data. These two modules (TARM and STCM) seamlessly form a single network architecture that can be trained end-to-end. MANs significantly boost the performance of skeleton-based action recognition and achieve the best results on four challenging benchmark datasets: NTU RGB+D, HDM05, SYSU-3D, and UT-Kinect.
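A TARM-style residual recalibration can be sketched in a few lines: learn a per-frame attention score, reweight the sequence, and keep a residual path so the raw skeleton signal is preserved. This is a hedged illustration; the class name, scoring network, and sizes below are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TemporalRecalibration(nn.Module):
    """Sketch of a TARM-like residual block: learn a per-frame attention
    weight and recalibrate the skeleton sequence. Illustrative only."""
    def __init__(self, joint_dim, hidden=64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(joint_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):                         # x: (batch, frames, joint_dim)
        a = torch.softmax(self.score(x), dim=1)   # attention over frames
        return x + a * x                          # residual recalibration

seq = torch.randn(2, 100, 75)                     # 100 frames, 25 joints x 3D
print(TemporalRecalibration(75)(seq).shape)       # torch.Size([2, 100, 75])
```

The recalibrated sequence could then be reshaped into an image-like tensor and passed to a CNN, which is the role the abstract assigns to the STCM.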


2020 ◽  
Vol 10 (4) ◽  
pp. 1482 ◽  
Author(s):  
Jiuqing Dong ◽  
Yongbin Gao ◽  
Hyo Jong Lee ◽  
Heng Zhou ◽  
Yifan Yao ◽  
...  

Skeleton-based action recognition is widely used in action-related research because skeletal data provide clear features that are invariant to human appearance and illumination, which also improves the robustness of action recognition. Graph convolutional networks have been applied to such skeletal data to recognize actions, and recent studies show that they work well when using the spatial and temporal features of skeleton data. However, prevalent methods extract these spatial and temporal features by relying purely on a deep network to learn from primitive 3D joint positions. In this paper, we propose a novel action recognition method that applies high-order spatial and temporal features derived from skeleton data, such as velocity features, acceleration features, and relative distances between 3D joints. A multi-stream feature fusion method is adopted to fuse these high-order features. Extensive experiments on two large and challenging datasets, NTU-RGBD and NTU-RGBD-120, indicate that our model achieves state-of-the-art performance.
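The named high-order features follow directly from finite differences and pairwise geometry, so they are easy to sketch. A minimal version, assuming raw joints shaped (batch, frames, joints, 3); the padding and layout choices are this sketch's, and each returned tensor would feed one stream of a multi-stream fusion model.

```python
import torch

def high_order_features(joints):
    """Sketch: derive velocity, acceleration, and pairwise relative
    distances from raw 3D joints (batch, frames, joints, 3)."""
    vel = joints[:, 1:] - joints[:, :-1]               # first temporal difference
    acc = vel[:, 1:] - vel[:, :-1]                     # second temporal difference
    # pairwise relative distances between joints within each frame
    diff = joints.unsqueeze(3) - joints.unsqueeze(2)   # (B, T, J, J, 3)
    dist = diff.norm(dim=-1)                           # (B, T, J, J)
    return vel, acc, dist

v, a, d = high_order_features(torch.randn(2, 300, 25, 3))
print(v.shape, a.shape, d.shape)
# torch.Size([2, 299, 25, 3]) torch.Size([2, 298, 25, 3]) torch.Size([2, 300, 25, 25])
```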


Sensors ◽  
2019 ◽  
Vol 19 (2) ◽  
pp. 423 ◽  
Author(s):  
Mihai Trăscău ◽  
Mihai Nan ◽  
Adina Florea

Robust action recognition methods are a cornerstone of Ambient Assisted Living (AAL) systems employing optical devices. Using 3D skeleton joints extracted from depth images taken with time-of-flight (ToF) cameras has been a popular solution for these tasks. Though seemingly scarce in information compared to its RGB or depth-image counterparts, the skeletal representation has proven effective for action recognition. This paper explores different interpretations of both the spatial and the temporal dimensions of a sequence of frames describing an action. We show that rather intuitive approaches, often borrowed from other computer vision tasks, can improve accuracy. We report results based on these modifications and propose an architecture that uses temporal convolutions, with results comparable to the state of the art.
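For readers unfamiliar with temporal convolutions on skeleton data, the basic building block is a 1D convolution along the time axis with the flattened joint vector as channels. A minimal sketch follows; the block structure, kernel size, and channel counts are assumptions for illustration, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Sketch of a temporal-convolution block over skeleton sequences:
    1D convolution along time, joints flattened into channels."""
    def __init__(self, in_ch, out_ch, k=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2),  # same-length output
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
        )

    def forward(self, x):          # x: (batch, channels=joint_dim, frames)
        return self.net(x)

seq = torch.randn(4, 75, 120)      # 25 joints x 3D over 120 frames
print(TemporalConvBlock(75, 128)(seq).shape)  # torch.Size([4, 128, 120])
```

Stacking such blocks (optionally with dilation) widens the temporal receptive field without recurrence, which is the usual argument for temporal convolutions over RNNs.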


2018 ◽  
Vol 4 (9) ◽  
pp. 107 ◽  
Author(s):  
Mohib Ullah ◽  
Ahmed Mohammed ◽  
Faouzi Alaya Cheikh

Articulation modeling, feature extraction, and classification are the main components of pedestrian segmentation. These components are usually modeled independently of each other and then combined sequentially, an approach that is prone to poor segmentation if any individual component is weakly designed. To address this problem, we propose a spatio-temporal convolutional neural network named PedNet, which exploits temporal information for spatial segmentation. The backbone of PedNet is an encoder–decoder network that downsamples and then upsamples the feature maps. The input to the network is a set of three frames, and the output is a binary mask of the segmented regions in the middle frame. Unlike classical deep models in which the convolutional layers are followed by a fully connected layer for classification, PedNet is a Fully Convolutional Network (FCN): it is trained end-to-end, and segmentation is achieved without any pre- or post-processing. The main characteristic of PedNet is its design: it performs segmentation on a frame-by-frame basis but uses temporal information from the previous and the next frame to segment the pedestrians in the current frame. Moreover, to combine low-level features with the high-level semantic information learned by the deeper layers, we use long skip connections from the encoder to the decoder and concatenate the outputs of low-level layers with those of higher-level layers, which helps to produce segmentation maps with sharp boundaries. To show the benefit of temporal information, we also visualized different layers of the network; the visualizations show that the network learns different information from consecutive frames and combines it to segment the middle frame. We evaluated our approach on eight challenging datasets in which humans are involved in different activities with severe articulation (football, road crossing, surveillance). On the widely used CamVid segmentation benchmark, PedNet is compared against seven state-of-the-art methods. Performance is reported in terms of precision/recall, F1, F2, and mIoU. The qualitative and quantitative results show that PedNet achieves promising results, with substantial improvement over state-of-the-art methods on all performance metrics.
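The three-frame-in, middle-mask-out design with a long skip connection can be shown in a toy model. The sketch below assumes three stacked RGB frames (9 input channels) and uses one downsample/upsample stage with a concatenating skip; channel counts and depth are deliberately tiny and are not the real PedNet.

```python
import torch
import torch.nn as nn

class TinyPedNet(nn.Module):
    """Sketch of the PedNet idea: three stacked frames in, an
    encoder-decoder with a long skip connection, and a binary mask
    for the middle frame out. Toy-sized, illustrative only."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(9, 32, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU())
        # after the long skip, decoder and encoder features are concatenated
        self.head = nn.Conv2d(64, 1, 1)

    def forward(self, frames):                    # frames: (batch, 9, H, W)
        e1 = self.enc1(frames)                    # low-level encoder features
        u = self.up(self.down(e1))                # bottleneck + upsampling
        return torch.sigmoid(self.head(torch.cat([u, e1], dim=1)))  # long skip

mask = TinyPedNet()(torch.randn(1, 9, 128, 128))
print(mask.shape)  # torch.Size([1, 1, 128, 128])
```

The concatenation at the end is what the abstract's long skip connections do: it reinjects low-level spatial detail so the predicted boundaries stay sharp after upsampling.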


Author(s):  
Xiaobin Zhu ◽  
Zhuangzi Li ◽  
Xiao-Yu Zhang ◽  
Changsheng Li ◽  
Yaqi Liu ◽  
...  

Video super-resolution is a challenging task that has attracted great attention in the research and industry communities. In this paper, we propose a novel end-to-end architecture, the Residual Invertible Spatio-Temporal Network (RISTN), for video super-resolution. RISTN fully exploits spatial information when mapping from low resolution to high resolution and effectively models temporal consistency across consecutive video frames. Compared with existing recurrent convolutional approaches, RISTN is much deeper yet more efficient. It consists of three major components: in the spatial component, a lightweight residual invertible block is designed to reduce information loss during feature transformation and provide robust feature representations; in the temporal component, a novel recurrent convolutional model with residual dense connections is proposed to build a deeper network and avoid feature degradation; in the reconstruction component, a new sparse-strategy-based fusion method is proposed to integrate the spatial and temporal features. Experiments on public benchmark datasets demonstrate that RISTN outperforms state-of-the-art methods.
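The key property of an invertible block, that the input can be recovered exactly so no information is lost in the feature transform, is easiest to see in an additive-coupling block of the kind used in invertible networks generally. The sketch below demonstrates that property; it is a generic coupling block, not the specific residual invertible block defined in RISTN.

```python
import torch
import torch.nn as nn

class InvertibleBlock(nn.Module):
    """Sketch of an additive-coupling invertible block: split the
    channels, update one half from the other, so the input is exactly
    recoverable. Generic illustration, not the RISTN block."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.f = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.ReLU(),
            nn.Conv2d(half, half, 3, padding=1),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        y2 = x2 + self.f(x1)                   # coupling update
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        return torch.cat([y1, y2 - self.f(y1)], dim=1)  # exact recovery

blk = InvertibleBlock(64)
x = torch.randn(1, 64, 32, 32)
print(torch.allclose(blk.inverse(blk(x)), x, atol=1e-6))  # True
```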

