Hierarchy Spatial-Temporal Transformer for Action Recognition in Short Videos

Fuzzy Systems and Data Mining VI - Frontiers in Artificial Intelligence and Applications ◽

10.3233/faia200754 ◽

2020 ◽

Author(s):

Guoyong Cai ◽

Yumeng Cai

Keyword(s):

Spatial Attention ◽

Action Recognition ◽

Feature Learning ◽

Feature Maps ◽

Spatial Feature ◽

Feature Map ◽

Proposed Model ◽

Stream Architecture ◽

3D Cnn ◽

Transformer Model

Short videos action recognition based on deep learning has made a series of important progress; most of the proposed methods are based on 3D Convolution neural networks (3D CNN) and Two Stream architecture. However, 3D CNN has a large number of parameters and Two Stream networks cannot learn features well enough. This work aims to build a network to learn better features and reduce the scale of parameters. A Hierarchy Spatial-Temporal Transformer model is proposed, which is based on Two Stream architecture and hierarchy inference. The model is divided into three modules: Hierarchy Residual Reformer, Spatial Attention Module, and Temporal-Spatial Attention Module. In the model, each frame’s image is firstly transformed into a spatial visual feature map. Secondly, spatial feature learning is performed by spatial attention to generating attention spatial feature maps. Finally, the generated attention spatial feature map is incorporated with temporal feature vectors to generate a final representation for classification experiments. Experiment results in the hmdb51 and ucf101 data set showed that the proposed model achieved better accuracy than the state-of-art baseline models

Download Full-text

Global Co-Occurrence Feature and Local Spatial Feature Learning for Skeleton-Based Action Recognition

Entropy ◽

10.3390/e22101135 ◽

2020 ◽

Vol 22 (10) ◽

pp. 1135

Author(s):

Jun Xie ◽

Wentian Xin ◽

Ruyi Liu ◽

Qiguang Miao ◽

Lijie Sheng ◽

...

Keyword(s):

Spatial Structure ◽

Action Recognition ◽

Large Scale ◽

Recent Progress ◽

Feature Fusion ◽

Model Performance ◽

Feature Learning ◽

Learning Model ◽

Spatial Feature ◽

Convolutional Networks

Recent progress on skeleton-based action recognition has been substantial, benefiting mostly from the explosive development of Graph Convolutional Networks (GCN). However, prevailing GCN-based methods may not effectively capture the global co-occurrence features among joints and the local spatial structure features composed of adjacent bones. They also ignore the effect of channels unrelated to action recognition on model performance. Accordingly, to address these issues, we propose a Global Co-occurrence feature and Local Spatial feature learning model (GCLS) consisting of two branches. The first branch, based on the Vertex Attention Mechanism branch (VAM-branch), captures the global co-occurrence feature of actions effectively; the second, based on the Cross-kernel Feature Fusion branch (CFF-branch), extracts local spatial structure features composed of adjacent bones and restrains the channels unrelated to action recognition. Extensive experiments on two large-scale datasets, NTU-RGB+D and Kinetics, demonstrate that GCLS achieves the best performance when compared to the mainstream approaches.

Download Full-text

The Multidimensional Motion Features of Spatial Depth Feature Maps: An Effective Motion Information Representation Method for Video-Based Action Recognition

Mathematical Problems in Engineering ◽

10.1155/2021/6670087 ◽

2021 ◽

Vol 2021 ◽

pp. 1-15

Author(s):

Hongshi Ou ◽

Jifeng Sun

Keyword(s):

Neural Network ◽

Optical Flow ◽

Action Recognition ◽

Spatial Information ◽

Motion Information ◽

Video Classification ◽

Feature Maps ◽

Feature Map ◽

Motion Features ◽

Depth Feature

In video action recognition based on deep learning, the design of the neural network is focused on how to acquire effective spatial information and motion information quickly. This paper proposes a kind of deep network that can obtain both spatial information and motion information in video classification. It is called MDFs (the multidimensional motion features of deep feature map net). This method can be used to obtain spatial information and motion information in videos only by importing image frame data into a neural network. MDFs originate from the definition of 3D convolution. Multiple 3D convolution kernels with different information focuses are used to act on depth feature maps so as to obtain effective motion information at both spatial and temporal. On the other hand, we split the 3D convolution at space dimension and time dimension, and the spatial network feature map has reduced the dimensions of the original frame image data, which realizes the mitigation of computing resources of the multichannel grouped 3D convolutional network. In order to realize the region weight differentiation of spatial features, a spatial feature weighted pooling layer based on the spatial-temporal motion information guide is introduced to realize the attention to high recognition information. By means of multilevel LSTM, we realize the fusion between global semantic information acquisition and depth features at different levels so that the fully connected layers with rich classification information can provide frame attention mechanism for the spatial information layer. MDFs need only to act on RGB images. Through experiments on three universal experimental datasets of action recognition, UCF10, UCF11, and HMDB51, it is concluded that the MDF network can achieve an accuracy comparable to two streams (RGB and optical flow) that requires the import of both frame data and optical flow data in video classification tasks.

Download Full-text

CSA-MSO3DCNN: Multiscale Octave 3D CNN with Channel and Spatial Attention for Hyperspectral Image Classification

Remote Sensing ◽

10.3390/rs12010188 ◽

2020 ◽

Vol 12 (1) ◽

pp. 188 ◽

Cited By ~ 2

Author(s):

Qin Xu ◽

Yong Xiao ◽

Dongyue Wang ◽

Bin Luo

Keyword(s):

Spatial Attention ◽

Hyperspectral Image ◽

Image Data ◽

Classification Performance ◽

Data Sets ◽

Feature Maps ◽

Contextual Feature ◽

Hyperspectral Image Data ◽

3D Cnn ◽

Spatial Redundancy

3D convolutional neural networks (CNNs) have been demonstrated to be a powerful tool in hyperspectral images (HSIs) classification. However, using the conventional 3D CNNs to extract the spectral–spatial feature for HSIs results in too many parameters as HSIs have plenty of spatial redundancy. To address this issue, in this paper, we first design multiscale convolution to extract the contextual feature of different scales for HSIs and then propose to employ the octave 3D CNN which factorizes the mixed feature maps by their frequency to replace the normal 3D CNN in order to reduce the spatial redundancy and enlarge the receptive field. To further explore the discriminative features, a channel attention module and a spatial attention module are adopted to optimize the feature maps and improve the classification performance. The experiments on four hyperspectral image data sets demonstrate that the proposed method outperforms other state-of-the-art deep learning methods.

Download Full-text

Multi-layer attention for person re-identification

MATEC Web of Conferences ◽

10.1051/matecconf/201927702025 ◽

2019 ◽

Vol 277 ◽

pp. 02025

Author(s):

Yuele Zhang ◽

Jie Guo ◽

Zheng Huang ◽

Weidong Qiu ◽

Hexiaohui Fan

Keyword(s):

Spatial Attention ◽

State Of The Art ◽

Metric Learning ◽

Feature Maps ◽

Factors Affecting ◽

Attention Model ◽

Proposed Model ◽

Learning Functions ◽

Surveillance Analysis ◽

Different Levels

Person re-identification has been a significant application in the field of video surveillance analysis, yet it remains a challenging work to recognize the person of interest across disjoint cameras of different viewpoints. The factors affecting the identification results include the variation in background, different illumination conditions and the changes of human body poses. Existing person re-identification methods mainly focus on the feature extraction of the whole frame and metric learning functions. However, most of those algorithms treat different areas without distinction. It is worth emphasizing that different local regions make different contributions to image representaion, which exactly conforms to the attention mechanism. In this paper, we introduce a novel attention network which explores spatial attention in a convolutional neural network. Our algorithm learns the visual attention in multi-layer feature maps. The proposed model not only pays attention to the spatial probabilities of local regions, but also takes the features in different levels into consideration. We evaluate this multi-layer spatial attention model on three benchmark person re-identification datasets: Market-1501, CUHK03, and DukeMTMC-reID. The experiment results validate the advances of our adopted network by comparing with state-of-the-art baselines.

Download Full-text

Fusing Dynamic Images and Depth Motion Maps for Action Recognition in Surveillance Systems

International Journal of Sensors Wireless Communications and Control ◽

10.2174/2210327909666191209155141 ◽

2019 ◽

Vol 09 ◽

Author(s):

Rajat Khurana ◽

Alok Kumar Singh Kushwaha

Keyword(s):

Action Recognition ◽

Depth Map ◽

Good Choice ◽

Surveillance Systems ◽

Activity Detection ◽

Learning Capability ◽

Proposed Model ◽

Care Activity ◽

Increasing Demand ◽

Depth Motion Maps

Background & Objective: Identification of human actions from video has gathered much attention in past few years. Most of the computer vision tasks such as Health Care Activity Detection, Suspicious Activity detection, Human Computer Interactions etc. are based on the principle of activity detection. Automatic labelling of activity from videos frames is known as activity detection. Motivation of this work is to use most out of the data generated from sensors and use them for recognition of classes. Recognition of actions from videos sequences is a growing field with the upcoming trends of deep neural networks. Automatic learning capability of Convolutional Neural Network (CNN) make them good choice as compared to traditional handcrafted based approaches. With the increasing demand of RGB-D sensors combination of RGB and depth data is in great demand. This work comprises of the use of dynamic images generated from RGB combined with depth map for action recognition purpose. We have experimented our approach on pre trained VGG-F model using MSR Daily activity dataset and UTD MHAD Dataset. We achieve state of the art results. To support our research, we have calculated different parameters apart from accuracy such as precision, F score, recall. Conclusion: Accordingly, the investigation confirms improvement in term of accuracy, precision, F-Score and Recall. The proposed model is 4 Stream model is prone to occlusion, used in real time and also the data from the RGB-D sensor is fully utilized.

Download Full-text

A Self-Spatial Adaptive Weighting Based U-Net for Image Segmentation

Electronics ◽

10.3390/electronics10030348 ◽

2021 ◽

Vol 10 (3) ◽

pp. 348

Author(s):

Choongsang Cho ◽

Young Han Lee ◽

Jongyoul Park ◽

Sangkeun Lee

Keyword(s):

Image Segmentation ◽

Medical Image ◽

Medical Image Segmentation ◽

Feature Maps ◽

Data Set ◽

Feature Map ◽

Adaptive Weighting ◽

Spatially Adaptive ◽

Wide Range ◽

Decoder Architecture

Semantic image segmentation has a wide range of applications. When it comes to medical image segmentation, its accuracy is even more important than those of other areas because the performance gives useful information directly applicable to disease diagnosis, surgical planning, and history monitoring. The state-of-the-art models in medical image segmentation are variants of encoder-decoder architecture, which is called U-Net. To effectively reflect the spatial features in feature maps in encoder-decoder architecture, we propose a spatially adaptive weighting scheme for medical image segmentation. Specifically, the spatial feature is estimated from the feature maps, and the learned weighting parameters are obtained from the computed map, since segmentation results are predicted from the feature map through a convolutional layer. Especially in the proposed networks, the convolutional block for extracting the feature map is replaced with the widely used convolutional frameworks: VGG, ResNet, and Bottleneck Resent structures. In addition, a bilinear up-sampling method replaces the up-convolutional layer to increase the resolution of the feature map. For the performance evaluation of the proposed architecture, we used three data sets covering different medical imaging modalities. Experimental results show that the network with the proposed self-spatial adaptive weighting block based on the ResNet framework gave the highest IoU and DICE scores in the three tasks compared to other methods. In particular, the segmentation network combining the proposed self-spatially adaptive block and ResNet framework recorded the highest 3.01% and 2.89% improvements in IoU and DICE scores, respectively, in the Nerve data set. Therefore, we believe that the proposed scheme can be a useful tool for image segmentation tasks based on the encoder-decoder architecture.

Download Full-text

Low-Cost Embedded System Using Convolutional Neural Networks-Based Spatiotemporal Feature Map for Real-Time Human Action Recognition

Applied Sciences ◽

10.3390/app11114940 ◽

2021 ◽

Vol 11 (11) ◽

pp. 4940

Author(s):

Jinsoo Kim ◽

Jeongho Cho

Keyword(s):

Embedded System ◽

Real Time ◽

Action Recognition ◽

Processing Speed ◽

Recognition Accuracy ◽

Low Cost ◽

Human Action Recognition ◽

Human Action ◽

Video Data ◽

Feature Maps

The field of research related to video data has difficulty in extracting not only spatial but also temporal features and human action recognition (HAR) is a representative field of research that applies convolutional neural network (CNN) to video data. The performance for action recognition has improved, but owing to the complexity of the model, some still limitations to operation in real-time persist. Therefore, a lightweight CNN-based single-stream HAR model that can operate in real-time is proposed. The proposed model extracts spatial feature maps by applying CNN to the images that develop the video and uses the frame change rate of sequential images as time information. Spatial feature maps are weighted-averaged by frame change, transformed into spatiotemporal features, and input into multilayer perceptrons, which have a relatively lower complexity than other HAR models; thus, our method has high utility in a single embedded system connected to CCTV. The results of evaluating action recognition accuracy and data processing speed through challenging action recognition benchmark UCF-101 showed higher action recognition accuracy than the HAR model using long short-term memory with a small amount of video frames and confirmed the real-time operational possibility through fast data processing speed. In addition, the performance of the proposed weighted mean-based HAR model was verified by testing it in Jetson NANO to confirm the possibility of using it in low-cost GPU-based embedded systems.

Download Full-text

Channel-Wise Spatial Attention with Spatiotemporal Heterogeneous Framework for Action Recognition

Proceedings of the 2020 6th International Conference on Computing and Artificial Intelligence ◽

10.1145/3404555.3404592 ◽

2020 ◽

Author(s):

Yiying Li ◽

Yulin Li ◽

Yanfei Gu

Keyword(s):

Spatial Attention ◽

Action Recognition

Download Full-text

Action Recognition Network Using Stacked Short-Term Deep Features and Bidirectional Moving Average

Applied Sciences ◽

10.3390/app11125563 ◽

2021 ◽

Vol 11 (12) ◽

pp. 5563

Author(s):

Jinsol Ha ◽

Joongchol Shin ◽

Hasil Park ◽

Joonki Paik

Keyword(s):

Action Recognition ◽

High Speed ◽

Moving Average ◽

Video Clip ◽

Visual Surveillance ◽

Temporal Information ◽

Feature Maps ◽

Short Term ◽

Accurate Analysis ◽

Difference Images

Action recognition requires the accurate analysis of action elements in the form of a video clip and a properly ordered sequence of the elements. To solve the two sub-problems, it is necessary to learn both spatio-temporal information and the temporal relationship between different action elements. Existing convolutional neural network (CNN)-based action recognition methods have focused on learning only spatial or temporal information without considering the temporal relation between action elements. In this paper, we create short-term pixel-difference images from the input video, and take the difference images as an input to a bidirectional exponential moving average sub-network to analyze the action elements and their temporal relations. The proposed method consists of: (i) generation of RGB and differential images, (ii) extraction of deep feature maps using an image classification sub-network, (iii) weight assignment to extracted feature maps using a bidirectional, exponential, moving average sub-network, and (iv) late fusion with a three-dimensional convolutional (C3D) sub-network to improve the accuracy of action recognition. Experimental results show that the proposed method achieves a higher performance level than existing baseline methods. In addition, the proposed action recognition network takes only 0.075 seconds per action class, which guarantees various high-speed or real-time applications, such as abnormal action classification, human–computer interaction, and intelligent visual surveillance.

Download Full-text

A Novel Detector Based on Convolution Neural Networks for Multiscale SAR Ship Detection in Complex Background

Sensors ◽

10.3390/s20092547 ◽

2020 ◽

Vol 20 (9) ◽

pp. 2547 ◽

Cited By ~ 2

Author(s):

Wenxin Dai ◽

Yuqing Mao ◽

Rongao Yuan ◽

Yijing Liu ◽

Xuemei Pu ◽

...

Keyword(s):

Region Of Interest ◽

Feature Representation ◽

Feature Maps ◽

Sar Images ◽

Feature Map ◽

The Public ◽

Single Feature ◽

Great Performance ◽

Residual Block ◽

Fusion Feature

Convolution neural network (CNN)-based detectors have shown great performance on ship detections of synthetic aperture radar (SAR) images. However, the performance of current models has not been satisfactory enough for detecting multiscale ships and small-size ones in front of complex backgrounds. To address the problem, we propose a novel SAR ship detector based on CNN, which consist of three subnetworks: the Fusion Feature Extractor Network (FFEN), Region Proposal Network (RPN), and Refine Detection Network (RDN). Instead of using a single feature map, we fuse feature maps in bottom–up and top–down ways and generate proposals from each fused feature map in FFEN. Furthermore, we further merge features generated by the region-of-interest (RoI) pooling layer in RDN. Based on the feature representation strategy, the CNN framework constructed can significantly enhance the location and semantics information for the multiscale ships, in particular for the small ships. On the other hand, the residual block is introduced to increase the network depth, through which the detection precision could be further improved. The public SAR ship dataset (SSDD) and China Gaofen-3 satellite SAR image are used to validate the proposed method. Our method shows excellent performance for detecting the multiscale and small-size ships with respect to some competitive models and exhibits high potential in practical application.

Download Full-text