A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Mathematics ◽  
2021 ◽  
Vol 9 (24) ◽  
pp. 3226
Author(s):  
Huafeng Wang ◽  
Tao Xia ◽  
Hanlin Li ◽  
Xianfeng Gu ◽  
Weifeng Lv ◽  
...  

A central challenge in action recognition is how to effectively extract and utilize the temporal and spatial information of video (especially the temporal information). To date, many researchers have proposed various spatial-temporal convolution structures. Despite their success, most models see limited further gains, especially on highly time-dependent datasets, because they fail to model the fusion relationship between spatial and temporal features inside the convolution channel. In this paper, we propose a lightweight and efficient spatial-temporal extractor, the Channel-Wise Spatial-Temporal Aggregation block (CSTA block), which can be flexibly plugged into existing 2D CNNs (yielding CSTANet). The CSTA block uses two branches to model spatial and temporal information separately. The temporal branch is equipped with a Motion Attention (MA) module that enhances the motion regions in a given video. We then introduce a Spatial-Temporal Channel Attention (STCA) module, which aggregates the spatial-temporal features of each block channel-wise in a self-adaptive, trainable way. Experimental results demonstrate that the proposed CSTANet achieves state-of-the-art results on the EGTEA Gaze+ and Diving48 datasets, and obtains competitive results on Something-Something V1&V2 at a lower computational cost.
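To make the two-branch idea concrete, below is a minimal PyTorch sketch of a block that models spatial and temporal information separately, gates the temporal path with frame-difference motion attention, and fuses the branches with a learnable per-channel weight. This is an illustrative reconstruction under stated assumptions (tensor layout (N, T, C, H, W), layer choices, module names), not the authors' exact CSTA design.

```python
import torch
import torch.nn as nn

class CSTABlockSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial branch: ordinary 2D convolution applied per frame.
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        # Temporal branch: depthwise 1D convolution over time.
        self.temporal = nn.Conv1d(channels, channels, 3, padding=1, groups=channels)
        # Motion attention: gate features by frame-to-frame differences.
        self.motion_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        # Channel-wise aggregation: a learnable per-channel mixing weight.
        self.alpha = nn.Parameter(torch.zeros(channels))

    def forward(self, x):  # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        # Spatial features, frame by frame.
        xs = self.spatial(x.reshape(n * t, c, h, w)).reshape(n, t, c, h, w)
        # Motion attention from adjacent-frame differences (zero-padded at the end).
        diff = torch.cat([x[:, 1:] - x[:, :-1], torch.zeros_like(x[:, :1])], dim=1)
        gate = self.motion_gate(diff.reshape(n * t, c, h, w)).reshape(n, t, c, h, w)
        # Temporal features over the (gated) time axis, per spatial location.
        xt = (x * gate).permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
        xt = self.temporal(xt).reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2)
        # Self-adaptive channel-wise fusion of the two branches.
        a = torch.sigmoid(self.alpha).view(1, 1, c, 1, 1)
        return a * xs + (1 - a) * xt
```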

Author(s):  
Haoze Wu ◽  
Jiawei Liu ◽  
Zheng-Jun Zha ◽  
Zhenzhong Chen ◽  
Xiaoyan Sun

Recent works use 3D convolutional neural networks to explore spatio-temporal information for human action recognition. However, they either ignore the correlation between spatial and temporal features or suffer from the high computational cost of spatio-temporal feature extraction. In this work, we propose a novel and efficient Mutually Reinforced Spatio-Temporal Convolutional Tube (MRST) for human action recognition. It decomposes 3D inputs into spatial and temporal representations, mutually enhances both of them by exploiting the interaction of spatial and temporal information, and selectively emphasizes informative spatial appearance and temporal motion, while reducing structural complexity. Moreover, we design three types of MRSTs according to the order of spatial and temporal information enhancement, each of which contains a spatio-temporal decomposition unit, a mutually reinforced unit, and a spatio-temporal fusion unit. An end-to-end deep network, MRST-Net, is also proposed based on the MRSTs to better explore spatio-temporal information in human actions. Extensive experiments show that MRST-Net yields the best performance compared to state-of-the-art approaches.
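A hedged sketch of the "decompose, mutually reinforce, fuse" pattern in PyTorch: a 3D feature map is split into a spatial (1x3x3) and a temporal (3x1x1) convolution path, each path produces a channel gate for the other, and a 1x1x1 convolution fuses the result. The specific layer choices are assumptions for illustration, not the published MRST specification.

```python
import torch
import torch.nn as nn

class MRSTSketch(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        # Spatio-temporal decomposition: a (1,3,3) spatial and a (3,1,1) temporal conv.
        self.spatial = nn.Conv3d(c, c, (1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(c, c, (3, 1, 1), padding=(1, 0, 0))
        # Mutually reinforced unit: each path produces a channel gate for the other.
        self.gate_s = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Conv3d(c, c, 1), nn.Sigmoid())
        self.gate_t = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Conv3d(c, c, 1), nn.Sigmoid())
        # Spatio-temporal fusion unit.
        self.fuse = nn.Conv3d(2 * c, c, 1)

    def forward(self, x):  # x: (N, C, T, H, W)
        s, t = self.spatial(x), self.temporal(x)
        # Temporal summary re-weights spatial channels, and vice versa.
        s = s * self.gate_t(t)
        t = t * self.gate_s(s)
        return self.fuse(torch.cat([s, t], dim=1))
```

The three MRST variants described above would differ in the order these gating steps are applied; the sketch shows only one ordering.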


Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 325
Author(s):  
Zhihao Wu ◽  
Baopeng Zhang ◽  
Tianchen Zhou ◽  
Yan Li ◽  
Jianping Fan

In this paper, we developed a practical approach for the automatic detection of discrimination actions in social images. Firstly, an image set is established in which various discrimination actions and relations are manually labeled. To the best of our knowledge, this is the first work to create a dataset for discrimination action recognition and relationship identification. Secondly, a practical approach is developed to achieve automatic detection and identification of discrimination actions and relationships from social images. Thirdly, the task of relationship identification is seamlessly integrated with the task of discrimination action recognition into a single network called the Co-operative Visual Translation Embedding++ network (CVTransE++). We also compared our proposed method with numerous state-of-the-art methods, and the experimental results demonstrate that it significantly outperforms state-of-the-art approaches.
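For context, the visual translation embedding family that CVTransE++ builds on models a relationship as a vector translation: subject + predicate ≈ object in an embedding space. Below is a minimal PyTorch sketch of that generic scoring idea; it is not the CVTransE++ architecture itself, and the projection layer, dimensions, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TranslationEmbeddingSketch(nn.Module):
    def __init__(self, feat_dim: int, num_predicates: int, emb_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)           # visual feature -> embedding
        self.predicates = nn.Embedding(num_predicates, emb_dim)

    def forward(self, subj_feat, obj_feat):
        # Score every predicate by how well subject + predicate lands on object.
        s, o = self.proj(subj_feat), self.proj(obj_feat)   # (B, emb_dim)
        trans = s.unsqueeze(1) + self.predicates.weight    # (B, P, emb_dim)
        return -torch.norm(trans - o.unsqueeze(1), dim=-1) # higher = better match
```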


2020 ◽  
Vol 10 (15) ◽  
pp. 5326
Author(s):  
Xiaolei Diao ◽  
Xiaoqiang Li ◽  
Chen Huang

The same action can take different amounts of time in different instances, and this variation affects the accuracy of action recognition to a certain extent. We propose an end-to-end deep neural network called "Multi-Term Attention Networks" (MTANs), which solves the above problem by extracting temporal features at different time scales. The network consists of a Multi-Term Attention Recurrent Neural Network (MTA-RNN) and a Spatio-Temporal Convolutional Neural Network (ST-CNN). In the MTA-RNN, a method for fusing multi-term temporal features is proposed to extract the temporal dependencies of different time scales, and the weighted fused temporal feature is recalibrated by an attention mechanism. Ablation studies show that the network has powerful spatio-temporal dynamic modeling capabilities for actions at different time scales. We perform extensive experiments on four challenging benchmark datasets: the NTU RGB+D, UT-Kinect, Northwestern-UCLA, and UWA3DII datasets. Our method achieves better results than state-of-the-art methods, which demonstrates the effectiveness of MTANs.
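A minimal sketch of the multi-term fusion idea: the same sequence is summarized over several window lengths ("terms"), and an attention vector recalibrates the terms before they are merged. This illustrates multi-scale temporal modeling only; the real MTA-RNN uses recurrent layers and a different fusion, so every detail here (pooling in place of recurrence, term lengths, shapes) is an assumption.

```python
import torch
import torch.nn as nn

class MultiTermFusionSketch(nn.Module):
    def __init__(self, dim: int, terms=(1, 3, 5)):
        super().__init__()
        # One temporal summarizer per term length (odd kernels keep length T).
        self.pools = nn.ModuleList(
            nn.AvgPool1d(k, stride=1, padding=k // 2) for k in terms
        )
        self.attn = nn.Sequential(nn.Linear(dim, len(terms)), nn.Softmax(dim=-1))

    def forward(self, x):  # x: (B, T, D) sequence of frame features
        # One temporal summary per term length.
        feats = [p(x.transpose(1, 2)).transpose(1, 2) for p in self.pools]  # each (B, T, D)
        stacked = torch.stack(feats, dim=2)                 # (B, T, n_terms, D)
        # Attention weights from the raw features recalibrate each term.
        w = self.attn(x).unsqueeze(-1)                      # (B, T, n_terms, 1)
        return (stacked * w).sum(dim=2)                     # weighted fusion, (B, T, D)
```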


2020 ◽  
Vol 34 (07) ◽  
pp. 11966-11973
Author(s):  
Hao Shao ◽  
Shengju Qian ◽  
Yu Liu

For a long time, the vision community has tried to learn spatio-temporal representations by combining convolutional neural networks with various temporal models, such as Markov chains, optical flow, RNNs, and temporal convolutions. However, these pipelines consume enormous computing resources due to the alternating learning process for spatial and temporal information. One natural question is whether we can embed the temporal information into the spatial one so that the information in the two domains can be jointly learned in a single pass. In this work, we answer this question by presenting a simple yet powerful operator: the temporal interlacing network (TIN). Instead of learning temporal features, TIN fuses the two kinds of information by interlacing spatial representations from the past to the future, and vice versa. A differentiable interlacing target can be learned to control the interlacing process. In this way, a heavy temporal model is replaced by a simple interlacing operator. We theoretically prove that with a learnable interlacing target, TIN performs equivalently to the regularized temporal convolution network (r-TCN), but gains 4% more accuracy with 6x less latency on 6 challenging benchmarks. These results push the state-of-the-art performance of video understanding by a considerable margin. Not surprisingly, the ensemble model of the proposed TIN won 1st place in the ICCV19 Multi Moments in Time challenge. Code is made available to facilitate further research.
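Interlacing mixes information across time by shifting groups of channels forward or backward along the temporal axis, so a plain 2D convolution afterwards sees a blend of past and future. The sketch below uses a fixed integer shift for clarity; TIN itself learns a differentiable, fractional shift offset, so this is an illustration of the operator, not the paper's code.

```python
import torch

def interlace_sketch(x: torch.Tensor, fold: int = 8) -> torch.Tensor:
    """x: (N, T, C, H, W). Shift C/fold channels to the past, C/fold to the future."""
    c = x.size(2)
    out = torch.zeros_like(x)
    f = c // fold
    out[:, 1:, :f] = x[:, :-1, :f]             # first group: carried forward in time
    out[:, :-1, f:2 * f] = x[:, 1:, f:2 * f]   # second group: pulled from the future
    out[:, :, 2 * f:] = x[:, :, 2 * f:]        # remaining channels untouched
    return out

# Example: a batch of 2 clips, 8 frames, 64 channels.
clip = torch.randn(2, 8, 64, 14, 14)
mixed = interlace_sketch(clip)
```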


Author(s):  
C. Indhumathi ◽  
V. Murugan ◽  
G. Muthulakshmii

Action recognition has gained increasing attention from the computer vision community. Typically, spatial and temporal features are extracted to recognize human actions, and two-stream convolutional neural networks are commonly used for human action recognition in videos. In this paper, an Adaptive motion Attentive Correlated Temporal Feature (ACTF) is used as the temporal feature extractor. Inter-frame temporal average pooling is used to extract the inter-frame regional correlation feature and the mean feature. The proposed method achieves accuracies of 96.9% on the UCF101 and 74.6% on the HMDB51 datasets, higher than other state-of-the-art methods.
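A hedged sketch of what inter-frame pooling of this kind could look like: adjacent frames are compared to yield a regional correlation map, and features are averaged over time to yield a mean feature. This is one plausible reading of the description above, not the published ACTF implementation.

```python
import torch

def inter_frame_features(x: torch.Tensor):
    """x: (N, T, C, H, W) -> (correlation, mean) features."""
    # Regional correlation between each pair of adjacent frames, averaged over time.
    corr = (x[:, 1:] * x[:, :-1]).mean(dim=1)    # (N, C, H, W)
    # Plain temporal mean feature.
    mean = x.mean(dim=1)                         # (N, C, H, W)
    return corr, mean
```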


2017 ◽  
Vol 26 (03) ◽  
pp. 1750015 ◽  
Author(s):  
Sotiris Batsakis ◽  
Ilias Tachmazidis ◽  
Grigoris Antoniou

Representation of temporal and spatial information for the Semantic Web often involves qualitatively defined information (i.e., information described using natural language terms such as "before" or "overlaps"), since precise dates or coordinates are not always available. This work proposes several temporal representations for time points and intervals and spatial topological representations in ontologies by means of OWL properties and reasoning rules in SWRL. All representations are fully compliant with existing Semantic Web standards and W3C recommendations. Although qualitative representations for temporal interval and point relations and spatial topological relations exist, this is the first work proposing representations combining qualitative and quantitative information for the Semantic Web. In addition, several existing and proposed approaches are compared using different reasoners, and experimental results are presented in detail. The proposed approach is applied to topological relations (RCC5 and RCC8), supporting both qualitative and quantitative (i.e., coordinate-based) spatial relations. Experimental results illustrate that reasoning performance differs greatly between different representations and reasoners. To the best of our knowledge, this is the first such experimental evaluation of both qualitative and quantitative Semantic Web temporal and spatial representations. In addition to the above, querying performance using SPARQL is evaluated. Evaluation results demonstrate that extracting qualitative relations from quantitative representations using reasoning rules and querying qualitative relations, instead of directly querying quantitative representations, increases performance at query time.
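The performance result hinges on materializing qualitative relations from quantitative data once, then querying the qualitative facts. The paper does this with OWL properties and SWRL rules; the plain-Python function below only mirrors that logic for a subset of Allen's interval relations, as a worked illustration.

```python
def allen_relation(a_start, a_end, b_start, b_end):
    """Qualitative Allen relation of interval A with respect to interval B (subset)."""
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    if b_start < a_start and a_end < b_end:
        return "during"
    # starts, finishes, and the inverse relations follow the same pattern.
    return "other"

# Materializing such relations ahead of time lets a SPARQL query match a single
# triple (e.g., :A :before :B) instead of comparing coordinates at query time.
```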


2020 ◽  
Vol 9 (9) ◽  
pp. 538 ◽  
Author(s):  
Wenchao Li ◽  
Xin Liu ◽  
Chenggang Yan ◽  
Guiguang Ding ◽  
Yaoqi Sun ◽  
...  

The rapidly growing location-based social network (LBSN) has become a promising platform for studying users’ mobility patterns. Many online applications can be built based on such studies, among which, recommending locations is of particular interest. Previous studies have shown the importance of spatial and temporal influences on location recommendation; however, most existing approaches build a universal spatial–temporal model for all users despite the fact that users always demonstrate heterogeneous check-in behavior patterns. In order to realize truly personalized location recommendations, we propose a Gaussian process based model for each user to systematically and non-linearly combine temporal and spatial information to predict the user’s displacement from their currently checked-in location to the next one. The locations whose distances to the user’s current checked-in location are the closest to the predicted displacement are recommended. We also propose an enhancement to take into account category information of locations for semantic-aware recommendation. A unified recommendation framework called spatial–temporal–semantic (STS) is introduced to combine displacement prediction and the semantic-aware enhancement to provide final top-N recommendation. Extensive experiments over real datasets show that the proposed STS framework significantly outperforms the state-of-the-art location recommendation models in terms of precision and mean reciprocal rank (MRR).
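A minimal sketch of the displacement-prediction idea with scikit-learn: a per-user Gaussian process maps temporal and spatial context features to the expected displacement to the next check-in. The feature choice here (hour of day, day of week, previous displacement) and the synthetic data are assumptions for illustration only; the paper's actual feature design may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.random((200, 3))   # [hour/24, weekday/7, previous displacement], synthetic
y = rng.random(200)        # observed displacement to the next check-in (km), synthetic

# One GP per user: non-linear combination of temporal and spatial context.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

# Predicted displacement for a new context; candidate locations whose distance
# from the current check-in is closest to this value would be ranked first.
pred, std = gp.predict([[0.5, 0.3, 0.9]], return_std=True)
```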


2021 ◽  
Vol 13 (16) ◽  
pp. 3147
Author(s):  
Ziqiang Hua ◽  
Xiaorun Li ◽  
Jianfeng Jiang ◽  
Liaoying Zhao

Convolution-based autoencoder networks have yielded promising performance in exploiting spatial-contextual signatures for spectral unmixing. However, the extracted spectral and spatial features of some networks are aggregated, which makes it difficult to balance their effects on the unmixing results. In this paper, we propose two gated autoencoder networks that adaptively control the contribution of spectral and spatial features in the unmixing process. A gating mechanism is adopted in the networks to filter and regularize spatial features, yielding an unmixing algorithm that is based on spectral information and supplemented by spatial information. In addition, abundance sparsity regularization and gating regularization are introduced to ensure an appropriate solution. Experimental results validate the superiority of the proposed method over state-of-the-art techniques in both synthetic and real-world scenes. This study confirms the effectiveness of the gating mechanism in improving the accuracy and efficiency of utilizing spatial signatures for spectral unmixing.
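A hedged PyTorch sketch of the gating idea: spatial-context features pass through a learned sigmoid gate before being added to the per-pixel spectral features, so the spectral path stays primary and the spatial path is filtered. The layer sizes, the single-encoder layout, and the softmax abundance head are simplified assumptions, not the paper's two proposed networks.

```python
import torch
import torch.nn as nn

class GatedUnmixingEncoderSketch(nn.Module):
    def __init__(self, bands: int, endmembers: int):
        super().__init__()
        self.spectral = nn.Conv2d(bands, endmembers, 1)             # per-pixel spectra
        self.spatial = nn.Conv2d(bands, endmembers, 3, padding=1)   # local context
        self.gate = nn.Sequential(nn.Conv2d(bands, endmembers, 3, padding=1), nn.Sigmoid())

    def forward(self, x):  # x: (N, bands, H, W) hyperspectral patch
        spec = self.spectral(x)
        spat = self.spatial(x) * self.gate(x)        # gated spatial contribution
        # Softmax enforces the abundance sum-to-one constraint.
        return torch.softmax(spec + spat, dim=1)     # (N, endmembers, H, W)
```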


2020 ◽  
Vol 10 (3) ◽  
pp. 966
Author(s):  
Zeyu Jiao ◽  
Guozhu Jia ◽  
Yingjie Cai

In this study, we consider fully automated action recognition based on deep learning in the industrial environment. In contrast to most existing methods, which rely on professional knowledge to construct complex hand-crafted features, or use only basic deep-learning methods such as convolutional neural networks (CNNs) to extract information from images of the production process, we exploit a novel and effective method that integrates multiple deep-learning networks, including CNNs, spatial transformer networks (STNs), and graph convolutional networks (GCNs), to process video data in industrial workflows. The proposed method extracts both spatial and temporal information from video data. The spatial information is extracted by estimating the human pose in each frame, yielding a skeleton image of the human body per frame. Multi-frame skeleton images are then processed by the GCN to obtain temporal information, so that action recognition results are predicted automatically. After training on the large-scale human action dataset Kinetics, we apply the proposed method to a real-world industrial environment and achieve superior performance compared with existing methods.
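A compact sketch of the skeleton stage of such a pipeline: per-frame joint coordinates (e.g., from an off-the-shelf pose estimator) are stacked into a sequence and passed through a simple graph convolution over the skeleton's adjacency, followed by temporal pooling. This mirrors the pose-to-GCN flow in spirit only; the adjacency, layer sizes, and class count below are toy assumptions.

```python
import torch
import torch.nn as nn

class SkeletonGCNSketch(nn.Module):
    def __init__(self, adjacency: torch.Tensor, in_dim=2, hid=64, classes=10):
        super().__init__()
        # Row-normalized adjacency with self-loops, fixed for all frames.
        a = adjacency + torch.eye(adjacency.size(0))
        self.register_buffer("A", a / a.sum(dim=1, keepdim=True))
        self.gc = nn.Linear(in_dim, hid)
        self.head = nn.Linear(hid, classes)

    def forward(self, x):  # x: (N, T, J, 2) joint coordinates per frame
        h = torch.relu(self.A @ self.gc(x))   # graph conv over the J joints
        h = h.mean(dim=(1, 2))                # pool over time and joints
        return self.head(h)

# Example: an 18-joint skeleton over 16 frames (toy adjacency, one bone shown).
A = torch.zeros(18, 18); A[0, 1] = A[1, 0] = 1.0
model = SkeletonGCNSketch(A)
logits = model(torch.randn(4, 16, 18, 2))
```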

