Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition

Author(s):  
Deepti Ghadiyaram ◽  
Du Tran ◽  
Dhruv Mahajan


2021 ◽  
Vol 11 (10) ◽  
pp. 4426
Author(s):  
Chunyan Ma ◽  
Ji Fan ◽  
Jinghao Yao ◽  
Tao Zhang

Computer vision-based action recognition of basketball players in basketball training and competition has gradually become a research hotspot. However, owing to complex technical actions, diverse backgrounds, and limb occlusion, it remains a challenging task without effective solutions or public dataset benchmarks. In this study, we defined 32 kinds of atomic actions covering most of the complex actions of basketball players and built the NPU RGB+D dataset (a large-scale basketball action recognition dataset with RGB image data and depth data captured at Northwestern Polytechnical University) for 12 kinds of actions of 10 professional basketball players, comprising 2169 RGB+D videos and 75 thousand frames, including RGB frame sequences, depth maps, and skeleton coordinates. By extracting spatial features from the distances and angles between the joint points of basketball players, we created a new feature-enhanced skeleton-based method for basketball player action recognition, called LSTM-DGCN, which combines a deep graph convolutional network (DGCN) with long short-term memory (LSTM). Many advanced action recognition methods were evaluated on our dataset and compared with our proposed method. The experimental results show that the NPU RGB+D dataset remains challenging for current action recognition algorithms and that our LSTM-DGCN outperforms the state-of-the-art action recognition methods on various evaluation criteria on our dataset. Our action taxonomy and the NPU RGB+D dataset are valuable resources for basketball player action recognition. The feature-enhanced LSTM-DGCN recognizes actions more accurately because the added features improve the expressiveness of the skeleton data.
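The abstract does not spell out how the distance and angle features are computed; the sketch below shows one plausible reading, where the joint `pairs` and angle `triples` are hypothetical index selections rather than the paper's actual feature set.

```python
import numpy as np

def joint_geometry_features(skeleton, pairs, triples):
    """Per-frame geometric features from a (J, 3) array of joint coordinates:
    Euclidean distances for selected joint pairs, and angles (measured at the
    middle joint) for selected joint triples."""
    dists = [np.linalg.norm(skeleton[i] - skeleton[j]) for i, j in pairs]
    angles = []
    for i, j, k in triples:
        v1 = skeleton[i] - skeleton[j]  # bone vector from joint j to joint i
        v2 = skeleton[k] - skeleton[j]  # bone vector from joint j to joint k
        cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(dists + angles)  # concatenated feature vector
```

Features of this kind would presumably be concatenated with the raw skeleton input before the DGCN and LSTM stages.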


2020 ◽  
Vol 34 (07) ◽  
pp. 12862-12869
Author(s):  
Shiwen Zhang ◽  
Sheng Guo ◽  
Limin Wang ◽  
Weilin Huang ◽  
Matthew Scott

In this work, we propose Knowledge Integration Networks (referred to as KINet) for video action recognition. KINet is capable of aggregating meaningful context features that are important for identifying an action, such as human information and scene context. We design a three-branch architecture consisting of a main branch for action recognition and two auxiliary branches for human parsing and scene recognition, which allow the model to encode knowledge about humans and scenes for action recognition. We explore two pre-trained models as teacher networks to distill human and scene knowledge for training the auxiliary tasks of KINet. Furthermore, we propose a two-level knowledge encoding mechanism that contains a Cross Branch Integration (CBI) module for encoding the auxiliary knowledge into medium-level convolutional features, and an Action Knowledge Graph (AKG) for effectively fusing high-level context information. This results in an end-to-end trainable framework in which the three tasks can be trained collaboratively, allowing the model to compute strong context knowledge efficiently. The proposed KINet achieves state-of-the-art performance on the large-scale Kinetics-400 action recognition benchmark, with a top-1 accuracy of 77.8%. We further demonstrate that KINet transfers well: the Kinetics-trained model obtains 97.8% top-1 accuracy on UCF-101.
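The exact CBI design is defined in the paper; as a rough illustration of cross-branch fusion at the feature-map level, the hypothetical module below projects auxiliary features to the main branch's channel width and adds them residually.

```python
import torch.nn as nn

class CrossBranchIntegration(nn.Module):
    """Illustrative sketch of CBI-style fusion (an assumption, not the paper's
    exact design): mid-level features from an auxiliary branch (human parsing
    or scene recognition) are folded into the main action branch."""

    def __init__(self, aux_channels, main_channels):
        super().__init__()
        # A 1x1 convolution aligns the auxiliary channel width with the main branch.
        self.project = nn.Conv2d(aux_channels, main_channels, kernel_size=1)

    def forward(self, main_feat, aux_feat):
        # Residual fusion; assumes both feature maps share spatial dimensions.
        return main_feat + self.project(aux_feat)
```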


2020 ◽  
Vol 34 (03) ◽  
pp. 2669-2676 ◽  
Author(s):  
Wei Peng ◽  
Xiaopeng Hong ◽  
Haoyu Chen ◽  
Guoying Zhao

Human action recognition from skeleton data, fuelled by the Graph Convolutional Network (GCN) with its powerful capability of modeling non-Euclidean data, has attracted much attention. However, many existing GCNs use a pre-defined graph structure and share it throughout the entire network, which can lose implicit joint correlations, especially in higher-level features. Besides, the mainstream spectral GCN is approximated with first-order hops, so higher-order connections are not well captured. Addressing these issues by hand requires considerable effort in GCN architecture design. To tackle them automatically, we turn to Neural Architecture Search (NAS) and propose the first automatically designed GCN for this task. Specifically, we explore the spatial-temporal correlations between nodes and build a search space with multiple dynamic graph modules. We also introduce multiple-hop modules, expecting to break the limitation on representational capacity caused by the first-order approximation. Moreover, a corresponding sampling- and memory-efficient evolution strategy is proposed to search this space. The resulting architecture demonstrates the effectiveness of the higher-order approximation and the layer-wise dynamic graph modules. To evaluate the searched model, we conduct extensive experiments on two very large-scale skeleton-based action recognition datasets. The results show that our model achieves state-of-the-art results on the given metrics.
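As a concrete picture of what a multiple-hop module buys over the first-order approximation, here is a generic multi-hop graph convolution (an illustration, not the searched architecture): each hop k aggregates features over the k-th power of the normalized adjacency matrix.

```python
import torch
import torch.nn as nn

class MultiHopGraphConv(nn.Module):
    """Generic multi-hop graph convolution: instead of the usual first-order
    propagation X' = A_hat @ X @ W, features are aggregated from k-hop
    neighbourhoods via powers of the normalized adjacency matrix `adj`
    (e.g. D^{-1/2}(A + I)D^{-1/2}), each with its own weight matrix."""

    def __init__(self, in_dim, out_dim, adj, max_hops=3):
        super().__init__()
        hops = [torch.matrix_power(adj, k) for k in range(1, max_hops + 1)]
        self.register_buffer("hops", torch.stack(hops))  # (K, N, N)
        self.weights = nn.ModuleList(
            nn.Linear(in_dim, out_dim, bias=False) for _ in range(max_hops)
        )

    def forward(self, x):                      # x: (B, N, in_dim)
        out = 0
        for a_k, w_k in zip(self.hops, self.weights):
            out = out + w_k(a_k @ x)           # aggregate the k-hop neighbourhood
        return torch.relu(out)
```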


Electronics ◽  
2019 ◽  
Vol 8 (10) ◽  
pp. 1072 ◽  
Author(s):  
Shifeng Xia ◽  
Jiexian Zeng ◽  
Lu Leng ◽  
Xiang Fu

Recently, convolutional neural networks (CNNs) have achieved great success in scene recognition. Compared with traditional hand-crafted features, CNNs can extract more robust and generalized features for scene recognition. However, existing CNN-based scene recognition methods do not sufficiently take into account the relationship between image regions and categories when choosing local regions, which results in many redundant local regions and degrades recognition accuracy. In this paper, we propose an effective method for exploring discriminative regions of a scene image. Our method utilizes the gradient-weighted class activation mapping (Grad-CAM) technique and weakly supervised information to generate the attention map (AM) of scene images, dubbed WS-AM (weakly supervised attention map). Regions where both the local mean and the local center value of the AM are large correspond to the discriminative regions helpful for scene recognition. We sampled discriminative regions at multiple scales and extracted the features of large-scale and small-scale regions with two different pre-trained CNNs, respectively. The features from the two scales were aggregated by improved vector of locally aggregated descriptors (VLAD) coding and max pooling, respectively. Finally, a pre-trained CNN was used to extract the global feature of the image at the fully-connected (fc) layer, and the local features were combined with the global feature to obtain the image representation. We validated the effectiveness of our method on three benchmark datasets: MIT Indoor 67, Scene 15, and UIUC Sports, obtaining 85.67%, 94.80%, and 95.12% accuracy, respectively. Compared with some state-of-the-art methods, WS-AM requires fewer local regions, so it offers better real-time performance.
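The region-selection rule above (keep windows whose local mean and center value are both large) can be stated in a few lines; the window size and thresholds below are illustrative choices, not the paper's settings.

```python
import numpy as np

def select_discriminative_regions(att_map, win=7, mean_thr=0.5, center_thr=0.5):
    """Return (row, col) centers of windows in the attention map whose local
    mean AND center value both exceed their thresholds."""
    h, w = att_map.shape
    half = win // 2
    regions = []
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch = att_map[y - half:y + half + 1, x - half:x + half + 1]
            if patch.mean() >= mean_thr and att_map[y, x] >= center_thr:
                regions.append((y, x))
    return regions
```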


2020 ◽  
Vol 34 (09) ◽  
pp. 13610-13611
Author(s):  
Oktie Hassanzadeh ◽  
Debarun Bhattacharjya ◽  
Mark Feblowitz ◽  
Kavitha Srinivas ◽  
Michael Perrone ◽  
...  

In this demonstration, we present a system for mining causal knowledge from large corpora of text documents, such as millions of news articles. Our system provides a collection of APIs for causal analysis and retrieval. These APIs enable searching for the effects of a given cause and the causes of a given effect, as well as analyzing whether a causal relation exists between a given pair of phrases. The analysis includes a score that indicates the likelihood of a causal relation, along with evidence from the input corpus supporting it. Our system uses generic unsupervised and weakly supervised methods of causal relation extraction that do not impose semantic constraints on causes and effects. We show example use cases developed for a commercial application in enterprise risk management.
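The demonstration's actual API surface is not specified in the abstract; the snippet below only illustrates the three query patterns it describes, with invented endpoint names and response fields.

```python
import requests

BASE = "https://causal-kb.example.com/api"  # placeholder endpoint, not the real system

# 1. Retrieve effects of a given cause.
effects = requests.get(f"{BASE}/effects", params={"cause": "oil price spike"}).json()

# 2. Retrieve causes of a given effect.
causes = requests.get(f"{BASE}/causes", params={"effect": "supply chain disruption"}).json()

# 3. Score a candidate cause-effect pair and fetch supporting evidence.
analysis = requests.get(
    f"{BASE}/relation",
    params={"cause": "factory fire", "effect": "production delay"},
).json()
print(analysis["score"], analysis["evidence"][:3])  # hypothetical fields
```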


Author(s):  
Junnan Li ◽  
Jianquan Liu ◽  
Yongkang Wang ◽  
Shoji Nishimura ◽  
Mohan S. Kankanhalli

2020 ◽  
Vol 34 (03) ◽  
pp. 2677-2684
Author(s):  
Marjaneh Safaei ◽  
Pooyan Balouchian ◽  
Hassan Foroosh

Action recognition in still images poses a great challenge due to (i) less available training data and (ii) the absence of temporal information. To address the first challenge, we introduce UCF-STAR, a dataset for STill image Action Recognition containing over 1M images across 50 different human body-motion action categories. UCF-STAR is the largest dataset in the literature for action recognition in still images. The key characteristics of UCF-STAR include (1) focusing on human body-motion rather than relatively static human-object interaction categories, (2) collecting images from the wild to benefit from a varied set of action representations, (3) appending multiple human-annotated labels per image rather than just the action label, and (4) inclusion of a rich, structured, and multi-modal set of metadata for each image. This departs from existing datasets, which typically provide a single annotation over a smaller number of images and categories, with no metadata. UCF-STAR exposes the intrinsic difficulty of action recognition through its realistic scene and action complexity. To benchmark and demonstrate the benefits of UCF-STAR as a large-scale dataset, and to show the role of “latent” motion information in recognizing human actions in still images, we present a novel approach relying on predicting temporal information, yielding higher accuracy on 5 widely-used datasets.


Author(s):  
Yu-Hui Wen ◽  
Lin Gao ◽  
Hongbo Fu ◽  
Fang-Lue Zhang ◽  
Shihong Xia

The hierarchical structure and the different semantic roles of joints in the human skeleton convey important information for action recognition. Conventional graph convolution methods for modeling skeleton structure consider only the physically connected neighbors of each joint and the joints of the same type, thus failing to capture high-order information. In this work, we propose a novel model with motif-based graph convolution to encode hierarchical spatial structure, and a variable temporal dense block to exploit local temporal information over different ranges of human skeleton sequences. Moreover, we employ a non-local block to capture global dependencies in the temporal domain through an attention mechanism. Our model achieves improvements over the state-of-the-art methods on two large-scale datasets.
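The non-local block referenced here is a standard self-attention construction; a minimal temporal variant (the paper's exact instantiation may differ) looks like this.

```python
import torch
import torch.nn as nn

class NonLocalTemporal(nn.Module):
    """Minimal non-local (self-attention) block over the temporal axis:
    every frame attends to every other frame, capturing global temporal
    dependencies, with a residual connection back to the input."""

    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim)  # query projection
        self.phi = nn.Linear(dim, dim)    # key projection
        self.g = nn.Linear(dim, dim)      # value projection
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (B, T, dim)
        attn = torch.softmax(
            self.theta(x) @ self.phi(x).transpose(1, 2) / x.shape[-1] ** 0.5,
            dim=-1,
        )                                  # (B, T, T) pairwise frame weights
        return x + self.out(attn @ self.g(x))  # residual global aggregation
```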


Sensors ◽  
2020 ◽  
Vol 20 (11) ◽  
pp. 3305 ◽  
Author(s):  
Huogen Wang ◽  
Zhanjie Song ◽  
Wanqing Li ◽  
Pichao Wang

The paper presents a novel hybrid network for large-scale action recognition from multiple modalities. The network is built upon the proposed weighted dynamic images. It effectively leverages the strengths of the emerging Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based approaches to specifically address the challenges that occur in large-scale action recognition and are not fully dealt with by state-of-the-art methods. Specifically, the proposed hybrid network consists of a CNN-based component and an RNN-based component. Features extracted by the two components are fused through canonical correlation analysis and then fed to a linear Support Vector Machine (SVM) for classification. The proposed network achieved state-of-the-art results on the ChaLearn LAP IsoGD, NTU RGB+D, and Multi-modal & Multi-view & Interactive (M²I) datasets and outperformed existing methods by a large margin (over 10 percentage points in some cases).
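The fusion pipeline (canonical correlation analysis followed by a linear SVM) maps directly onto standard library calls; the sketch below uses random stand-ins for the CNN- and RNN-branch features, so the shapes and dimensions are illustrative only.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

# Toy stand-ins for features from the two trained components.
rng = np.random.default_rng(0)
cnn_feats = rng.normal(size=(500, 256))
rnn_feats = rng.normal(size=(500, 128))
labels = rng.integers(0, 10, size=500)

# Project both feature sets into a shared, maximally correlated subspace.
cca = CCA(n_components=64)
cnn_proj, rnn_proj = cca.fit_transform(cnn_feats, rnn_feats)

# Concatenate the correlated projections and classify with a linear SVM.
fused = np.concatenate([cnn_proj, rnn_proj], axis=1)
clf = LinearSVC().fit(fused, labels)
print("train accuracy:", clf.score(fused, labels))
```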


2019 ◽  
Vol 1 (11) ◽  
pp. 530-537 ◽  
Author(s):  
Piotr Antonik ◽  
Nicolas Marsal ◽  
Daniel Brunner ◽  
Damien Rontani
