Multi-view Clustering of Visual Words Using Canonical Correlation Analysis for Human Action Recognition

Author(s):  
Behrouz Saghafi ◽  
Deepu Rajan
2021 ◽  
Author(s):  
Nour Elmadany

This thesis presents three frameworks for human action recognition, each aimed at improving recognition performance. The first framework fuses handcrafted features from four modalities: RGB, depth, skeleton, and accelerometer data. In addition, a new descriptor for skeleton data is proposed that provides a discriminative representation of the poses of an action. Since the goal of the first framework is to find a more discriminative subspace, a generalized fusion technique, Multimodal Hybrid Centroid Canonical Correlation Analysis (MHCCCA), is proposed for two or more sets of features or modalities. The second framework fuses handcrafted and deep learning features from three modalities: RGB, depth, and skeleton. In this framework, a new depth representation is introduced whose final representation is extracted using a deep ConvNet. The proposed fusion technique, Multiset Globality Locality Preserving Canonical Correlation Analysis (MGLPCCA), forms the backbone of the framework and applies to two or more sets of features or modalities. MGLPCCA aims to preserve the local and global structures of the data while maximizing the correlation among different modalities or sets. The third framework uses deep learning techniques to improve long-term temporal modelling through two proposed techniques: Temporal Relational Network (TRN) and Temporal Second Order Pooling Based Network (T-SOPN). Additionally, Global-Local Network (GLN) and Fuse-Inception Network (FIN) are proposed to encourage the network to learn complementary information about the action and the scene itself. Qualitative and quantitative experiments are conducted on nine different datasets, demonstrating the effectiveness of the proposed frameworks over state-of-the-art methods.
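
The fusion techniques named in this abstract build on canonical correlation analysis (CCA). As a rough illustration only, the sketch below implements plain two-view CCA with NumPy and fuses the projected views by concatenation. It is not MHCCCA or MGLPCCA, whose formulations are not given here; the function names `cca` and `fuse`, the regularizer `reg`, and the concatenation step are all assumptions made for the example.

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Plain two-view CCA (illustrative sketch, not MHCCCA/MGLPCCA).

    X: (n, dx) and Y: (n, dy) hold paired samples from two modalities.
    Returns projections Wx, Wy and the top-k canonical correlations.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized within-view covariances and the cross-covariance.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the correlations.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    Wx = Kx @ U[:, :k]
    Wy = Ky @ Vt.T[:, :k]
    return Wx, Wy, s[:k]

def fuse(X, Y, Wx, Wy):
    """Concatenate the two projected views (one simple fusion choice)."""
    return np.hstack([(X - X.mean(axis=0)) @ Wx, (Y - Y.mean(axis=0)) @ Wy])
```

In this sketch, `X` and `Y` would be feature matrices from two modalities (for example, skeleton and depth descriptors) with one row per sample. The fused output could then be passed to any classifier.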


2014 ◽  
Vol 599-601 ◽  
pp. 1571-1574
Author(s):  
Jia Ding ◽  
Yang Yi ◽  
Ze Min Qiu ◽  
Jun Shi Liu

Human action recognition in videos plays an important role in computer vision and image understanding. This paper proposes a novel method that combines a multi-channel bag of visual words with multiple kernel learning. Videos are described by a multi-channel bag of visual words, and a multiple kernel learning classifier is used for action classification. Each kernel function of the classifier corresponds to one video channel, which avoids noise interference from the other channels. The proposed approach improves the ability to distinguish easily confused actions. Experiments on the KTH dataset show that the method achieves a strong average recognition rate, comparable to that of state-of-the-art methods.
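
The one-kernel-per-channel idea can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the kernel or how the weights are obtained, so the histogram intersection kernel, the fixed `weights` argument, and the function names are hypothetical. A real multiple kernel learning classifier would learn the weights jointly with the classifier rather than take them as input.

```python
import numpy as np

def intersection_kernel(A, B):
    """Histogram intersection kernel between rows of A (n, d) and B (m, d)."""
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=-1)

def combined_kernel(channels_a, channels_b, weights):
    """Weighted sum of per-channel kernels (hypothetical helper).

    channels_a, channels_b: lists of bag-of-visual-words histogram matrices,
    one matrix per video channel, with matching channel order.
    weights: one non-negative weight per channel; fixed here as an
    assumption, whereas MKL would learn them.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1
    K = np.zeros((channels_a[0].shape[0], channels_b[0].shape[0]))
    for wc, A, B in zip(w, channels_a, channels_b):
        # Each channel contributes only through its own kernel.
        K += wc * intersection_kernel(A, B)
    return K
```

The resulting matrix could be passed to any classifier that accepts a precomputed kernel.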

