Histogram of Oriented Gradient-Based Fusion of Features for Human Action Recognition in Action Video Sequences

Chirag I. Patel; Dileep Labana; Sharnil Pandya; Kirit Modi; Hemant Ghayvat; Muhammad Awais

doi:10.3390/s20247299

Sensors ◽

10.3390/s20247299 ◽

2020 ◽

Vol 20 (24) ◽

pp. 7299

Author(s):

Chirag I. Patel ◽

Dileep Labana ◽

Sharnil Pandya ◽

Kirit Modi ◽

Hemant Ghayvat ◽

...

Keyword(s):

Neural Network ◽

Action Recognition ◽

State Of The Art ◽

Human Action Recognition ◽

Human Action ◽

Moving Object ◽

Video Sequences ◽

Feature Descriptor ◽

Histogram Of Oriented Gradient ◽

Fusion Technique

Human Action Recognition (HAR) is the classification of an action performed by a human. The goal of this study was to recognize human actions in action video sequences. We present a novel feature descriptor for HAR that involves multiple features and combines them using a fusion technique. The major focus of the feature descriptor is to exploit the dissimilarities between actions. The key contribution of the proposed approach is to build a robust feature descriptor that works across the underlying video sequences and across various classification models. To achieve this objective, HAR is performed in the following manner. First, the moving object is detected and segmented from the background. Features are then calculated from the segmented moving object using the histogram of oriented gradient (HOG). To reduce the feature descriptor size, we average the HOG features across non-overlapping video frames. For frequency domain information, we calculate regional features from the Fourier HOG. We also include the velocity and displacement of the moving object. Finally, we use a fusion technique to combine these features. Once the feature descriptor is prepared, it is provided to the classifier. Here, we have used well-known classifiers such as artificial neural networks (ANNs), support vector machine (SVM), multiple kernel learning (MKL), Meta-cognitive Neural Network (McNN), and the late fusion methods. The main objective of the proposed approach is to prepare a robust feature descriptor and to show the diversity of our feature descriptor. Though we are using five different classifiers, our feature descriptor performs relatively well across the various classifiers.
The proposed approach is evaluated and compared with the state-of-the-art methods for action recognition on two publicly available benchmark datasets (KTH and Weizmann) and for cross-validation on the UCF11 dataset, HMDB51 dataset, and UCF101 dataset. Results of the control experiments, such as a change in the SVM classifier and the effects of the second hidden layer in ANN, are also reported. The results demonstrate that the proposed method performs reasonably well compared with the majority of existing state-of-the-art methods, including the convolutional neural network-based feature extractors.
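
The pipeline described in this abstract (per-frame gradient histograms, averaging over non-overlapping groups of frames, then fusion with motion cues) can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: a single whole-frame orientation histogram stands in for a full cell-and-block HOG, fusion is taken to be plain concatenation, and the group size, bin count, and all function names are hypothetical. The Fourier HOG regional features are omitted.

```python
import numpy as np

def orientation_histogram(frame, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (simplified HOG)."""
    gy, gx = np.gradient(frame.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in degrees
    hist, _ = np.histogram(angle, bins=n_bins, range=(0.0, 180.0), weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def averaged_hog(frames, group_size=5, n_bins=9):
    """Average per-frame histograms over non-overlapping groups of frames."""
    hists = np.stack([orientation_histogram(f, n_bins) for f in frames])
    n_groups = len(hists) // group_size  # trailing frames that do not fill a group are dropped
    grouped = hists[: n_groups * group_size].reshape(n_groups, group_size, n_bins)
    return grouped.mean(axis=1).ravel()

def motion_features(centroids):
    """Net displacement and mean speed (pixels per frame) of the object centroid."""
    centroids = np.asarray(centroids, dtype=float)
    steps = np.diff(centroids, axis=0)
    displacement = np.linalg.norm(centroids[-1] - centroids[0])
    mean_speed = np.linalg.norm(steps, axis=1).mean()
    return np.array([displacement, mean_speed])

def fuse(*feature_vectors):
    """Assumed fusion rule: concatenate the individual feature vectors."""
    return np.concatenate(feature_vectors)

# Synthetic stand-ins for segmented object frames and tracked centroids.
rng = np.random.default_rng(0)
frames = rng.random((20, 32, 32))
centroids = [(2.0 * i, 16.0) for i in range(20)]

descriptor = fuse(averaged_hog(frames), motion_features(centroids))
print(descriptor.shape)
```

With 20 frames in groups of 5 and 9 bins, the averaged part has 36 values instead of the 180 that per-frame histograms would give, which is the size reduction the abstract refers to.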

Feature Fusion of Deep Spatial Features and Handcrafted Spatiotemporal Features for Human Action Recognition

Sensors ◽

10.3390/s19071599 ◽

2019 ◽

Vol 19 (7) ◽

pp. 1599 ◽

Cited By ~ 6

Author(s):

Md Uddin ◽

Young-Koo Lee

Keyword(s):

Action Recognition ◽

State Of The Art ◽

Human Action Recognition ◽

Human Action ◽

Support Vector ◽

Feature Descriptor ◽

Weber's Law ◽

Spatiotemporal Features ◽

Spatial Features

Human action recognition plays a significant part in the research community due to its emerging applications. A variety of approaches have been proposed to resolve this problem; however, several issues still need to be addressed. In action recognition, effectively extracting and aggregating the spatial-temporal information plays a vital role in describing a video. In this research, we propose a novel approach to recognize human actions by considering both deep spatial features and handcrafted spatiotemporal features. Firstly, we extract the deep spatial features by employing a state-of-the-art deep convolutional network, namely Inception-Resnet-v2. Secondly, we introduce a novel handcrafted feature descriptor, namely Weber’s law based Volume Local Gradient Ternary Pattern (WVLGTP), which brings out the spatiotemporal features. It also considers the shape information by using a gradient operation. Furthermore, a Weber’s law based threshold value and a ternary pattern based on an adaptive local threshold are presented to effectively handle the noisy center pixel value. Besides, a multi-resolution approach for WVLGTP based on an averaging scheme is also presented. Afterward, both sets of extracted features are concatenated and fed to the Support Vector Machine to perform the classification. Lastly, the extensive experimental analysis shows that our proposed method outperforms state-of-the-art approaches in terms of accuracy.
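
To make the idea of a ternary pattern with a Weber's law based adaptive threshold concrete, here is a minimal sketch on a single 3 × 3 patch. It is not the WVLGTP descriptor itself, whose exact definition is not given in the abstract. It assumes only that the threshold scales with the center intensity (threshold = k × center, in the spirit of Weber's law), and the function name and the value of k are hypothetical.

```python
import numpy as np

def weber_ternary_codes(patch, k=0.1):
    """Ternary codes for the 8 neighbors of a 3x3 patch with a center-scaled threshold."""
    patch = np.asarray(patch, dtype=float)
    center = patch[1, 1]
    threshold = k * center                   # assumed Weber-style adaptive threshold
    neighbors = np.delete(patch.ravel(), 4)  # the 8 pixels around the center, row-major
    codes = np.zeros(8, dtype=int)
    codes[neighbors > center + threshold] = 1
    codes[neighbors < center - threshold] = -1
    return codes

patch = [[100, 120,  95],
         [ 80, 100, 105],
         [111,  89, 100]]
codes = weber_ternary_codes(patch)
print(codes)
```

Neighbors within 10% of the center map to 0, so small fluctuations around the center value do not flip the code, which is the noise tolerance the abstract attributes to the ternary pattern.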

End-to-end learning of deep convolutional neural network for 3D human action recognition

2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) ◽

10.1109/icmew.2017.8026281 ◽

2017 ◽

Author(s):

Chao Li ◽

Shouqian Sun ◽

Xin Min ◽

Wenqian Lin ◽

Binling Nie ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Deep Convolutional Neural Network ◽

End To End

Human action recognition based on quaternion spatial-temporal convolutional neural network and LSTM in RGB videos

Multimedia Tools and Applications ◽

10.1007/s11042-018-5893-9 ◽

2018 ◽

Vol 77 (20) ◽

pp. 26901-26918 ◽

Cited By ~ 8

Author(s):

Bo Meng ◽

XueJun Liu ◽

Xiaolin Wang

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Action Recognition ◽

Human Action Recognition ◽

Human Action

I3D-Shufflenet Based Human Action Recognition

Algorithms ◽

10.3390/a13110301 ◽

2020 ◽

Vol 13 (11) ◽

pp. 301

Author(s):

Guocheng Liu ◽

Caixia Zhang ◽

Qingyang Xu ◽

Ruoshi Cheng ◽

Yong Song ◽

...

Keyword(s):

Neural Network ◽

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Recognition Algorithm ◽

Convolution Kernel ◽

Histogram Of Oriented Gradients ◽

Temporal Features ◽

Convolution Kernels

Because optical flow-based human action recognition is difficult to apply owing to its large amount of calculation, a human action recognition algorithm, the I3D-shufflenet model, is proposed that combines the advantages of the I3D neural network and the lightweight shufflenet model. The 5 × 5 convolution kernel of I3D is replaced by two 3 × 3 convolution kernels, which reduces the amount of calculation. The shuffle layer is adopted to achieve feature exchange. Human actions are recognized and classified with the trained I3D-shufflenet model. The experimental results show that the shuffle layer improves the composition of features in each channel, which can promote the utilization of useful information. The Histogram of Oriented Gradients (HOG) spatial-temporal features of the object are extracted for training, which can significantly improve the ability of human action expression and reduce the calculation of feature extraction. The I3D-shufflenet is tested on the UCF101 dataset and compared with other models. The final result shows that the I3D-shufflenet, with an accuracy of 96.4%, has higher accuracy than the original I3D.
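
Two of the ideas in this abstract lend themselves to a short worked example: the saving from replacing one 5 × 5 kernel with two stacked 3 × 3 kernels, and the channel shuffle operation. The sketch below makes simplifying assumptions that are not in the abstract: it counts only the 2D spatial weights (no temporal dimension, no biases), uses the same channel count throughout, and implements the shuffle as the reshape-and-transpose commonly associated with ShuffleNet. The function names and the channel count of 64 are hypothetical.

```python
import numpy as np

def conv_weights(kernel, c_in, c_out):
    """Number of weights in a 2D convolution layer, ignoring biases."""
    return kernel * kernel * c_in * c_out

def channel_shuffle(x, groups):
    """Interleave channels across groups; x has shape (channels, ...)."""
    c = x.shape[0]
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    rest = x.shape[1:]
    return x.reshape(groups, c // groups, *rest).swapaxes(0, 1).reshape(c, *rest)

channels = 64  # hypothetical channel count
single_5x5 = conv_weights(5, channels, channels)
double_3x3 = 2 * conv_weights(3, channels, channels)
print(single_5x5, double_3x3, 1 - double_3x3 / single_5x5)

shuffled = channel_shuffle(np.arange(6), groups=2)
print(shuffled)
```

Two stacked 3 × 3 kernels cover the same 5 × 5 receptive field with 18 instead of 25 weights per channel pair, a 28% reduction under the assumptions above. The shuffle moves channels from different groups next to each other so that later grouped layers see features from every group.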

Human action recognition based on two-stream Ind recurrent neural network

Tenth International Conference on Graphics and Image Processing (ICGIP 2018) ◽

10.1117/12.2524322 ◽

2019 ◽

Author(s):

Penghua Ge ◽

Min Zhi

Keyword(s):

Neural Network ◽

Action Recognition ◽

Recurrent Neural Network ◽

Human Action Recognition ◽

Human Action

A Robust Deep Model for Human Action Recognition in Restricted Video Sequences

2020 43rd International Conference on Telecommunications and Signal Processing (TSP) ◽

10.1109/tsp49548.2020.9163464 ◽

2020 ◽

Author(s):

Vahid Ashkani Chenarlogh ◽

Hossein B Jond ◽

Jan Platos

Keyword(s):

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Video Sequences ◽

Deep Model

F-E3D: FPGA-based Acceleration of an Efficient 3D Convolutional Neural Network for Human Action Recognition

2019 IEEE 30th International Conference on Application-specific Systems, Architectures and Processors (ASAP) ◽

10.1109/asap.2019.00-44 ◽

2019 ◽

Cited By ~ 3

Author(s):

Hongxiang Fan ◽

Cheng Luo ◽

Chenglong Zeng ◽

Martin Ferianc ◽

Zhiqiang Que ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Action Recognition ◽

Human Action Recognition ◽

Human Action

Feature Extraction and Representation for Distributed Multi-View Human Action Recognition

IEEE Journal on Emerging and Selected Topics in Circuits and Systems ◽

10.1109/jetcas.2013.2256824 ◽

2013 ◽

Vol 3 (2) ◽

pp. 145-154 ◽

Cited By ~ 7

Author(s):

Jiajia Luo ◽

Wei Wang ◽

Hairong Qi

Keyword(s):

Action Recognition ◽

Approximation Error ◽

Human Action Recognition ◽

Human Action ◽

Base Station ◽

Feature Representation ◽

Superior Performance ◽

Feature Descriptor ◽

Testing Stage ◽

New Feature

Multi-view human action recognition has gained a lot of attention in recent years for its superior performance as compared to single view recognition. In this paper, we propose a new framework for the real-time realization of human action recognition in distributed camera networks (DCNs). We first present a new feature descriptor (Mltp-hist) that is tolerant to illumination change, robust in homogeneous regions, and computationally efficient. Taking advantage of the proposed Mltp-hist, the noninformative 3-D patches generated from the background can be removed automatically, which effectively highlights the foreground patches. Next, a new feature representation method based on sparse coding is presented to generate the histogram representation of local videos to be transmitted to the base station for classification. Due to the sparse representation of extracted features, the approximation error is reduced. Finally, at the base station, a probability model is produced to fuse the information from various views and a class label is assigned accordingly. Compared to the existing algorithms, the proposed framework has three advantages while having fewer requirements on memory and bandwidth consumption: 1) no preprocessing is required; 2) communication among cameras is unnecessary; and 3) positions and orientations of cameras do not need to be fixed. We further evaluate the proposed framework on the most popular multi-view action dataset, IXMAS. Experimental results indicate that our proposed framework repeatedly achieves state-of-the-art results when various numbers of views are tested. In addition, our approach is tolerant to various combinations of views and benefits from introducing more views at the testing stage. In particular, our results are still satisfactory even when large misalignment exists between the training and testing samples.
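
The abstract does not specify the probability model used at the base station. As one simple illustration of fusing class information from several views, the sketch below multiplies per-view class probabilities (a product rule that assumes the views are conditionally independent) and picks the most probable class. This fusion rule, the function name, and the example probabilities are assumptions for illustration only, not the paper's model.

```python
import numpy as np

def fuse_views(view_probs, eps=1e-12):
    """Product-rule fusion of per-view class probabilities, computed in log space."""
    log_p = np.log(np.clip(np.asarray(view_probs, dtype=float), eps, 1.0))
    joint = log_p.sum(axis=0)   # sum of logs equals log of the product over views
    joint -= joint.max()        # shift for numerical stability before exponentiating
    posterior = np.exp(joint)
    return posterior / posterior.sum()

# Hypothetical class probabilities from three cameras for three action classes.
views = [[0.5, 0.3, 0.2],
         [0.2, 0.6, 0.2],
         [0.3, 0.5, 0.2]]
fused = fuse_views(views)
label = int(np.argmax(fused))
print(fused, label)
```

Because the function accepts any number of rows, the same call works when views are added or dropped at test time, which matches the setting described in the abstract where the number of views varies.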

Human action recognition with hidden Markov models and neural network derived poses

2017 IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY) ◽

10.1109/sisy.2017.8080544 ◽

2017 ◽

Cited By ~ 2

Author(s):

Egbert Gedat ◽

Pascal Fechner ◽

Richard Fiebelkorn ◽

Ralf Vandenhouten

Keyword(s):

Neural Network ◽

Hidden Markov Models ◽

Action Recognition ◽

Markov Models ◽

Hidden Markov ◽

Human Action Recognition ◽

Human Action

Human Action Recognition Using Improved Salient Dense Trajectories

Computational Intelligence and Neuroscience ◽

10.1155/2016/6750459 ◽

2016 ◽

Vol 2016 ◽

pp. 1-11 ◽

Cited By ~ 3

Author(s):

Qingwu Li ◽

Haisu Cheng ◽

Yan Zhou ◽

Guanying Huo

Keyword(s):

Action Recognition ◽

State Of The Art ◽

Human Action Recognition ◽

Human Action ◽

Interest Points ◽

Dense Trajectories ◽

Dense Trajectory ◽

Sparse Coefficient ◽

Active Research ◽

Motion Saliency

Human action recognition in videos is a topic of active research in computer vision. Dense trajectory (DT) features were shown to be efficient for representing videos in state-of-the-art approaches. In this paper, we present a more effective approach to video representation using improved salient dense trajectories. First, we detect the motion salient region and extract the dense trajectories by tracking interest points in each spatial scale separately, and we then refine the dense trajectories via the analysis of the motion saliency. Then, we compute several descriptors (i.e., trajectory displacement, HOG, HOF, and MBH) in the spatiotemporal volume aligned with the trajectories. Finally, in order to represent the videos better, we optimize the framework of bag-of-words according to the motion salient intensity distribution and the idea of sparse coefficient reconstruction. Our architecture is trained and evaluated on the four standard video action datasets of KTH, UCF sports, HMDB51, and UCF50, and the experimental results show that our approach performs competitively compared with the state-of-the-art results.
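
Of the descriptors listed in this abstract, trajectory displacement is the simplest to show in code. The sketch below follows the normalization commonly used with dense trajectories, where the frame-to-frame displacement vectors are divided by the sum of their magnitudes. Whether this paper uses exactly that normalization is an assumption, and the function name and example track are hypothetical.

```python
import numpy as np

def trajectory_displacement(points):
    """Flattened displacement vectors of a track, normalized by total path length."""
    points = np.asarray(points, dtype=float)
    steps = np.diff(points, axis=0)              # displacement between consecutive frames
    total = np.linalg.norm(steps, axis=1).sum()  # sum of displacement magnitudes
    if total == 0:
        return np.zeros(steps.size)              # stationary track: all-zero descriptor
    return (steps / total).ravel()

# Hypothetical tracked point over four frames, as (x, y) positions.
track = [(0, 0), (3, 4), (6, 8), (6, 8)]
desc = trajectory_displacement(track)
print(desc)
```

Dividing by the total path length makes the descriptor depend on the shape of the motion rather than on its absolute scale.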
