Deep Learning Based Human Activity Recognition Using Spatio-Temporal Image Formation of Skeleton Joints

Nusrat Tasnim; Mohammad Khairul Islam; Joong-Hwan Baek

doi:10.3390/app11062675

Applied Sciences ◽

10.3390/app11062675 ◽

2021 ◽

Vol 11 (6) ◽

pp. 2675

Author(s):

Nusrat Tasnim ◽

Mohammad Khairul Islam ◽

Joong-Hwan Baek

Keyword(s):

Deep Learning ◽

Activity Recognition ◽

Action Recognition ◽

Human Activity ◽

Image Formation ◽

Human Action Recognition ◽

Human Action ◽

Human Activity Recognition ◽

3D Skeleton ◽

Spatio Temporal

Human activity recognition has become a significant research trend in the fields of computer vision, image processing, and human–machine or human–object interaction due to cost-effectiveness, time management, rehabilitation, and the pandemic of diseases. Over the past years, several methods have been published for human action recognition using RGB (red, green, and blue), depth, and skeleton datasets. Most of the methods introduced for action classification using skeleton datasets are constrained in some respects, including feature representation, complexity, and performance. Providing an effective and efficient method for human action discrimination using a 3D skeleton dataset therefore remains a challenging problem. There is considerable room to map the 3D skeleton joint coordinates into spatio-temporal formats in order to reduce the complexity of the system, recognize human behaviors more accurately, and improve overall performance. In this paper, we propose a spatio-temporal image formation (STIF) technique for 3D skeleton joints that captures spatial information and temporal changes for action discrimination. We apply transfer learning (the pretrained models MobileNetV2, DenseNet121, and ResNet18, trained on the ImageNet dataset) to extract discriminative features and evaluate the proposed method with several fusion techniques. We mainly investigate the effect of three fusion methods (element-wise average, multiplication, and maximization) on human action recognition performance. With the STIF representation, our deep learning-based method outperforms prior works on UTD-MHAD (University of Texas at Dallas multi-modal human action dataset) and MSR-Action3D (Microsoft action 3D), two publicly available benchmark 3D skeleton datasets. We attain accuracies of approximately 98.93%, 99.65%, and 98.80% for UTD-MHAD and 96.00%, 98.75%, and 97.08% for MSR-Action3D skeleton datasets using MobileNetV2, DenseNet121, and ResNet18, respectively.
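
The three element-wise fusion rules named in this abstract (average, multiplication, maximization) can be illustrated with a minimal sketch. The abstract does not say which tensors are fused, so applying the rules to per-model class-score vectors is an assumption here, and the function name `fuse` and the example scores are hypothetical.

```python
import numpy as np

def fuse(outputs, method="average"):
    """Fuse same-shaped per-model outputs element-wise.

    Assumption: `outputs` are class-score vectors; the abstract does not
    state whether scores or intermediate features are fused.
    """
    stacked = np.stack(outputs, axis=0)  # shape: (n_models, n_classes)
    if method == "average":
        return stacked.mean(axis=0)
    if method == "multiplication":
        return stacked.prod(axis=0)
    if method == "maximization":
        return stacked.max(axis=0)
    raise ValueError(f"unknown fusion method: {method}")

# Hypothetical softmax outputs of three models for a 3-class problem.
scores_a = np.array([0.7, 0.2, 0.1])
scores_b = np.array([0.5, 0.4, 0.1])
scores_c = np.array([0.2, 0.6, 0.2])

for method in ("average", "multiplication", "maximization"):
    fused = fuse([scores_a, scores_b, scores_c], method)
    print(method, fused, "-> predicted class", int(fused.argmax()))
```

Multiplication penalizes any class that a single model scores low, while maximization follows the most confident model, which is one plausible reason the choice of rule could change accuracy.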

Patient Monitoring by Abnormal Human Activity Recognition Based on CNN Architecture

Electronics ◽

10.3390/electronics9121993 ◽

2020 ◽

Vol 9 (12) ◽

pp. 1993

Author(s):

Malik Ali Gul ◽

Muhammad Haroon Yousaf ◽

Shah Nawaz ◽

Zaka Ur Rehman ◽

HyungWon Kim

Keyword(s):

Activity Recognition ◽

Action Recognition ◽

Human Activity ◽

Patient Monitoring ◽

Human Action Recognition ◽

Confidence Score ◽

Human Action ◽

Human Activity Recognition ◽

Video Sequences ◽

Human Actions

Human action recognition has emerged as a challenging research domain for video understanding and analysis. Consequently, extensive research has been conducted to improve the recognition of human actions. Human activity recognition has various real-time applications, such as patient monitoring, in which patients are monitored among a group of normal people and identified based on their abnormal activities. Our goal is to provide multi-class abnormal action detection for individuals as well as groups in video sequences, in order to differentiate multiple abnormal human actions. In this paper, the You Only Look Once (YOLO) network is used as the backbone CNN model. To train the CNN model, we constructed a large dataset of patient videos by labeling each frame with a set of patient actions and the patient's positions. We retrained the backbone CNN model with 23,040 labeled images of patients' actions for 32 epochs. For each frame, the proposed model assigned a confidence score and an action label, and a video sequence was labeled by finding its recurrent action label. The present study shows that the accuracy of abnormal action recognition is 96.8%. Our proposed approach differentiated abnormal actions with an improved F1-score of 89.2%, which is higher than that of state-of-the-art techniques. The results indicate that the proposed framework can be beneficial to hospitals and elder care homes for patient monitoring.
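
One way to read "finding the recurrent action label" is a majority vote over per-frame predictions. The sketch below assumes that reading; the function name `sequence_label`, the label strings, and the use of the mean confidence of the winning label are all illustrative assumptions, not details given in the abstract.

```python
from collections import Counter

def sequence_label(frame_predictions):
    """Return the most frequent frame label and its mean confidence.

    `frame_predictions` is a list of (label, confidence) pairs, one per frame.
    Majority voting is an assumed interpretation of the abstract.
    """
    if not frame_predictions:
        raise ValueError("no frame predictions")
    counts = Counter(label for label, _ in frame_predictions)
    winner, _ = counts.most_common(1)[0]
    confidences = [conf for label, conf in frame_predictions if label == winner]
    return winner, sum(confidences) / len(confidences)

# Hypothetical per-frame detector outputs for one short clip.
frames = [("falling", 0.91), ("sitting", 0.55), ("falling", 0.87), ("falling", 0.80)]
label, confidence = sequence_label(frames)
print(label, round(confidence, 2))
```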

Human Action Recognition Using Median Background and Max Pool Convolution with Nearest Neighbor

International Journal of Ambient Computing and Intelligence ◽

10.4018/ijaci.2019040103 ◽

2019 ◽

Vol 10 (2) ◽

pp. 34-47 ◽

Cited By ~ 1

Author(s):

Bagavathi Lakshmi ◽

S. Parthasarathy

Keyword(s):

Machine Learning ◽

Activity Recognition ◽

Action Recognition ◽

Human Activity ◽

Nearest Neighbor ◽

Human Action Recognition ◽

Human Action ◽

Human Activity Recognition ◽

Machine Learning Algorithms ◽

Support Vector

Discovering human activities on mobile devices is a challenging task for human action recognition. The ability of a device to recognize its user's activity is important because it enables context-aware applications and behavior. Recently, machine learning algorithms have been increasingly used for human action recognition. During the past few years, principal component analysis and support vector machines have been widely used for robust human activity recognition. However, given the global dynamic tendency and the complex tasks involved, such robust human activity recognition (HAR) suffers from error and complexity. To deal with this problem, a machine learning algorithm is proposed and its application to HAR is explored. In this article, a Max Pool Convolution Neural Network based on Nearest Neighbor (MPCNN-NN) is proposed to perform efficient and effective HAR using smartphone sensors by exploiting the inherent characteristics. The MPCNN-NN framework for HAR consists of three steps. In the first step, for each activity, the features of interest, or foreground frame, are detected using Median Background Subtraction. The second step organizes the features (i.e., postures), selecting the strongest generic discriminating features based on Max Pool. The third and final step performs HAR based on Nearest Neighbor, choosing the posture that maximizes the probability. Experiments have been conducted to demonstrate the superiority of the proposed MPCNN-NN framework on a human action dataset, KARD (Kinect Activity Recognition Dataset).
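
The three steps listed in this abstract can be sketched with generic building blocks: per-pixel temporal median background subtraction, non-overlapping max pooling, and a 1-nearest-neighbor lookup. This is not the authors' MPCNN-NN implementation; the function names, the threshold of 30, the pooling size, and Euclidean distance are all assumptions chosen for illustration.

```python
import numpy as np

def median_background_mask(frames, threshold=30):
    """Foreground masks: pixels that differ from the per-pixel temporal median."""
    frames = np.asarray(frames, dtype=np.float32)    # (T, H, W) grayscale frames
    background = np.median(frames, axis=0)           # (H, W) background estimate
    return np.abs(frames - background) > threshold   # (T, H, W) boolean masks

def max_pool2d(image, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are cropped."""
    h, w = image.shape
    h, w = h - h % size, w - w % size
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def nearest_neighbor(query, train_features, train_labels):
    """Label of the training vector closest to `query` (Euclidean distance)."""
    distances = np.linalg.norm(np.asarray(train_features) - query, axis=1)
    return train_labels[int(distances.argmin())]

# Synthetic clip: a static dark scene with one bright pixel in frame 2.
clip = np.zeros((5, 4, 4))
clip[2, 1, 1] = 255
masks = median_background_mask(clip)
pooled = max_pool2d(masks[2].astype(float))
print(pooled)
```

In a full pipeline the pooled foreground maps would be flattened into feature vectors and passed to `nearest_neighbor` together with labeled training vectors.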

A spatio-temporal deep learning approach for human action recognition in infrared videos

Optics and Photonics for Information Processing XII ◽

10.1117/12.2502993 ◽

2018 ◽

Cited By ~ 1

Author(s):

Naga Vara Aparna Akula ◽

Anuj K. Shah ◽

Ripul Ghosh

Keyword(s):

Deep Learning ◽

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Learning Approach ◽

Spatio Temporal

Sensor-Based Human Activity Recognition with Spatio-Temporal Deep Learning

Sensors ◽

10.3390/s21062141 ◽

2021 ◽

Vol 21 (6) ◽

pp. 2141

Author(s):

Ohoud Nafea ◽

Wadood Abdul ◽

Ghulam Muhammad ◽

Mansour Alsulaiman

Keyword(s):

Deep Learning ◽

Activity Recognition ◽

Human Activity ◽

Short Term Memory ◽

Human Activity Recognition ◽

Sensor Data ◽

Effective Selection ◽

Spatio Temporal ◽

High Level ◽

Improved Accuracy

Human activity recognition (HAR) remains a challenging yet crucial problem to address in computer vision. HAR is primarily intended to be used with other technologies, such as the Internet of Things, to assist in healthcare and eldercare. With the development of deep learning, automatic high-level feature extraction has become possible and has been used to optimize HAR performance. Furthermore, deep-learning techniques have been applied in various fields for sensor-based HAR. This study introduces a new methodology that uses convolutional neural networks (CNN) with varying kernel dimensions along with bi-directional long short-term memory (BiLSTM) to capture features at various resolutions. The novelty of this research lies in the effective selection of the optimal video representation and in the effective extraction of spatial and temporal features from sensor data using traditional CNN and BiLSTM. The Wireless Sensor Data Mining (WISDM) and UCI datasets, whose data are collected through diverse means including accelerometers, gyroscopes, and other sensors, are used to evaluate the proposed methodology. The results indicate that the proposed scheme is efficient in improving HAR. Unlike other available methods, the proposed method improved accuracy, attaining a higher score on the WISDM dataset than on the UCI dataset (98.53% vs. 97.05%).
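
The idea of convolving a sensor signal with kernels of different widths to capture features at several resolutions can be shown with a small NumPy sketch. It is only an illustration under strong assumptions: the kernels are fixed averaging filters rather than learned weights, the BiLSTM stage is omitted, and the function name `multi_scale_features` and the kernel sizes are hypothetical.

```python
import numpy as np

def multi_scale_features(signal, kernel_sizes=(3, 5, 7)):
    """One 1-D convolution per kernel size, then ReLU and global max pooling.

    Assumption: fixed averaging kernels stand in for learned CNN filters.
    """
    signal = np.asarray(signal, dtype=np.float64)
    features = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k                                # averaging kernel of width k
        response = np.convolve(signal, kernel, mode="valid")   # 1-D convolution
        response = np.maximum(response, 0.0)                   # ReLU
        features.append(response.max())                        # global max pooling
    return np.array(features)

# Hypothetical accelerometer trace with a single short spike.
trace = [0, 0, 3, 0, 0, 0, 0]
feats = multi_scale_features(trace)
print(feats)
```

A narrow kernel responds strongly to the short spike, while wider kernels dilute it, which is the sense in which different kernel widths see the signal at different resolutions.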

Chord-Length Shape Features for Human Activity Recognition

ISRN Machine Vision ◽

10.5402/2012/872131 ◽

2012 ◽

Vol 2012 ◽

pp. 1-9 ◽

Cited By ~ 3

Author(s):

Samy Sadek ◽

Ayoub Al-Hamadi ◽

Bernd Michaelis ◽

Usama Sayed

Keyword(s):

Activity Recognition ◽

Human Activity ◽

Human Action Recognition ◽

Shape Descriptor ◽

Human Action ◽

Human Activity Recognition ◽

Chord Length ◽

Shape Features ◽

Computationally Efficient ◽

Fuzzy Membership Functions

Despite their high stability and compactness, chord-length shape features have received relatively little attention in the human action recognition literature. In this paper, we present a new approach for human activity recognition based on chord-length shape features. The contribution of this paper is twofold. First, we show how a compact, computationally efficient shape descriptor, the chord-length shape features, is constructed using 1-D chord-length functions. Second, we show how to use fuzzy membership functions to partition action snippets into a number of temporal states. On two benchmark action datasets (KTH and WEIZMANN), the approach yields promising results that compare favorably with those previously reported in the literature, while maintaining real-time performance.
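
Partitioning an action snippet into temporal states with fuzzy membership functions can be sketched as follows. The abstract does not specify the shape of the membership functions, so evenly spaced triangular functions over normalized time are an assumption, and the function name `triangular_memberships` is hypothetical.

```python
import numpy as np

def triangular_memberships(n_frames, n_states):
    """Membership of each frame in each temporal state, shape (n_frames, n_states).

    Assumption: evenly spaced triangular functions; requires n_states >= 2.
    """
    t = np.linspace(0.0, 1.0, n_frames)          # normalized frame times
    centers = np.linspace(0.0, 1.0, n_states)    # evenly spaced state centers
    width = centers[1] - centers[0]              # spacing between adjacent centers
    m = 1.0 - np.abs(t[:, None] - centers[None, :]) / width
    return np.clip(m, 0.0, 1.0)

# A 5-frame snippet softly assigned to 3 temporal states.
memberships = triangular_memberships(5, 3)
print(memberships)
```

With this choice each frame's memberships sum to one, so per-frame shape features can be accumulated into state-level descriptors as weighted sums.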

Remarkable Skeleton Based Human Action Recognition

Artificial Intelligence Evolution ◽

10.37256/aie.122020562 ◽

2020 ◽

pp. 109-122

Author(s):

Sushma Jaiswal ◽

Tarun Jaiswal

Keyword(s):

Deep Learning ◽

Cognitive Science ◽

Action Recognition ◽

Human Action Recognition ◽

Human Action ◽

Video Frame ◽

Learning Method ◽

Long Time ◽

3D Skeleton

Skeleton-based human-action-recognition (SBHAR) has wide applications in cognitive science and automatic surveillance. However, the most challenging and crucial issue in SBHAR is the significant view variation that occurs while capturing the data. A significant amount of satisfactory work has already been done in this area, including methods based on Red Green Blue (RGB) data. The performance of SBHAR is also affected by various factors such as video frame setting, view variations in motion, different backgrounds, and inter-personal differences. In this survey, we explicitly address these challenges and provide a complete overview of advancement in this field. Deep learning methods have been used in this field for a long time, but so far no research has fully demonstrated their usefulness. In this paper, we first highlight the need for action recognition and the significance of 3D skeleton data, and then survey the largest 3D skeleton dataset, NTU-RGB+D, and its new version, NTU-RGB+D 120.
