Deep Temporal–Spatial Aggregation for Video-Based Facial Expression Recognition

Symmetry ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 52 ◽  
Author(s):  
Xianzhang Pan ◽  
Wenping Guo ◽  
Xiaoying Guo ◽  
Wenshu Li ◽  
Junjie Xu ◽  
...  

The proposed method has 30 streams, i.e., 15 spatial streams and 15 temporal streams, with each spatial stream paired with a corresponding temporal stream; this pairing is what connects the work to the symmetry concept. Classifying video-based facial expressions is a difficult task owing to the gap between visual descriptors and emotions. To bridge this gap, a new video descriptor for facial expression recognition is presented that aggregates spatial and temporal convolutional features across the entire extent of a video. The designed framework integrates the 30 streams with a trainable spatial–temporal feature-aggregation layer and is end-to-end trainable for video-based facial expression recognition. It can therefore effectively avoid overfitting to the limited emotional video datasets, and the trainable aggregation learns a better representation of the entire video. Different schemes for pooling spatial–temporal features are investigated, and the spatial and temporal streams are best aggregated by the proposed method. Extensive experiments on two public databases, BAUM-1s and eNTERFACE05, show that the framework has promising performance and outperforms state-of-the-art strategies.
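The trainable aggregation layer described above can be illustrated with a minimal NumPy sketch (not the authors' implementation): paired spatial and temporal stream features are pooled with a learned weight per stream pair, then concatenated into one video descriptor. The stream count, feature size, and the softmax weighting are illustrative assumptions.

```python
import numpy as np

def aggregate_streams(spatial_feats, temporal_feats, w):
    """Pool paired stream features with trainable weights (illustrative).

    spatial_feats, temporal_feats: (n_streams, feat_dim) arrays, one row
    per stream; w: (n_streams,) aggregation weights (fixed here, but
    learned end-to-end in a real system).
    """
    # Softmax over stream weights so the pooled feature is a convex
    # combination of the per-stream features.
    a = np.exp(w - w.max())
    a /= a.sum()
    pooled_spatial = a @ spatial_feats     # (feat_dim,)
    pooled_temporal = a @ temporal_feats   # (feat_dim,)
    # Concatenate the two pooled descriptors into one video descriptor.
    return np.concatenate([pooled_spatial, pooled_temporal])

rng = np.random.default_rng(0)
desc = aggregate_streams(rng.normal(size=(15, 8)),   # 15 spatial streams
                         rng.normal(size=(15, 8)),   # 15 temporal streams
                         np.zeros(15))               # uniform weights
print(desc.shape)
```

With uniform weights this reduces to average pooling; training the weights lets the layer emphasize the more discriminative stream pairs.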

Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 2003 ◽  
Author(s):  
Xiaoliang Zhu ◽  
Shihao Ye ◽  
Liang Zhao ◽  
Zhicheng Dai

As a sub-challenge of EmotiW (the Emotion Recognition in the Wild challenge), improving performance on the AFEW (Acted Facial Expressions in the Wild) dataset is a popular benchmark for emotion recognition under various constraints, including uneven illumination, head deflection, and facial posture. In this paper, we propose a convenient facial expression recognition cascade network comprising spatial feature extraction, hybrid attention, and temporal feature extraction. First, faces are detected in each frame of a video sequence, and the corresponding face ROI (region of interest) is extracted to obtain the face images. The face images in each frame are then aligned based on the positions of the facial feature points. Second, the aligned face images are input to a residual neural network to extract the spatial features of the facial expressions, and these spatial features are passed to the hybrid attention module to obtain fused expression features. Finally, the fused features are fed to a gated recurrent unit to extract temporal features, which are input to a fully connected layer to classify the facial expressions. Experiments on the CK+ (Extended Cohn-Kanade), Oulu-CASIA (Institute of Automation, Chinese Academy of Sciences), and AFEW datasets yielded recognition accuracies of 98.46%, 87.31%, and 53.44%, respectively. This demonstrates that the proposed method not only achieves performance competitive with state-of-the-art methods but also improves accuracy on the AFEW dataset by more than 2%, showing a significant advantage for facial expression recognition in natural environments.
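The cascade (spatial extraction → attention → GRU → fully connected classifier) can be sketched end to end in NumPy. This is a toy stand-in, not the paper's network: the random weights, tanh "spatial extractor", and single-vector attention replace a trained ResNet and hybrid attention module, and only the GRU recurrence is structurally faithful.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, C, T = 16, 8, 7, 5   # feature dim, GRU hidden size, classes, frames

# Randomly initialised stand-in weights (a real system trains these).
W_spat = rng.normal(scale=0.1, size=(D, D))
w_att  = rng.normal(scale=0.1, size=D)
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(D + H, H)) for _ in range(3))
W_fc = rng.normal(scale=0.1, size=(H, C))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h):
    """One gated-recurrent-unit step over the fused frame features."""
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ Wz)                       # update gate
    r = sigmoid(xh @ Wr)                       # reset gate
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)
    return (1 - z) * h + z * h_tilde

def classify_clip(frames):
    """frames: (T, D) aligned-face descriptors, one row per frame."""
    spatial = np.tanh(frames @ W_spat)                # spatial extraction
    att = np.exp(spatial @ w_att); att /= att.sum()   # attention over frames
    fused = spatial * att[:, None]                    # fused features
    h = np.zeros(H)
    for x in fused:                                   # temporal extraction
        h = gru_step(x, h)
    logits = h @ W_fc                                 # FC classifier
    return int(np.argmax(logits))

label = classify_clip(rng.normal(size=(T, D)))
print(label)
```

The point of the sketch is the data flow: per-frame spatial features are reweighted by attention before the recurrence, so the GRU sees frames already scaled by their estimated importance.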


2013 ◽  
Vol 347-350 ◽  
pp. 3780-3785
Author(s):  
Jing Jie Yan ◽  
Ming Han Xin

Although spatio-temporal (ST) features have recently been developed and shown to be effective for facial expression and behavior recognition in videos, they directly flatten each cuboid into a vector used as the feature vector for recognition, which makes the resulting vector potentially sensitive to small cuboid perturbations or noise. To overcome this drawback, we propose a novel fused spatio-temporal features (FST) method that uses separable linear filters to detect interest points and fuses two cuboid representations, a local histogrammed gradient descriptor and the flattened cuboid vector, as the cuboid descriptor. The proposed FST method is robust to small cuboid perturbations and noise while preserving both spatial and temporal positional information. Experimental results on two video-based facial expression databases demonstrate the effectiveness of the proposed method.
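The fused descriptor can be illustrated with a minimal NumPy sketch, assuming a small (time, height, width) cuboid: the flattened intensity vector keeps positional information, while a magnitude-weighted gradient-orientation histogram adds robustness to small perturbations. The bin count and gradient scheme are illustrative choices, not the paper's exact descriptor.

```python
import numpy as np

def cuboid_descriptor(cuboid, n_bins=8):
    """Fuse two cuboid representations in the spirit of FST: the flattened
    vector (positional information) plus a histogram of gradient
    orientations (robustness to small perturbations)."""
    flat = cuboid.ravel().astype(float)
    # Spatial gradients within each temporal slice.
    gy, gx = np.gradient(cuboid.astype(float), axis=(1, 2))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)               # orientation in (-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi),
                           weights=mag)
    hist /= (hist.sum() + 1e-12)           # normalise the histogram
    return np.concatenate([flat, hist])

rng = np.random.default_rng(2)
cub = rng.normal(size=(3, 4, 4))           # (time, height, width) cuboid
desc = cuboid_descriptor(cub)
print(desc.shape)  # 3*4*4 flattened values + 8 histogram bins
```

Because the histogram pools gradients over the whole cuboid, a small shift of the content changes it far less than it changes the flattened vector, which is the trade-off the fusion exploits.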


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Olalekan Agbolade ◽  
Azree Nazri ◽  
Razali Yaakob ◽  
Abdul Azim Ghani ◽  
Yoke Kqueen Cheah

Abstract Background Expression in H. sapiens plays a remarkable role in social communication. Humans identify such expressions relatively easily and accurately; achieving the same result in 3D by machine, however, remains a challenge in computer vision, owing to current difficulties in 3D facial data acquisition such as the lack of homology and the complex mathematical analysis required for facial point digitization. This study proposes facial expression recognition in humans using Multi-points Warping for 3D facial landmarks, building a template mesh as a reference object. The template mesh is applied to each target mesh in the Stirling/ESRC and Bosphorus datasets. The semi-landmarks are allowed to slide along tangents to the curves and surfaces until the bending energy between the template and a target form is minimal, and localization error is assessed using Procrustes ANOVA. Principal Component Analysis (PCA) is used for feature selection, and classification is done using Linear Discriminant Analysis (LDA). Result The localization error is validated on the two datasets with superior performance over state-of-the-art methods, and variation in expression is visualized using Principal Components (PCs); the deformations show the various expression regions of the faces. The results indicate that the Sad expression has the lowest recognition accuracy on both datasets. The classifier achieved recognition accuracies of 99.58% and 99.32% on Stirling/ESRC and Bosphorus, respectively. Conclusion The results demonstrate that the method is robust and in agreement with state-of-the-art results.
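The PCA-then-LDA stage can be sketched in plain NumPy on toy data (not the landmark features of the paper): PCA via SVD reduces the feature dimension, then a two-class Fisher discriminant classifies in the reduced space. The dimensions, class count, and regularisation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def pca(X, k):
    """Project X (n_samples, n_feats) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def fisher_lda(X, y):
    """Two-class Fisher discriminant: w = Sw^{-1} (mu1 - mu0)."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)   # within-class scatter
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), m1 - m0)
    thresh = 0.5 * (m0 + m1) @ w                     # midpoint threshold
    return lambda Z: (Z @ w > thresh).astype(int)

# Toy landmark-like data: two expression classes with shifted means.
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(2, 1, (50, 10))])
y = np.repeat([0, 1], 50)
Z = pca(X, 3)                     # PCA feature selection
predict = fisher_lda(Z, y)        # LDA classification
acc = (predict(Z) == y).mean()
print(acc)
```

On well-separated toy classes like these the training accuracy is near 1.0; the interesting part is that LDA operates on only 3 PCA components rather than the raw 10-dimensional features.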


Algorithms ◽  
2019 ◽  
Vol 12 (11) ◽  
pp. 227 ◽  
Author(s):  
Yingying Wang ◽  
Yibin Li ◽  
Yong Song ◽  
Xuewen Rong

In recent years, with the development of artificial intelligence and human–computer interaction, more attention has been paid to the recognition and analysis of facial expressions. Despite great progress, many problems remain because facial expressions are subtle and complex, so facial expression recognition is still challenging. Most papers choose the entire face image as the input. In daily life, however, people can perceive others' current emotions from only a few facial components (such as the eyes, mouth, and nose), while other areas of the face (such as hair, skin tone, and ears) play a smaller role in determining emotion. If the entire face image is the only input, the system introduces unnecessary information and misses important information during feature extraction. To solve this problem, this paper proposes a method that combines multiple sub-regions and the entire face image by weighting, capturing more of the important feature information and thereby improving recognition accuracy. The proposed method was evaluated on four well-known publicly available facial expression databases, JAFFE, CK+, FER2013, and SFEW, and showed better performance than most state-of-the-art methods.
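The weighted combination of sub-region and whole-face features can be sketched as follows. The region names, feature size, and weight values are hypothetical; in the paper's setting the features would come from a learned extractor and the weights would be tuned rather than fixed.

```python
import numpy as np

def weighted_face_descriptor(regions, whole, weights):
    """Combine per-component features (eyes, mouth, nose, ...) with the
    whole-face feature by weighting into a single descriptor.

    regions: dict name -> (d,) feature; whole: (d,) whole-face feature;
    weights: dict with one weight per region plus 'whole'.
    """
    desc = weights["whole"] * whole
    for name, feat in regions.items():
        desc = desc + weights[name] * feat
    return desc

rng = np.random.default_rng(4)
d = 6
regions = {k: rng.normal(size=d) for k in ("eyes", "mouth", "nose")}
# Hypothetical weights: salient components weighted above the whole face.
weights = {"eyes": 0.3, "mouth": 0.3, "nose": 0.2, "whole": 0.2}
desc = weighted_face_descriptor(regions, rng.normal(size=d), weights)
print(desc.shape)
```

Weighting rather than concatenating keeps the descriptor dimension fixed while still letting the emotion-bearing components dominate.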


Author(s):  
Weicheng Xie ◽  
Linlin Shen ◽  
Meng Yang ◽  
Zhihui Lai

Facial expression recognition has many applications in human–computer interaction. Although feature extraction and selection have been well studied, the specificity of each expression variation is not fully explored in state-of-the-art works. In this work, the multiclass expression recognition problem is converted into triplet-wise expression recognition. For each expression triplet, a new feature optimization model based on Action Unit (AU) weighting and patch weight optimization is proposed to represent the specificity of that triplet. A sparse-representation-based approach is then proposed to detect the active AUs of a testing sample for better generalization. The algorithm achieved competitive accuracies of 89.67% and 94.09% on the JAFFE and CK+ databases, respectively, and better cross-database performance has also been observed.
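The conversion from multiclass to triplet-wise recognition can be sketched generically: run a specialised classifier for every 3-class subset and majority-vote over the triplet predictions. The nearest-centroid stand-in below replaces the paper's AU-weighted triplet models and is purely illustrative.

```python
from itertools import combinations
import numpy as np

def triplet_wise_predict(x, classes, triplet_clf):
    """Multiclass decision via triplet-wise recognition: each 3-class
    subset votes with its own specialised classifier.

    triplet_clf(x, triplet) -> predicted label within that triplet."""
    votes = {c: 0 for c in classes}
    for triplet in combinations(classes, 3):
        votes[triplet_clf(x, triplet)] += 1
    return max(votes, key=votes.get)

# Toy stand-in classifier: nearest class centroid within the triplet.
rng = np.random.default_rng(5)
classes = ["anger", "joy", "sad", "fear", "neutral"]
centroids = {c: rng.normal(size=4) for c in classes}

def nearest_in_triplet(x, triplet):
    return min(triplet, key=lambda c: np.linalg.norm(x - centroids[c]))

pred = triplet_wise_predict(centroids["joy"] + 0.01, classes,
                            nearest_in_triplet)
print(pred)  # "joy"
```

With 5 classes there are C(5,3) = 10 triplets, and the true class appears in 6 of them, so a sample near one centroid wins the vote even if the remaining triplets disagree.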


2017 ◽  
Vol 77 (1) ◽  
pp. 917-937 ◽  
Author(s):  
Muhammad Hameed Siddiqi ◽  
Maqbool Ali ◽  
Mohamed Elsayed Abdelrahman Eldib ◽  
Asfandyar Khan ◽  
Oresti Banos ◽  
...  

Sensors ◽  
2020 ◽  
Vol 20 (18) ◽  
pp. 5184
Author(s):  
Min Kyu Lee ◽  
Dae Ha Kim ◽  
Byung Cheol Song

Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained on videos artificially acquired in a controlled environment, so they may not operate robustly on videos acquired in the wild, which suffer from varying illumination and head poses. To solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme with a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. First, to maximize the performance of the recurrent neural network (RNN) in the previous scheme, we propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level; novel fusion is achieved by applying attention, conditioned on the characteristics of each modality, to that modality's features. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy on the wild AFEW dataset, 98.5% on the CK+ dataset, and 81.9% on the MMI dataset, outperforming state-of-the-art networks.
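The frame substitution idea can be sketched in NumPy, with the importance scores and threshold as illustrative assumptions (the paper derives importance from the model, not from a fixed threshold): each low-importance frame's latent feature is replaced by that of its most correlated important frame.

```python
import numpy as np

def frame_substitution(feats, importance, thresh=0.5):
    """Replace latent features of less important frames with those of the
    most correlated important frame (inter-frame correlation).

    feats: (T, d) per-frame latent features; importance: (T,) scores."""
    feats = feats.astype(float)
    corr = np.corrcoef(feats)                 # (T, T) inter-frame correlation
    important = np.flatnonzero(importance >= thresh)
    out = feats.copy()
    for t in np.flatnonzero(importance < thresh):
        # The most correlated important frame stands in for frame t.
        best = important[np.argmax(corr[t, important])]
        out[t] = feats[best]
    return out

rng = np.random.default_rng(6)
feats = rng.normal(size=(4, 5))               # 4 frames, 5-dim latents
importance = np.array([0.9, 0.2, 0.8, 0.1])   # frames 0 and 2 are important
out = frame_substitution(feats, importance)
print(out.shape)
```

Important frames pass through unchanged, so the downstream RNN sees a sequence of the same length but dominated by the informative frames.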

