MFF-Net: Deepfake Detection Network Based on Multi-Feature Fusion

Lei Zhao; Mingcheng Zhang; Hongwei Ding; Xiaohui Cui

doi:10.3390/e23121692

MFF-Net: Deepfake Detection Network Based on Multi-Feature Fusion

Entropy ◽

10.3390/e23121692 ◽

2021 ◽

Vol 23 (12) ◽

pp. 1692

Author(s):

Lei Zhao ◽

Mingcheng Zhang ◽

Hongwei Ding ◽

Xiaohui Cui

Keyword(s):

Feature Extraction ◽

Semantic Information ◽

Feature Fusion ◽

State Of The Art ◽

Detection Methods ◽

Textural Features ◽

Detection Technology ◽

Rgb Images ◽

Signal Processing Methods ◽

Made In

Significant progress has been made in generating counterfeit images and videos. Forged videos generated by deepfaking have been widely spread and have caused severe societal impacts, which stir up public concern about automatic deepfake detection technology. Recently, many deepfake detection methods based on forged features have been proposed. Among the popular forged features, textural features are widely used. However, most of the current texture-based detection methods extract textures directly from RGB images, ignoring the mature spectral analysis methods. Therefore, this research proposes a deepfake detection network fusing RGB features and textural information extracted by neural networks and signal processing methods, namely, MFF-Net. Specifically, it consists of four key components: (1) a feature extraction module to further extract textural and frequency information using the Gabor convolution and residual attention blocks; (2) a texture enhancement module to zoom into the subtle textural features in shallow layers; (3) an attention module to force the classifier to focus on the forged part; (4) two instances of feature fusion to firstly fuse textural features from the shallow RGB branch and feature extraction module and then to fuse the textural features and semantic information. Moreover, we further introduce a new diversity loss to force the feature extraction module to learn features of different scales and directions. The experimental results show that MFF-Net has excellent generalization and has achieved state-of-the-art performance on various deepfake datasets.

Download Full-text

A Multi-Branch Feature Fusion Strategy Based on an Attention Mechanism for Remote Sensing Image Scene Classification

Remote Sensing ◽

10.3390/rs13101950 ◽

2021 ◽

Vol 13 (10) ◽

pp. 1950

Author(s):

Cuiping Shi ◽

Xin Zhao ◽

Liguo Wang

Keyword(s):

Remote Sensing ◽

Feature Extraction ◽

Classification Accuracy ◽

Feature Fusion ◽

State Of The Art ◽

Rapid Development ◽

Remote Sensing Image ◽

Classification Performance ◽

Attention Mechanism ◽

Scene Classification

In recent years, with the rapid development of computer vision, increasing attention has been paid to remote sensing image scene classification. To improve the classification performance, many studies have increased the depth of convolutional neural networks (CNNs) and expanded the width of the network to extract more deep features, thereby increasing the complexity of the model. To solve this problem, in this paper, we propose a lightweight convolutional neural network based on attention-oriented multi-branch feature fusion (AMB-CNN) for remote sensing image scene classification. Firstly, we propose two convolution combination modules for feature extraction, through which the deep features of images can be fully extracted with multi convolution cooperation. Then, the weights of the feature are calculated, and the extracted deep features are sent to the attention mechanism for further feature extraction. Next, all of the extracted features are fused by multiple branches. Finally, depth separable convolution and asymmetric convolution are implemented to greatly reduce the number of parameters. The experimental results show that, compared with some state-of-the-art methods, the proposed method still has a great advantage in classification accuracy with very few parameters.

Download Full-text

Action recognition based on 2D skeletons extracted from RGB videos

MATEC Web of Conferences ◽

10.1051/matecconf/201927702034 ◽

2019 ◽

Vol 277 ◽

pp. 02034

Author(s):

Sophie Aubry ◽

Sohaib Laraba ◽

Joëlle Tilmanne ◽

Thierry Dutoit

Keyword(s):

Neural Networks ◽

Image Classification ◽

Action Recognition ◽

State Of The Art ◽

Video Stream ◽

Motion Data ◽

Rgb Images ◽

Human Pose ◽

2D Images ◽

Made In

In this paper a methodology to recognize actions based on RGB videos is proposed which takes advantages of the recent breakthrough made in deep learning. Following the development of Convolutional Neural Networks (CNNs), research was conducted on the transformation of skeletal motion data into 2D images. In this work, a solution is proposed requiring only the use of RGB videos instead of RGB-D videos. This work is based on multiple works studying the conversion of RGB-D data into 2D images. From a video stream (RGB images), a two-dimension skeleton of 18 joints for each detected body is extracted with a DNN-based human pose estimator called OpenPose. The skeleton data are encoded into Red, Green and Blue channels of images. Different ways of encoding motion data into images were studied. We successfully use state-of-the-art deep neural networks designed for image classification to recognize actions. Based on a study of the related works, we chose to use image classification models: SqueezeNet, AlexNet, DenseNet, ResNet, Inception, VGG and retrained them to perform action recognition. For all the test the NTU RGB+D database is used. The highest accuracy is obtained with ResNet: 83.317% cross-subject and 88.780% cross-view which outperforms most of state-of-the-art results.

Download Full-text

SeDAR: Reading Floorplans Like a Human—Using Deep Learning to Enable Human-Inspired Localisation

International Journal of Computer Vision ◽

10.1007/s11263-019-01239-4 ◽

2019 ◽

Vol 128 (5) ◽

pp. 1286-1310 ◽

Cited By ~ 3

Author(s):

Oscar Mendez ◽

Simon Hadfield ◽

Nicolas Pugeault ◽

Richard Bowden

Keyword(s):

Deep Learning ◽

Semantic Information ◽

State Of The Art ◽

Depth Information ◽

Semantic Maps ◽

Novel Method ◽

Rgb Images ◽

High Level ◽

Robotic Tasks ◽

And Robotics

Abstract The use of human-level semantic information to aid robotic tasks has recently become an important area for both Computer Vision and Robotics. This has been enabled by advances in Deep Learning that allow consistent and robust semantic understanding. Leveraging this semantic vision of the world has allowed human-level understanding to naturally emerge from many different approaches. Particularly, the use of semantic information to aid in localisation and reconstruction has been at the forefront of both fields. Like robots, humans also require the ability to localise within a structure. To aid this, humans have designed high-level semantic maps of our structures called floorplans. We are extremely good at localising in them, even with limited access to the depth information used by robots. This is because we focus on the distribution of semantic elements, rather than geometric ones. Evidence of this is that humans are normally able to localise in a floorplan that has not been scaled properly. In order to grant this ability to robots, it is necessary to use localisation approaches that leverage the same semantic information humans use. In this paper, we present a novel method for semantically enabled global localisation. Our approach relies on the semantic labels present in the floorplan. Deep Learning is leveraged to extract semantic labels from RGB images, which are compared to the floorplan for localisation. While our approach is able to use range measurements if available, we demonstrate that they are unnecessary as we can achieve results comparable to state-of-the-art without them.

Download Full-text

Low Resolution Information Also Matters: Learning Multi-Resolution Representations for Person Re-Identification

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/179 ◽

2021 ◽

Author(s):

Guoqing Zhang ◽

Yuhao Chen ◽

Weisi Lin ◽

Arun Chandran ◽

Xuan Jing

Keyword(s):

Feature Extraction ◽

High Resolution ◽

Feature Fusion ◽

State Of The Art ◽

Super Resolution ◽

Input Image ◽

Low Resolution ◽

Joint Learning ◽

Novel Method ◽

Valid Information

As a prevailing task in video surveillance and forensics field, person re-identification (re-ID) aims to match person images captured from non-overlapped cameras. In unconstrained scenarios, person images often suffer from the resolution mismatch problem, i.e., Cross-Resolution Person Re-ID. To overcome this problem, most existing methods restore low resolution (LR) images to high resolution (HR) by super-resolution (SR). However, they only focus on the HR feature extraction and ignore the valid information from original LR images. In this work, we explore the influence of resolutions on feature extraction and develop a novel method for cross-resolution person re-ID called Multi-Resolution Representations Joint Learning (MRJL). Our method consists of a Resolution Reconstruction Network (RRN) and a Dual Feature Fusion Network (DFFN). The RRN uses an input image to construct a HR version and a LR version with an encoder and two decoders, while the DFFN adopts a dual-branch structure to generate person representations from multi-resolution images. Comprehensive experiments on five benchmarks verify the superiority of the proposed MRJL over the relevent state-of-the-art methods.

Download Full-text

Scene Text Detection with Supervised Pyramid Context Network

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33019038 ◽

2019 ◽

Vol 33 ◽

pp. 9038-9045 ◽

Cited By ~ 30

Author(s):

Enze Xie ◽

Yuhang Zang ◽

Shuai Shao ◽

Gang Yu ◽

Cong Yao ◽

...

Keyword(s):

Semantic Information ◽

State Of The Art ◽

Text Detection ◽

Detection Methods ◽

Natural Scenes ◽

The Past ◽

Scene Text Detection ◽

Scene Text ◽

Previous State ◽

Feature Pyramid

Scene text detection methods based on deep learning have achieved remarkable results over the past years. However, due to the high diversity and complexity of natural scenes, previous state-of-the-art text detection methods may still produce a considerable amount of false positives, when applied to images captured in real-world environments. To tackle this issue, mainly inspired by Mask R-CNN, we propose in this paper an effective model for scene text detection, which is based on Feature Pyramid Network (FPN) and instance segmentation. We propose a supervised pyramid context network (SPCNET) to precisely locate text regions while suppressing false positives.Benefited from the guidance of semantic information and sharing FPN, SPCNET obtains significantly enhanced performance while introducing marginal extra computation. Experiments on standard datasets demonstrate that our SPCNET clearly outperforms start-of-the-art methods. Specifically, it achieves an F-measure of 92.1% on ICDAR2013, 87.2% on ICDAR2015, 74.1% on ICDAR2017 MLT and 82.9% on

Download Full-text

LiDAR Odometry and Mapping Based on Semantic Information for Outdoor Environment

Remote Sensing ◽

10.3390/rs13152864 ◽

2021 ◽

Vol 13 (15) ◽

pp. 2864

Author(s):

Shitong Du ◽

Yifan Li ◽

Xuyou Li ◽

Menghao Wu

Keyword(s):

Feature Extraction ◽

Semantic Information ◽

State Of The Art ◽

Outdoor Environment ◽

Geometric Features ◽

Dynamic Object ◽

Object Removal ◽

Localization And Mapping ◽

High Level ◽

Crucial Part

Simultaneous Localization and Mapping (SLAM) in an unknown environment is a crucial part for intelligent mobile robots to achieve high-level navigation and interaction tasks. As one of the typical LiDAR-based SLAM algorithms, the Lidar Odometry and Mapping in Real-time (LOAM) algorithm has shown impressive results. However, LOAM only uses low-level geometric features without considering semantic information. Moreover, the lack of a dynamic object removal strategy limits the algorithm to obtain higher accuracy. To this end, this paper extends the LOAM pipeline by integrating semantic information into the original framework. Specifically, we first propose a two-step dynamic objects filtering strategy. Point-wise semantic labels are then used to improve feature extraction and searching for corresponding points. We evaluate the performance of the proposed method in many challenging scenarios, including highway, country and urban from the KITTI dataset. The results demonstrate that the proposed SLAM system outperforms the state-of-the-art SLAM methods in terms of accuracy and robustness.

Download Full-text

A vehicle re-identification framework based on the improved multi-branch feature fusion network

Scientific Reports ◽

10.1038/s41598-021-99646-6 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Leilei Rong ◽

Yan Xu ◽

Xiaolei Zhou ◽

Lisu Han ◽

Linghui Li ◽

...

Keyword(s):

Feature Extraction ◽

Intelligent Transportation System ◽

Feature Fusion ◽

State Of The Art ◽

Local Feature ◽

Learning Ability ◽

Target Vehicle ◽

Illumination Changes ◽

Global And Local ◽

Vehicle Movement

AbstractVehicle re-identification (re-id) aims to solve the problems of matching and identifying the same vehicle under the scenes across multiple surveillance cameras. For public security and intelligent transportation system (ITS), it is extremely important to locate the target vehicle quickly and accurately in the massive vehicle database. However, re-id of the target vehicle is very challenging due to many factors, such as the orientation variations, illumination changes, occlusion, low resolution, rapid vehicle movement, and amounts of similar vehicle models. In order to resolve the difficulties and enhance the accuracy for vehicle re-id, in this work, we propose an improved multi-branch network in which global–local feature fusion, channel attention mechanism and weighted local feature are comprehensively combined. Firstly, the fusion of global and local features is adopted to obtain more information of the vehicle and enhance the learning ability of the model; Secondly, the channel attention module in the feature extraction branch is embedded to extract the personalized features of the targeting vehicle; Finally, the background and noise information on feature extraction is controlled by weighted local feature. The results of comprehensive experiments on the mainstream evaluation datasets including VeRi-776, VRIC, and VehicleID indicate that our method can effectively improve the accuracy of vehicle re-identification and is superior to the state-of-the-art methods.

Download Full-text

A Lightweight YOLOv4-Based Forestry Pest Detection Method Using Coordinate Attention and Feature Fusion

Entropy ◽

10.3390/e23121587 ◽

2021 ◽

Vol 23 (12) ◽

pp. 1587

Author(s):

Mingfeng Zha ◽

Wenbin Qian ◽

Wenlong Yi ◽

Jing Hua

Keyword(s):

Detection Method ◽

Feature Fusion ◽

State Of The Art ◽

Detection Methods ◽

Model Parameters ◽

Symmetric Structure ◽

Proposed Model ◽

Pest Detection ◽

Feature Information ◽

Small Targets

Traditional pest detection methods are challenging to use in complex forestry environments due to their low accuracy and speed. To address this issue, this paper proposes the YOLOv4_MF model. The YOLOv4_MF model utilizes MobileNetv2 as the feature extraction block and replaces the traditional convolution with depth-wise separated convolution to reduce the model parameters. In addition, the coordinate attention mechanism was embedded in MobileNetv2 to enhance feature information. A symmetric structure consisting of a three-layer spatial pyramid pool is presented, and an improved feature fusion structure was designed to fuse the target information. For the loss function, focal loss was used instead of cross-entropy loss to enhance the network’s learning of small targets. The experimental results showed that the YOLOv4_MF model has 4.24% higher mAP, 4.37% higher precision, and 6.68% higher recall than the YOLOv4 model. The size of the proposed model was reduced to 1/6 of that of YOLOv4. Moreover, the proposed algorithm achieved 38.62% mAP with respect to some state-of-the-art algorithms on the COCO dataset.

Download Full-text

State-of-the-Art Model for Music Object Recognition with Deep Learning

Applied Sciences ◽

10.3390/app9132645 ◽

2019 ◽

Vol 9 (13) ◽

pp. 2645 ◽

Cited By ~ 4

Author(s):

Zhiqing Huang ◽

Xiang Jia ◽

Yifan Guo

Keyword(s):

Semantic Information ◽

Feature Fusion ◽

State Of The Art ◽

General Music ◽

Pitch Accuracy ◽

Detection Model ◽

The Core ◽

Music Score ◽

Music Recognition ◽

Music Information

Optical music recognition (OMR) is an area in music information retrieval. Music object detection is a key part of the OMR pipeline. Notes are used to record pitch and duration and have semantic information. Therefore, note recognition is the core and key aspect of music score recognition. This paper proposes an end-to-end detection model based on a deep convolutional neural network and feature fusion. This model is able to directly process the entire image and then output the symbol categories and the pitch and duration of notes. We show a state-of-the-art recognition model for general music symbols which can get 0.92 duration accurary and 0.96 pitch accuracy .

Download Full-text

Vehicle Re-identification in Multi-branches Network

10.21203/rs.3.rs-598208/v1 ◽

2021 ◽

Author(s):

Leilei Rong ◽

Yan Xu ◽

Xiaolei Zhou ◽

Lisu Han ◽

Linghui Li ◽

...

Keyword(s):

Feature Extraction ◽

Feature Fusion ◽

State Of The Art ◽

Complete Information ◽

Local Feature ◽

Learning Ability ◽

Traffic Surveillance ◽

Illumination Changes ◽

Global And Local ◽

Vehicle Movement

Abstract Vehicle re-identification (Re-ID) aims to solve the problem of matching and identifying the same vehicles under the scene of cross multiple surveillance cameras. Finding the target vehicle quickly and accurately in the massive vehicle database is extremely important for public security, traffic surveillance and applications on smart city. However, it is very challenging due to the orientation variations, illumination changes, occlusion, low resolution, rapid vehicle movement, and amounts of similar vehicle models. In order to overcome these problems and improve the accuracy of vehicle re-identification, a multi-branches network is proposed, which is integrated by global-local feature fusion, channel attention mechanism, and weighted local feature. First, the fusion of global and local features is to obtain more complete information of the vehicle and enhance the learning ability of the model; second, the purpose of embedding the channel attention module in the feature extraction branch is to extract the personalized feature of the vehicle; finally, the influence of sky area and noise information on feature extraction is weakened by weighted local feature. The comprehensive experiments implemented on the mainstream evaluation datasets including VeRi-776, VRIC, and VehicleID indicate that our method can effectively improve the accuracy of vehicle re-identification and is superior to the state-of-the-art methods.

Download Full-text