A Review of Video Object Detection: Datasets, Metrics and Methods

2020 ◽  
Vol 10 (21) ◽  
pp. 7834
Author(s):  
Haidi Zhu ◽  
Haoran Wei ◽  
Baoqing Li ◽  
Xiaobing Yuan ◽  
Nasser Kehtarnavaz

Although there are well-established object detection methods based on static images, applying them to video data on a frame-by-frame basis faces two shortcomings: (i) lack of computational efficiency, due to the redundancy across image frames and the failure to exploit the temporal and spatial correlation of features across frames, and (ii) lack of robustness to real-world conditions such as motion blur and occlusion. Since the introduction of the video object detection task in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015, a growing number of methods have appeared in the literature on video object detection, many of which have utilized deep learning models. The aim of this paper is to provide a review of these papers on video object detection. An overview of the existing datasets for video object detection, together with commonly used evaluation metrics, is first presented. Video object detection methods are then categorized and each category is described. Two comparison tables are provided to contrast the methods in terms of both accuracy and computational efficiency. Finally, some future trends in video object detection to address the remaining challenges are noted.
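The first shortcoming above motivates a common remedy surveyed in such reviews: run the expensive detector backbone only on key frames and reuse propagated features on the frames in between. A minimal scheduling sketch; the `detect`, `extract`, and `propagate` interfaces are hypothetical placeholders, not from the paper:

```python
def detect_video(frames, detect, extract, propagate, key_interval=10):
    """Run the expensive feature extractor only on key frames; reuse
    cheaply propagated features on intermediate frames."""
    results, feat = [], None
    for i, frame in enumerate(frames):
        if i % key_interval == 0:
            feat = extract(frame)          # expensive backbone pass
        else:
            feat = propagate(feat, frame)  # cheap feature warp/update
        results.append(detect(feat))
    return results
```

With a key interval of 10, a 25-frame clip triggers only three full backbone passes (frames 0, 10, and 20), which is where the efficiency gain over naive frame-by-frame detection comes from.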

Author(s):  
Xingxing Wei ◽  
Siyuan Liang ◽  
Ning Chen ◽  
Xiaochun Cao

Identifying adversarial examples is beneficial for understanding deep networks and developing robust models. However, existing attack methods for image object detection have two limitations: weak transferability---the generated adversarial examples often have a low success rate when attacking other kinds of detection methods---and high computation cost---they need considerable time to process video data, where many frames must be perturbed. To address these issues, we present a generative method for obtaining adversarial images and videos, thereby significantly reducing the processing time. To enhance transferability, we manipulate the feature maps extracted by a feature network, which usually constitutes the basis of object detectors. Our method is based on the Generative Adversarial Network (GAN) framework, where we combine a high-level class loss and a low-level feature loss to jointly train the adversarial example generator. Experimental results on the PASCAL VOC and ImageNet VID datasets show that our method efficiently generates image and video adversarial examples and, more importantly, that these adversarial examples have better transferability, being able to simultaneously attack two kinds of representative object detection models: proposal-based models such as Faster R-CNN and regression-based models such as SSD.
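The joint objective described above combines a high-level class loss with a low-level feature loss. A toy sketch of that combination; the loss weights and the exact form of the feature term are illustrative assumptions, not the paper's implementation:

```python
def feature_loss(clean_feats, adv_feats):
    """Negative squared distance between clean and adversarial feature
    maps: minimizing it pushes the adversarial features away from the
    clean ones (assumed form of the low-level loss)."""
    return -sum((a - c) ** 2 for c, a in zip(clean_feats, adv_feats))

def joint_loss(class_loss, clean_feats, adv_feats, w_cls=1.0, w_feat=0.1):
    """Weighted sum of the high-level class loss and the low-level
    feature loss used to train the generator (weights assumed)."""
    return w_cls * class_loss + w_feat * feature_loss(clean_feats, adv_feats)
```

In a real attack the feature maps would be tensors from the shared backbone, but the weighting structure is the same.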


Object detection in videos has been gaining attention recently, as it is closely related to video analytics and facilitates image understanding. Video object detection methods can be divided into traditional and deep learning-based methods. Trajectory classification, low-rank sparse matrix methods, background subtraction, and object tracking are considered traditional object detection methods, as their primary focus is informative feature collection, region selection, and classification. Deep learning methods are more popular nowadays, as they provide high-level features and facilitate problem solving in object detection algorithms. We discuss various object detection methods and their challenges in this paper.
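Of the traditional methods listed above, background subtraction is the simplest to sketch: maintain a running-average background model and flag pixels that deviate from it. A minimal version on grayscale frames represented as flat pixel lists; the learning rate and threshold are assumed values:

```python
def background_subtract(frames, alpha=0.05, thresh=30):
    """Running-average background subtraction. Each frame is a flat
    list of grayscale pixel values; returns one boolean foreground
    mask per frame."""
    bg = [float(p) for p in frames[0]]
    masks = []
    for frame in frames:
        # Foreground = pixels far from the current background model.
        mask = [abs(p - b) > thresh for p, b in zip(frame, bg)]
        # Blend the current frame into the background model.
        bg = [(1 - alpha) * b + alpha * p for p, b in zip(frame, bg)]
        masks.append(mask)
    return masks
```

A pixel that briefly jumps (a moving object) is flagged, while slow illumination changes are absorbed into the background model via `alpha`.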


2021 ◽  
Vol 13 (13) ◽  
pp. 2459
Author(s):  
Yangyang Li ◽  
Heting Mao ◽  
Ruijiao Liu ◽  
Xuan Pei ◽  
Licheng Jiao ◽  
...  

Object detection in remote sensing images has been widely used in military and civilian fields and is a challenging task due to the complex backgrounds, large scale variations, and dense arrangements of objects in arbitrary orientations. In addition, existing object detection methods rely on increasingly deep networks, which add considerable computational overhead and parameters and are unfavorable for deployment on edge devices. In this paper, we propose a lightweight keypoint-based oriented object detector for remote sensing images. First, we propose a semantic transfer block (STB) for merging shallow and deep features, which reduces noise and restores semantic information. Then, the proposed adaptive Gaussian kernel (AGK) adapts to objects of different scales and further improves detection performance. Finally, we propose a distillation loss associated with object detection to obtain a lightweight student network. Experiments on the HRSC2016 and UCAS-AOD datasets show that the proposed method adapts to objects of different scales, obtains accurate bounding boxes, and reduces the influence of complex backgrounds. Comparison with mainstream methods shows that our method achieves comparable performance while remaining lightweight.
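The adaptive Gaussian kernel idea, letting the spread of a keypoint heatmap grow with object size, can be sketched as follows; the scale factor `k` and the sigma rule are illustrative assumptions, not the paper's exact formula:

```python
import math

def gaussian_heatmap(h, w, cx, cy, obj_w, obj_h, k=0.15):
    """Render an h-by-w keypoint heatmap centered at (cx, cy) whose
    Gaussian spread scales with the object's size (adaptive sigma)."""
    sigma = max(k * max(obj_w, obj_h), 1.0)
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]
```

A large object thus supervises a wide, soft peak while a small object gets a tight one, so a single kernel definition serves objects of very different scales.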


Micromachines ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 72
Author(s):  
Dengshan Li ◽  
Rujing Wang ◽  
Peng Chen ◽  
Chengjun Xie ◽  
Qiong Zhou ◽  
...  

Video object and human action detection are applied in many fields, such as video surveillance and face recognition. Video object detection includes object classification and object localization within the frame. Human action recognition is the detection of human actions in video. Video detection is usually more challenging than image detection, since video frames are often blurrier than still images; it also faces other difficulties, such as video defocus, motion blur, and partial occlusion. Nowadays, video detection technology can achieve real-time detection, or highly accurate detection on blurry video frames. In this paper, various video object and human action detection approaches are reviewed and discussed, many of which have achieved state-of-the-art results. We mainly review and discuss classic video detection methods based on supervised learning. In addition, the frequently used video object detection and human action recognition datasets are reviewed. Finally, a summary of video detection is presented: video object and human action detection methods can be classified into frame-by-frame (frame-based) detection, key-frame-based detection, and detection using temporal information; the main methods for utilizing temporal information across adjacent video frames are optical flow, Long Short-Term Memory, and convolution across adjacent frames.
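As a minimal example of the using-temporal-information category, per-frame detection confidences can be smoothed over a small window of adjacent frames. This is a simplistic stand-in for optical-flow or LSTM aggregation, not any specific method from the survey:

```python
def temporal_smooth(scores, window=1):
    """Average each frame's detection confidence with its neighbors
    within +/- `window` frames; stabilizes scores on blurry frames."""
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out
```

A single defocused frame whose confidence drops to zero is pulled back up by its sharp neighbors, which is the basic benefit all temporal-information methods exploit in richer forms.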


Sensors ◽  
2020 ◽  
Vol 20 (6) ◽  
pp. 1686 ◽  
Author(s):  
Feng Yang ◽  
Wentong Li ◽  
Haiwei Hu ◽  
Wanyi Li ◽  
Peng Wang

Accurate and robust detection of multi-class objects in very high resolution (VHR) aerial images plays a significant role in many real-world applications. Traditional detection methods have made remarkable progress with horizontal bounding boxes (HBBs) thanks to CNNs. However, HBB detection methods still exhibit limitations, including missed detections and redundant detection regions, especially for densely distributed and strip-like objects. Besides, large scale variations and diverse backgrounds also bring many challenges. To address these problems, an effective region-based object detection framework named Multi-scale Feature Integration Attention Rotation Network (MFIAR-Net) is proposed for aerial images with oriented bounding boxes (OBBs), which promotes the integration of the inherent multi-scale pyramid features to generate a discriminative feature map. Meanwhile, a double-path feature attention network, supervised by the mask information of the ground truth, is introduced to guide the network to focus on object regions and suppress irrelevant noise. To boost rotation regression and classification performance, we present a robust Rotation Detection Network, which generates an efficient OBB representation. Extensive experiments and comprehensive evaluations on two publicly available datasets demonstrate the effectiveness of the proposed framework.
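An OBB is typically parameterized as a center, a size, and a rotation angle; converting it to corner points is a small, well-defined computation. A sketch under an assumed (center, width, height, angle-in-radians) convention, which may differ from the paper's exact parameterization:

```python
import math

def obb_corners(cx, cy, w, h, angle):
    """Corner points of an oriented bounding box given its center,
    size, and counterclockwise rotation in radians."""
    c, s = math.cos(angle), math.sin(angle)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each half-extent offset, then translate by the center.
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]
```

For strip-like objects such as ships, these rotated corners hug the object far more tightly than the axis-aligned box enclosing them, which is why OBB methods reduce redundant detection regions.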


2019 ◽  
Vol 11 (5) ◽  
pp. 594 ◽  
Author(s):  
Shuo Zhuang ◽  
Ping Wang ◽  
Boran Jiang ◽  
Gang Wang ◽  
Cong Wang

With the rapid advances in remote-sensing technologies and the growing number of satellite images, fast and effective object detection plays an important role in understanding and analyzing image information, with applications in civilian and military fields. Recently, object detection methods based on region-based convolutional neural networks have shown excellent performance. However, these two-stage methods contain region proposal generation and object detection procedures, resulting in low computation speed. Moreover, because of expensive manual annotation costs, well-annotated aerial images are scarce, which also limits progress in geospatial object detection for remote sensing. In this paper, on the one hand, we construct and release a large-scale remote-sensing dataset for geospatial object detection (RSD-GOD) that consists of 5 different categories with 18,187 annotated images and 40,990 instances. On the other hand, we design a single-shot detection framework with multi-scale feature fusion. The feature maps from different layers are fused together through up-sampling and concatenation blocks to predict the detection results. High-level features with semantic information and low-level features with fine details are fully exploited for detection, especially for small objects. Meanwhile, a soft non-maximum suppression strategy is applied to select the final detection results. Extensive experiments have been conducted on two datasets to evaluate the designed network. Results show that the proposed approach achieves good detection performance, obtaining a mean average precision of 89.0% on the newly constructed RSD-GOD dataset and 83.8% on the Northwestern Polytechnical University very high spatial resolution-10 (NWPU VHR-10) dataset at 18 frames per second (FPS) on an NVIDIA GTX-1080Ti GPU.
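The soft non-maximum suppression strategy mentioned above decays the scores of overlapping boxes instead of discarding them outright, which helps in dense scenes. A sketch of the Gaussian variant; `sigma` and the score threshold are conventional defaults, not necessarily the paper's settings:

```python
import math

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS over (x1, y1, x2, y2) boxes: repeatedly keep
    the highest-scoring box and decay the scores of the rest by their
    overlap with it."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    cand = list(zip(boxes, scores))
    keep = []
    while cand:
        i = max(range(len(cand)), key=lambda j: cand[j][1])
        box, score = cand.pop(i)
        if score < score_thresh:
            break
        keep.append((box, score))
        # Gaussian decay: heavy overlap -> strong score penalty.
        cand = [(b, s * math.exp(-iou(box, b) ** 2 / sigma)) for b, s in cand]
    return keep
```

Unlike hard NMS, a heavily overlapping box survives with a reduced score, so two genuinely adjacent objects (common among densely packed aerial targets) are not collapsed into one detection.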


Electronics ◽  
2020 ◽  
Vol 9 (3) ◽  
pp. 537 ◽  
Author(s):  
Liquan Zhao ◽  
Shuaiyang Li

The ‘You Only Look Once’ v3 (YOLOv3) method is among the most widely used deep learning-based object detection methods. It uses the k-means clustering method to estimate the initial widths and heights of the predicted bounding boxes. With this method, the estimated widths and heights are sensitive to the initial cluster centers, and processing large-scale datasets is time-consuming. To address these problems, a new clustering method for estimating the initial widths and heights of the predicted bounding boxes has been developed. Firstly, it randomly selects one width-height pair from the ground-truth boxes as an initial cluster center. Secondly, it constructs Markov chains based on the selected initial cluster and uses the final point of every Markov chain as another initial center. In the construction of the Markov chains, the intersection-over-union measure is used to compute the distance between the selected initial clusters and each candidate point, instead of the Euclidean (square-root) distance. Finally, the method can continually update the cluster centers with each new set of width and height values, which are only a part of the data selected from the datasets. Our simulation results show that the new method converges faster when initializing the widths and heights of the predicted bounding boxes and selects more representative initial widths and heights. Our proposed method achieves better performance than the YOLOv3 method in terms of recall, mean average precision, and F1-score.
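The intersection-over-union distance used during clustering compares only box shapes, with both boxes anchored at a common corner. A sketch of that distance as commonly used for anchor clustering; the paper's Markov-chain construction around it is omitted:

```python
def anchor_iou(box, centroid):
    """IoU between two (w, h) pairs placed at a shared corner, so only
    shape matters, not position."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def anchor_distance(box, centroid):
    """Clustering distance: 1 - IoU, so identically shaped boxes are
    at distance zero."""
    return 1.0 - anchor_iou(box, centroid)
```

This distance is scale-aware in a way plain Euclidean distance on (w, h) is not: a 10x10 box is as far from a 20x20 box as a 100x100 box is from a 200x200 box, since the ratios match.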


Electronics ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 898
Author(s):  
Song Zou ◽  
Weidong Min ◽  
Lingfeng Liu ◽  
Qi Wang ◽  
Xiang Zhou

Unlike most existing neural network-based fall detection methods, which only detect falls in the temporal dimension, the algorithm proposed in this paper detects falls in both the spatial and temporal dimensions. A movement tube detection network integrating a 3D CNN and an object detection framework such as SSD is proposed to detect human falls with constrained movement tubes. The constrained movement tube, which encapsulates the person with a sequence of bounding boxes, has the merits of enclosing the person closely and avoiding peripheral interference. A 3D convolutional neural network encodes the motion and appearance features of a video clip, which are fed into the tube anchor generation layer, a softmax classification layer, and a movement tube regression layer. The movement tube regression layer fine-tunes the tube anchors into the constrained movement tubes. A large-scale spatio-temporal (LSST) fall dataset is constructed from self-collected data to evaluate fall detection in both spatial and temporal dimensions. LSST has three characteristics: large scale, annotation, and diversity in postures and viewpoints. Furthermore, comparative experiments on a public dataset demonstrate that the proposed algorithm achieves a sensitivity, specificity, and accuracy of 100%, 97.04%, and 97.23%, respectively, outperforming existing methods.
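Evaluating detections in both spatial and temporal dimensions typically means scoring the overlap between a predicted tube and a ground-truth tube. A sketch of a spatio-temporal tube IoU as the mean per-frame box IoU; this is an assumed metric form for illustration, not necessarily the paper's:

```python
def tube_iou(tube_a, tube_b):
    """Mean per-frame IoU between two equal-length movement tubes,
    each a sequence of (x1, y1, x2, y2) boxes."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter) if inter else 0.0
    return sum(iou(a, b) for a, b in zip(tube_a, tube_b)) / len(tube_a)
```

A prediction that localizes the person well in every frame scores near 1, while one that drifts spatially is penalized even if its temporal extent is correct, which is exactly what spatial-plus-temporal evaluation is meant to capture.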


2021 ◽  
Vol 2021 ◽  
pp. 1-6
Author(s):  
Yihan Bian ◽  
Xinchen Tang

With the rapid growth of video surveillance data, there is an increasing demand for automatic anomaly detection in large-scale video data. Detection methods using reconstruction errors based on deep autoencoders have been widely discussed. However, an autoencoder can sometimes reconstruct anomalies well, leading to missed detections. To solve this problem, this paper uses a memory module to enhance the autoencoder, called the memory-augmented autoencoder (Memory AE) method. Given an input, Memory AE first obtains the code from the encoder and then uses it as a query to retrieve the most relevant memory items for reconstruction. In the training phase, the memory content is updated and encouraged to represent the prototypical elements of normal data. In the test phase, the learned memory elements are fixed, and the reconstruction is obtained from a few selected memory records of normal data, so the reconstruction tends to be close to normal samples. The reconstruction error on abnormal inputs is therefore enlarged, which strengthens anomaly detection. Experimental results on two public video anomaly detection datasets, i.e., the Avenue dataset and the ShanghaiTech dataset, prove the effectiveness of the proposed method.
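The memory read step can be sketched as attention over memory items: similarities between the query code and each memory slot become softmax weights, and the retrieved code is their weighted sum. Cosine similarity is an assumption here, and the sparsification of small weights used in memory-augmented autoencoders is omitted:

```python
import math

def memory_read(query, memory):
    """Soft-attention read over a list of memory item vectors: softmax
    over cosine similarities, then a weighted sum of the items."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    def norm(a):
        return math.sqrt(dot(a, a)) or 1.0
    sims = [dot(query, m) / (norm(query) * norm(m)) for m in memory]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Retrieved code lies in the span of the stored (normal) items.
    return [sum(w * m[d] for w, m in zip(weights, memory))
            for d in range(len(query))]
```

Because the output is always a combination of stored normal prototypes, an anomalous query cannot be reproduced exactly, which is what inflates its reconstruction error at test time.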

