Dynamic Warping Network for Semantic Video Segmentation

Complexity, DOI 10.1155/2021/6680509, 2021, Vol 2021, pp. 1-10

Author(s): Jiangyun Li, Yikai Zhao, Xingjian He, Xinxin Zhu, Jing Liu

Keyword(s): Optical Flow, Video Sequence, Video Segmentation, State Of The Art, Temporal Consistency, Feature Maps, Spatiotemporal Information, Benchmark Datasets, Additional Calculation, Dynamic Warping

A major challenge for semantic video segmentation is how to exploit the spatiotemporal information and produce consistent results for a video sequence. Many previous works utilize the precomputed optical flow to warp the feature maps across adjacent frames. However, the imprecise optical flow and the warping operation without any learnable parameters may not achieve accurate feature warping and only bring a slight improvement. In this paper, we propose a novel framework named Dynamic Warping Network (DWNet) to adaptively warp the interframe features for improving the accuracy of warping-based models. Firstly, we design a flow refinement module (FRM) to optimize the precomputed optical flow. Then, we propose a flow-guided convolution (FG-Conv) to achieve the adaptive feature warping based on the refined optical flow. Furthermore, we introduce the temporal consistency loss including the feature consistency loss and prediction consistency loss to explicitly supervise the warped features instead of simple feature propagation and fusion, which guarantees the temporal consistency of video segmentation. Note that our DWNet adopts extra constraints to improve the temporal consistency in the training phase, while no additional calculation and postprocessing are required during inference. Extensive experiments show that our DWNet can achieve consistent improvement over various strong baselines and achieves state-of-the-art accuracy on the Cityscapes and CamVid benchmark datasets.
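The parameter-free warping step that this abstract takes as its starting point can be sketched as follows. This is a minimal illustration only, not DWNet's FRM or FG-Conv: it assumes a single-channel feature map stored as nested lists, a flow field whose entry `flow[y][x] = (dx, dy)` points from a current-frame pixel to its source location in the previous frame, and zero padding outside the map. The names `bilinear_sample` and `warp` are hypothetical.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample feat[row][col] at a real-valued (x, y); positions outside the map contribute 0."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    # Visit the four integer neighbours and weight each by its closeness to (x, y).
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < w and 0 <= yy < h:
                weight = (1 - abs(x - xx)) * (1 - abs(y - yy))
                value += weight * feat[yy][xx]
    return value


def warp(prev_feat, flow):
    """Warp the previous frame's feature map to the current frame.

    Assumption: flow[y][x] = (dx, dy) is the offset from pixel (x, y) in the
    current frame to the matching location in the previous frame.
    """
    h, w = len(prev_feat), len(prev_feat[0])
    return [
        [bilinear_sample(prev_feat, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]


prev_feat = [[1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0]]
flow = [[(1.0, 0.0)] * 3 for _ in range(2)]  # every pixel looks one column to the right
warped = warp(prev_feat, flow)
```

The interpolation weights here are fixed functions of the flow, so any error in the flow passes straight into the warped features; this is the limitation the abstract attributes to warping without learnable parameters.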


Every Frame Counts: Joint Learning of Video Segmentation and Optical Flow

Proceedings of the AAAI Conference on Artificial Intelligence, DOI 10.1609/aaai.v34i07.6699, 2020, Vol 34 (07), pp. 10713-10720

Author(s): Mingyu Ding, Zhe Wang, Bolei Zhou, Jianping Shi, Zhiwu Lu, ...

Keyword(s): Optical Flow, Video Segmentation, Video Clip, Semantic Segmentation, Temporal Consistency, Flow Estimation, Optical Flow Estimation, Optical Flows, Benchmark Datasets, Spatio Temporal

A major challenge for video semantic segmentation is the lack of labeled data. In most benchmark datasets, only one frame of a video clip is annotated, which makes most supervised methods fail to utilize information from the rest of the frames. To exploit the spatio-temporal information in videos, many previous works use pre-computed optical flows, which encode the temporal consistency to improve the video segmentation. However, the video segmentation and optical flow estimation are still considered as two separate tasks. In this paper, we propose a novel framework for joint video semantic segmentation and optical flow estimation. Semantic segmentation brings semantic information to handle occlusion for more robust optical flow estimation, while the non-occluded optical flow provides accurate pixel-level temporal correspondences to guarantee the temporal consistency of the segmentation. Moreover, our framework is able to utilize both labeled and unlabeled frames in the video through joint training, while no additional calculation is required in inference. Extensive experiments show that the proposed model makes the video semantic segmentation and optical flow estimation benefit from each other and outperforms existing methods under the same settings in both tasks.
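One common way to turn "non-occluded optical flow guarantees temporal consistency" into a training signal is a masked consistency loss. The sketch below is an assumption about the general form of such a loss, not the formulation used in the paper, which the abstract does not give. It assumes per-pixel class-probability vectors, a previous-frame prediction that has already been warped to the current frame, and a binary mask marking non-occluded pixels. The name `masked_consistency_loss` is hypothetical.

```python
def masked_consistency_loss(pred_t, warped_prev, non_occluded):
    """Mean squared difference between two predictions over non-occluded pixels.

    pred_t, warped_prev: H x W grids of class-probability lists.
    non_occluded: H x W grid of 0/1 flags; pixels flagged 0 are ignored.
    """
    total, count = 0.0, 0
    for p_row, w_row, m_row in zip(pred_t, warped_prev, non_occluded):
        for p, w, m in zip(p_row, w_row, m_row):
            if m:
                total += sum((a - b) ** 2 for a, b in zip(p, w))
                count += 1
    # With no valid pixel there is nothing to penalise.
    return total / count if count else 0.0


pred_t = [[[1.0, 0.0], [0.0, 1.0]]]        # current-frame prediction, 1 x 2 pixels
warped_prev = [[[1.0, 0.0], [1.0, 0.0]]]   # previous prediction warped to the current frame
mask = [[1, 0]]                            # the second pixel is occluded
loss = masked_consistency_loss(pred_t, warped_prev, mask)
```

Because the loss only needs two predictions and a flow-derived mask, it can be evaluated on unlabeled frames, and it disappears at inference time.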


Semantics-Aligned Representation Learning for Person Re-Identification

Proceedings of the AAAI Conference on Artificial Intelligence, DOI 10.1609/aaai.v34i07.6775, 2020, Vol 34 (07), pp. 11173-11180, Cited By ~ 3

Author(s): Xin Jin, Cuiling Lan, Wenjun Zeng, Guoqiang Wei, Zhibo Chen

Keyword(s): State Of The Art, Representation Learning, The State, Feature Representation, Texture Image, Computationally Efficient, Feature Maps, Benchmark Datasets, Texture Generation, Base Network

Person re-identification (reID) aims to match person images to retrieve the ones with the same identity. This is a challenging task, as the images to be matched are generally semantically misaligned due to the diversity of human poses and capture viewpoints, incompleteness of the visible bodies (due to occlusion), etc. In this paper, we propose a framework that drives the reID network to learn semantics-aligned feature representation through delicate supervision designs. Specifically, we build a Semantics Aligning Network (SAN) which consists of a base network as encoder (SA-Enc) for reID, and a decoder (SA-Dec) for reconstructing/regressing the densely semantics-aligned full texture image. We jointly train the SAN under the supervision of person re-identification and aligned texture generation. Moreover, at the decoder, besides the reconstruction loss, we add triplet reID constraints over the feature maps as perceptual losses. The decoder is discarded at inference, and thus our scheme is computationally efficient. Ablation studies demonstrate the effectiveness of our design. We achieve state-of-the-art performance on the benchmark datasets CUHK03, Market1501, MSMT17, and the partial person reID dataset Partial REID.
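The triplet constraint mentioned in the abstract can be illustrated with the standard margin-based triplet loss. This is a generic sketch, not the SAN training code: it assumes feature vectors given as plain lists, Euclidean distance, and a margin of 0.3 chosen only for illustration.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss: the positive should be closer to the anchor than the negative by at least `margin`."""
    d_ap = math.dist(anchor, positive)   # distance to a sample of the same identity
    d_an = math.dist(anchor, negative)   # distance to a sample of a different identity
    return max(0.0, d_ap - d_an + margin)


anchor = [0.0, 0.0]
same_id = [0.0, 1.0]
other_id = [3.0, 4.0]
easy = triplet_loss(anchor, same_id, other_id)    # already separated by more than the margin
hard = triplet_loss(anchor, other_id, same_id)    # roles swapped, so the constraint is violated
```

A loss of zero means the triplet already satisfies the margin and contributes no gradient; only violating triplets drive the features apart.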


End-to-End Thorough Body Perception for Person Search

Proceedings of the AAAI Conference on Artificial Intelligence, DOI 10.1609/aaai.v34i07.6886, 2020, Vol 34 (07), pp. 12079-12086

Author(s): Kun Tian, Houjing Huang, Yun Ye, Shiyu Li, Jinbin Lin, ...

Keyword(s): Background Noise, State Of The Art, Feature Maps, Body Perception, Benchmark Datasets, Person Search, End To End, Highly Correlated, Feature Expression, Instance Segmentation

In this paper, we propose an improved end-to-end multi-branch person search network to jointly optimize person detection, re-identification, instance segmentation, and keypoint detection. First, we build a better and faster base model to extract non-highly correlated feature expression. Second, a foreground feature enhancement module is used to alleviate undesirable background noise in person feature maps. Third, we design an algorithm to learn the part-aligned representation for person search. Extensive experiments with ablation analysis show the effectiveness of our proposed end-to-end multi-task model, and we demonstrate its superiority over the state-of-the-art methods on two benchmark datasets, CUHK-SYSU and PRW.


Histopathological Classification of Breast Cancer Images Using a Multi-Scale Input and Multi-Feature Network

Cancers, DOI 10.3390/cancers12082031, 2020, Vol 12 (8), pp. 2031, Cited By ~ 2

Author(s): Taimoor Shakeel Sheikh, Yonghee Lee, Migyung Cho

Keyword(s): State Of The Art, Texture Features, Feature Maps, Histopathological Classification, Multi Scale, Machine Learning Methods, Proposed Model, Benchmark Datasets, Histopathological Images

Diagnosis of pathologies using histopathological images can be time-consuming when many images with different magnification levels need to be analyzed. State-of-the-art computer vision and machine learning methods can help automate the diagnostic pathology workflow and thus reduce the analysis time. Automated systems can also be more efficient and accurate, and can increase the objectivity of diagnosis by reducing operator variability. We propose a multi-scale input and multi-feature network (MSI-MFNet) model, which can learn the overall structures and texture features of different scale tissues by fusing multi-resolution hierarchical feature maps from the network’s dense connectivity structure. The MSI-MFNet predicts the probability of a disease on the patch and image levels. We evaluated the performance of our proposed model on two public benchmark datasets. Furthermore, through ablation studies of the model, we found that multi-scale input and multi-feature maps play an important role in improving the performance of the model. Our proposed model outperformed the existing state-of-the-art models by demonstrating better accuracy, sensitivity, and specificity.


Lane and Road Marker Semantic Video Segmentation Using Mask Cropping and Optical Flow Estimation

Sensors, DOI 10.3390/s21217156, 2021, Vol 21 (21), pp. 7156

Author(s): Guansheng Xing, Ziming Zhu

Keyword(s): Optical Flow, Video Segmentation, Learning Algorithm, Time Consistency, Autonomous Driving, Target Area, Temporal Consistency, Current Frame, The Past, Single Output

Lane and road marker segmentation is crucial in autonomous driving, and many related methods have been proposed in this field. However, most of them are based on single-frame prediction, which causes unstable results between frames. Some semantic multi-frame segmentation methods produce error accumulation and are not fast enough. Therefore, we propose a deep learning algorithm that takes into account the continuity information of adjacent image frames, including image sequence processing and an end-to-end trainable multi-input single-output network to jointly process the segmentation of lanes and road markers. In order to emphasize the location of the target with high probability in the adjacent frames and to refine the segmentation result of the current frame, we explicitly consider the time consistency between frames, expand the segmentation region of the previous frame, and use the optical flow of the adjacent frames to reverse the past prediction. The result is then used as an additional input of the network in training and inference, thereby improving the network’s attention to the target area of the past frame. We segmented lanes and road markers on the Baidu Apolloscape lanemark segmentation dataset and CULane dataset, and present benchmarks for different networks. The experimental results show that this method speeds up lane and road marker video segmentation by 2.5 times, increases accuracy by 1.4%, and reduces temporal consistency by only 2.2% at most.


Multi-Scale Dense Attention Network for Stereo Matching

Electronics, DOI 10.3390/electronics9111881, 2020, Vol 9 (11), pp. 1881

Author(s): Yuhui Chang, Jiangtao Xu, Zhiyuan Gao

Keyword(s): Feature Extraction, Stereo Matching, State Of The Art, Ground Truth, Context Information, Context Aware, Feature Maps, Attention Network, Multi Scale, Benchmark Datasets

To improve the accuracy of stereo matching, the multi-scale dense attention network (MDA-Net) is proposed. The network introduces two novel modules in the feature extraction stage to better exploit context information: the dual-path upsampling (DU) block and the attention-guided context-aware pyramid feature extraction (ACPFE) block. The DU block is introduced to fuse feature maps of different scales. It uses sub-pixel convolution to compensate for the loss of information caused by the traditional interpolation upsampling method. The ACPFE block is proposed to extract multi-scale context information. Pyramid atrous convolution is adopted to exploit multi-scale features, and channel attention is used to fuse them. The proposed network has been evaluated on several benchmark datasets. The three-pixel error evaluated over all ground truth pixels is 2.10% on the KITTI 2015 dataset. The experimental results show that MDA-Net achieves state-of-the-art accuracy on the KITTI 2012 and 2015 datasets.
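The rearrangement at the heart of sub-pixel convolution (often called pixel shuffle) can be sketched in a few lines. This is an illustration of the general technique, not the DU block itself: it omits the convolution that would normally produce the input channels, and it assumes the common channel ordering in which input channel `c*r*r + i*r + j` supplies the output pixel at sub-position `(i, j)`. The name `pixel_shuffle` is used here for a plain-Python helper.

```python
def pixel_shuffle(feat, r):
    """Rearrange C*r*r channels of size H x W into C channels of size (H*r) x (W*r).

    feat: list of channels, each an H x W nested list.
    r: integer upscaling factor.
    """
    c_in = len(feat)
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    h, w = len(feat[0]), len(feat[0][0])
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r output cell
            for j in range(r):      # column offset inside each r x r output cell
                src = feat[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Four 1 x 1 channels become one 2 x 2 channel.
upsampled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], r=2)
```

Every output pixel is copied from a learned channel rather than interpolated from neighbours, which is how this scheme avoids the information loss the abstract attributes to interpolation-based upsampling.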


Progressive Feature Polishing Network for Salient Object Detection

Proceedings of the AAAI Conference on Artificial Intelligence, DOI 10.1609/aaai.v34i07.6892, 2020, Vol 34 (07), pp. 12128-12135, Cited By ~ 1

Author(s): Bo Wang, Quan Chen, Min Zhou, Zhiqiang Zhang, Xiaogang Jin, ...

Keyword(s): Object Detection, State Of The Art, Hierarchical Structures, Salient Object Detection, Salient Object, Post Processing, Feature Maps, Multiple Feature, Benchmark Datasets, Multi Level

Features matter for salient object detection. Existing methods mainly focus on designing a sophisticated structure to incorporate multi-level features and filter out cluttered features. We present Progressive Feature Polishing Network (PFPN), a simple yet effective framework to progressively polish the multi-level features to be more accurate and representative. By employing multiple Feature Polishing Modules (FPMs) in a recurrent manner, our approach is able to detect salient objects with fine details without any post-processing. An FPM updates the features of each level in parallel by directly incorporating all higher-level context information. Moreover, it keeps the dimensions and hierarchical structures of the feature maps, which makes it flexible to integrate with any CNN-based model. Empirical experiments show that our results improve monotonically as the number of FPMs increases. Without bells and whistles, PFPN outperforms the state-of-the-art methods significantly on five benchmark datasets under various evaluation metrics. Our code is available at: https://github.com/chenquan-cq/PFPN.


Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries

Proceedings of the AAAI Conference on Artificial Intelligence, DOI 10.1609/aaai.v34i07.6895, 2020, Vol 34 (07), pp. 12152-12159

Author(s): Hao Wang, Cheng Deng, Fan Ma, Yi Yang

Keyword(s): Video Segmentation, State Of The Art, Dynamic Networks, Specific Region, Visual Features, Convolutional Network, Fine Grained, Convolutional Networks, Benchmark Datasets, Context Features

Actor and action video segmentation with language queries aims to segment the objects in the video that the query expression refers to. This process requires comprehensive language reasoning and fine-grained video understanding. Previous methods mainly leverage dynamic convolutional networks to match visual and semantic representations. However, dynamic convolution neglects spatial context when processing each region in the frame and thus struggles to segment similar objects in complex scenarios. To address this limitation, we construct a context modulated dynamic convolutional network. Specifically, we propose a context modulated dynamic convolutional operation in the proposed framework. The kernels for a specific region are generated from both language sentences and surrounding context features. Moreover, we devise a temporal encoder to incorporate motions into the visual features to further match the query descriptions. Extensive experiments on two benchmark datasets, Actor-Action Dataset Sentences (A2D Sentences) and J-HMDB Sentences, demonstrate that our proposed approach notably outperforms state-of-the-art methods.


BiLabel-Specific Features for Multi-Label Classification

ACM Transactions on Knowledge Discovery from Data, DOI 10.1145/3458283, 2021, Vol 16 (1), pp. 1-23

Author(s): Min-Ling Zhang, Jun-Peng Fang, Yi-Bo Wang

Keyword(s): Predictive Models, Comparative Studies, State Of The Art, Classification Model, Generation Process, Prototype Selection, Class Label, Benchmark Datasets, Label Correlations, Class Labels

In multi-label classification, the task is to induce predictive models which can assign a set of relevant labels to an unseen instance. The strategy of label-specific features has been widely employed in learning from multi-label examples, where the classification model for predicting the relevancy of each class label is induced based on its tailored features rather than the original features. Existing approaches work by generating a group of tailored features for each class label independently, so label correlations are not fully considered in the label-specific feature generation process. In this article, we extend the existing strategy by proposing a simple yet effective approach based on BiLabel-specific features. Specifically, a group of tailored features is generated for a pair of class labels with heuristic prototype selection and embedding. Thereafter, predictions of classifiers induced by BiLabel-specific features are ensembled to determine the relevancy of each class label for an unseen instance. To thoroughly evaluate the BiLabel-specific features strategy, extensive experiments are conducted over a total of 35 benchmark datasets. Comparative studies against state-of-the-art label-specific features techniques clearly validate the superiority of utilizing BiLabel-specific features to yield stronger generalization performance for multi-label classification.


Real-Time Environment Monitoring Using a Lightweight Image Super-Resolution Network

International Journal of Environmental Research and Public Health, DOI 10.3390/ijerph18115890, 2021, Vol 18 (11), pp. 5890

Author(s): Qiang Yu, Feiqiang Liu, Long Xiao, Zitao Liu, Xiaomin Yang

Keyword(s): Deep Learning, Real Time, Super Resolution, Model Complexity, Practical Application, Single Image, Feature Maps, Benchmark Datasets, Image Super Resolution, Single Image Super Resolution

Deep-learning (DL)-based methods are of growing importance in the field of single image super-resolution (SISR). The practical application of these DL-based models remains a problem because they require heavy computation and large storage resources. The powerful feature maps of hidden layers in convolutional neural networks (CNNs) help the model learn useful information. However, there exists redundancy among feature maps, which can be further exploited. To address these issues, this paper proposes a lightweight efficient feature generating network (EFGN) for SISR by constructing the efficient feature generating block (EFGB). Specifically, the EFGB conducts plain operations on the original features to produce more feature maps with only a slight increase in parameters. With the help of these extra feature maps, the network can extract more useful information from low-resolution (LR) images to reconstruct the desired high-resolution (HR) images. Experiments conducted on the benchmark datasets demonstrate that the proposed EFGN outperforms other deep-learning-based methods in most cases and has relatively lower model complexity. Additionally, the running time measurement indicates the feasibility of real-time monitoring.
