State-of-the-Art Model for Music Object Recognition with Deep Learning

Zhiqing Huang; Xiang Jia; Yifan Guo

doi:10.3390/app9132645


Applied Sciences ◽

10.3390/app9132645 ◽

2019 ◽

Vol 9 (13) ◽

pp. 2645 ◽

Cited By ~ 4

Author(s):

Zhiqing Huang ◽

Xiang Jia ◽

Yifan Guo

Keyword(s):

Semantic Information ◽

Feature Fusion ◽

State Of The Art ◽

General Music ◽

Pitch Accuracy ◽

Detection Model ◽

The Core ◽

Music Score ◽

Music Recognition ◽

Music Information

Optical music recognition (OMR) is an area of music information retrieval, and music object detection is a key part of the OMR pipeline. Notes record pitch and duration and therefore carry semantic information, which makes note recognition the core of music score recognition. This paper proposes an end-to-end detection model based on a deep convolutional neural network and feature fusion. The model directly processes the entire image and outputs the symbol categories together with the pitch and duration of notes. We present a state-of-the-art recognition model for general music symbols that achieves a duration accuracy of 0.92 and a pitch accuracy of 0.96.


MFF-Net: Deepfake Detection Network Based on Multi-Feature Fusion

Entropy ◽

10.3390/e23121692 ◽

2021 ◽

Vol 23 (12) ◽

pp. 1692

Author(s):

Lei Zhao ◽

Mingcheng Zhang ◽

Hongwei Ding ◽

Xiaohui Cui

Keyword(s):

Feature Extraction ◽

Semantic Information ◽

Feature Fusion ◽

State Of The Art ◽

Detection Methods ◽

Textural Features ◽

Detection Technology ◽

Rgb Images ◽

Signal Processing Methods ◽

Made In

Significant progress has been made in generating counterfeit images and videos. Forged videos generated by deepfaking have been widely spread and have caused severe societal impacts, which stir up public concern about automatic deepfake detection technology. Recently, many deepfake detection methods based on forged features have been proposed. Among the popular forged features, textural features are widely used. However, most of the current texture-based detection methods extract textures directly from RGB images, ignoring the mature spectral analysis methods. Therefore, this research proposes a deepfake detection network fusing RGB features and textural information extracted by neural networks and signal processing methods, namely, MFF-Net. Specifically, it consists of four key components: (1) a feature extraction module to further extract textural and frequency information using the Gabor convolution and residual attention blocks; (2) a texture enhancement module to zoom into the subtle textural features in shallow layers; (3) an attention module to force the classifier to focus on the forged part; (4) two instances of feature fusion to firstly fuse textural features from the shallow RGB branch and feature extraction module and then to fuse the textural features and semantic information. Moreover, we further introduce a new diversity loss to force the feature extraction module to learn features of different scales and directions. The experimental results show that MFF-Net has excellent generalization and has achieved state-of-the-art performance on various deepfake datasets.


ACV-tree: A New Method for Sentence Similarity Modeling

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/575 ◽

2018 ◽

Cited By ~ 7

Author(s):

Yuquan Le ◽

Zhi-Jie Wang ◽

Zhe Quan ◽

Jiawei He ◽

Bin Yao

Keyword(s):

Language Processing ◽

Semantic Information ◽

State Of The Art ◽

Word Embeddings ◽

The Core ◽

Tree Kernel ◽

Network Methods ◽

Syntactic Information ◽

Sentence Similarity ◽

Attention Weight

Sentence similarity modeling lies at the core of many natural language processing applications, and thus has received much attention. Owing to the success of word embeddings, popular neural network methods have recently achieved sentence embedding with attractive performance. Nevertheless, most of them focus on learning semantic information and modeling it as a continuous vector, while the syntactic information of sentences has not been fully exploited. On the other hand, prior works have shown the benefits of structured trees that include syntactic information, but few methods in this branch exploit the advantages of word embeddings and another powerful technique, the attention weight mechanism. This paper makes the first attempt to combine their advantages by merging these techniques in a unified structure, dubbed ACV-tree. In addition, this paper develops a new tree kernel, known as the ACVT kernel, that is tailored to sentence similarity measurement based on the proposed structure. The experimental results, based on 19 widely used datasets, demonstrate that our model is effective and competitive compared with state-of-the-art models.


Lightweight Attention Pyramid Network for Object Detection and Instance Segmentation

Applied Sciences ◽

10.3390/app10030883 ◽

2020 ◽

Vol 10 (3) ◽

pp. 883 ◽

Cited By ~ 5

Author(s):

Jiwei Zhang ◽

Yanyu Yan ◽

Zelei Cheng ◽

Wendong Wang

Keyword(s):

Object Detection ◽

Semantic Information ◽

Target Location ◽

Feature Fusion ◽

State Of The Art ◽

Detection Accuracy ◽

Low Level ◽

Feature Attention ◽

High Level ◽

Bottom To Top

Feature pyramids of convolutional neural networks (ConvNets), built from bottom to top, are used in most recent work to improve object detection accuracy, but such work seldom addresses the correlation between feature channels or the fusion of low-level and high-level features. In this paper, an Attention Pyramid Network (APN) is proposed, which mainly contains an adaptive transformation module and a feature attention block. The adaptive transformation module uses multiscale feature fusion and makes full use of the accurate target location information of low-level features and the semantic information of high-level features. The feature attention block then strengthens the features of important channels and weakens the features of unimportant channels through learning. By implementing the APN in a basic Mask R-CNN system, our method achieves state-of-the-art results on the MS COCO dataset and the 2018 WAD database without bells and whistles. In addition, the structure of the APN keeps the number of network parameters small, and the APN runs in 4 ms on average, which is negligible compared to the inference time of the ConvNet backbone.
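The channel re-weighting idea in this abstract can be illustrated with a minimal sketch. It is not the APN feature attention block itself: it assumes a squeeze-and-excitation-style gate (global average pooling followed by a sigmoid), and the names `channel_attention`, `gate_weights`, and `gate_bias` are hypothetical stand-ins for parameters that would normally be learned.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feature_map, gate_weights, gate_bias):
    """Rescale each channel by a gate computed from all channel means.

    feature_map:  list of channels, each a list of rows of floats.
    gate_weights: one weight row per channel (hypothetical learned values).
    gate_bias:    one bias per channel (hypothetical learned values).
    Returns (rescaled feature map, gates).
    """
    # Squeeze: summarize every channel by its spatial mean.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excite: one gate in (0, 1) per channel.
    gates = [
        sigmoid(sum(w * p for w, p in zip(weight_row, pooled)) + bias)
        for weight_row, bias in zip(gate_weights, gate_bias)
    ]
    # Rescale: gates near 1 keep a channel, gates near 0 suppress it.
    rescaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return rescaled, gates


# Two 1x2 channels; the assumed weights favor channel 0 over channel 1.
features = [[[1.0, 3.0]], [[2.0, 2.0]]]
scaled, gates = channel_attention(
    features, gate_weights=[[1.0, 0.0], [-1.0, 0.0]], gate_bias=[0.0, 0.0]
)
```

In a trained network the gate parameters would be fitted by backpropagation; here they are fixed by hand only to show one channel being strengthened and the other weakened.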


Representing Deep Neural Networks Latent Space Geometries with Graphs

Algorithms ◽

10.3390/a14020039 ◽

2021 ◽

Vol 14 (2) ◽

pp. 39

Author(s):

Carlos Lassance ◽

Vincent Gripon ◽

Antonio Ortega

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Deep Learning ◽

Objective Function ◽

Learning Process ◽

Deep Neural Networks ◽

State Of The Art ◽

The Core ◽

Learning Tasks ◽

Latent Space

Deep Learning (DL) has attracted a lot of attention for its ability to reach state-of-the-art performance in many machine learning tasks. The core principle of DL methods consists of training composite architectures in an end-to-end fashion, where inputs are associated with outputs trained to optimize an objective function. Because of their compositional nature, DL architectures naturally exhibit several intermediate representations of the inputs, which belong to so-called latent spaces. When treated individually, these intermediate representations are most of the time unconstrained during the learning process, as it is unclear which properties should be favored. However, when processing a batch of inputs concurrently, the corresponding set of intermediate representations exhibit relations (what we call a geometry) on which desired properties can be sought. In this work, we show that it is possible to introduce constraints on these latent geometries to address various problems. In more detail, we propose to represent geometries by constructing similarity graphs from the intermediate representations obtained when processing a batch of inputs. By constraining these Latent Geometry Graphs (LGGs), we address the three following problems: (i) reproducing the behavior of a teacher architecture is achieved by mimicking its geometry, (ii) designing efficient embeddings for classification is achieved by targeting specific geometries, and (iii) robustness to deviations on inputs is achieved via enforcing smooth variation of geometry between consecutive latent spaces. Using standard vision benchmarks, we demonstrate the ability of the proposed geometry-based methods in solving the considered problems.
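As a rough illustration of building a similarity graph from the intermediate representations of a batch, the sketch below uses cosine similarity as the edge weight and a sum of squared differences to compare two graphs. Both choices are assumptions made for illustration; the abstract does not state which similarity or distance the authors use, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(batch):
    """Weighted adjacency matrix over the representations of one batch.

    batch: list of vectors, one intermediate representation per input.
    Edge (i, j) holds the cosine similarity between inputs i and j;
    the diagonal is left at zero (no self-loops).
    """
    n = len(batch)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weight = cosine_similarity(batch[i], batch[j])
            adjacency[i][j] = adjacency[j][i] = weight
    return adjacency


def graph_distance(adj_a, adj_b):
    """Sum of squared edge-weight differences between two graphs."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(adj_a, adj_b)
        for a, b in zip(row_a, row_b)
    )


# Hypothetical teacher and student representations for the same three inputs.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[2.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
teacher_graph = similarity_graph(teacher)
student_graph = similarity_graph(student)
mismatch = graph_distance(teacher_graph, student_graph)
```

Because the graph only encodes relations between inputs, the student above matches the teacher's geometry even though its vectors are scaled differently, so a penalty of this kind constrains relations within the batch rather than the individual representations.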


A Multi-Branch Feature Fusion Strategy Based on an Attention Mechanism for Remote Sensing Image Scene Classification

Remote Sensing ◽

10.3390/rs13101950 ◽

2021 ◽

Vol 13 (10) ◽

pp. 1950

Author(s):

Cuiping Shi ◽

Xin Zhao ◽

Liguo Wang

Keyword(s):

Remote Sensing ◽

Feature Extraction ◽

Classification Accuracy ◽

Feature Fusion ◽

State Of The Art ◽

Rapid Development ◽

Remote Sensing Image ◽

Classification Performance ◽

Attention Mechanism ◽

Scene Classification

In recent years, with the rapid development of computer vision, increasing attention has been paid to remote sensing image scene classification. To improve the classification performance, many studies have increased the depth of convolutional neural networks (CNNs) and expanded the width of the network to extract more deep features, thereby increasing the complexity of the model. To solve this problem, in this paper, we propose a lightweight convolutional neural network based on attention-oriented multi-branch feature fusion (AMB-CNN) for remote sensing image scene classification. Firstly, we propose two convolution combination modules for feature extraction, through which the deep features of images can be fully extracted with multi convolution cooperation. Then, the weights of the feature are calculated, and the extracted deep features are sent to the attention mechanism for further feature extraction. Next, all of the extracted features are fused by multiple branches. Finally, depth separable convolution and asymmetric convolution are implemented to greatly reduce the number of parameters. The experimental results show that, compared with some state-of-the-art methods, the proposed method still has a great advantage in classification accuracy with very few parameters.


Topological Frontier-Based Exploration and Map-Building Using Semantic Information

Sensors ◽

10.3390/s19204595 ◽

2019 ◽

Vol 19 (20) ◽

pp. 4595 ◽

Cited By ~ 2

Author(s):

Clara Gomez ◽

Alejandra C. Hernandez ◽

Ramon Barber

Keyword(s):

Open Space ◽

Semantic Information ◽

State Of The Art ◽

Fundamental Problem ◽

Cost Utility ◽

Indoor Environments ◽

Map Building ◽

Topological Map ◽

Unknown Area ◽

Closure Algorithm

Exploration of unknown environments is a fundamental problem in autonomous robotics that deals with the complexity of autonomously traversing an unknown area while acquiring the most important information about the environment. In this work, a mobile robot exploration algorithm for indoor environments is proposed. It combines frontier-based concepts with behavior-based strategies in order to build a topological representation of the environment. Frontier-based approaches assume that, to gain the most information about an environment, the robot has to move to the regions on the boundary between open space and unexplored space. The novelty of this work lies in the semantic frontier classification and in frontier selection according to a cost–utility function. In addition, a probabilistic loop closure algorithm is proposed to solve cyclic situations. The system outputs a topological map of the free areas of the environment for further navigation. Finally, simulated and real-world experiments have been carried out. Their results, and the comparison with other state-of-the-art algorithms, show the feasibility of the proposed exploration algorithm and the improvement that it offers with regard to execution time and travelled distance.


DCPNet: A Densely Connected Pyramid Network for Monocular Depth Estimation

Sensors ◽

10.3390/s21206780 ◽

2021 ◽

Vol 21 (20) ◽

pp. 6780

Author(s):

Zhitong Lai ◽

Rui Tian ◽

Zhiguo Wu ◽

Nannan Ding ◽

Linjian Sun ◽

...

Keyword(s):

Multiple Scales ◽

Feature Fusion ◽

State Of The Art ◽

Depth Estimation ◽

Multi Scale ◽

Pyramid Structure ◽

Benchmark Datasets ◽

The Common ◽

Monocular Depth ◽

Multiple Stages

Pyramid architecture is a useful strategy to fuse multi-scale features in deep monocular depth estimation approaches. However, most pyramid networks fuse features only between adjacent stages of the pyramid structure. To take full advantage of the pyramid structure, and inspired by the success of DenseNet, this paper presents DCPNet, a densely connected pyramid network that fuses multi-scale features from multiple stages of the pyramid structure. DCPNet performs feature fusion not only between adjacent stages but also between non-adjacent stages. To fuse these features, we design a simple and effective dense connection module (DCM). In addition, we offer a new consideration of the common upscale operation in our approach. We believe DCPNet offers a more efficient way to fuse features from multiple scales in a pyramid-like network. We perform extensive experiments using both outdoor and indoor benchmark datasets (i.e., the KITTI and the NYU Depth V2 datasets), and DCPNet achieves state-of-the-art results.


ConAnomaly: Content-Based Anomaly Detection for System Logs

Sensors ◽

10.3390/s21186125 ◽

2021 ◽

Vol 21 (18) ◽

pp. 6125

Author(s):

Dan Lv ◽

Nurbol Luktarhan ◽

Yiyong Chen

Keyword(s):

Anomaly Detection ◽

Semantic Information ◽

Short Term Memory ◽

Weighted Average ◽

Detection Methods ◽

Detection Model ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

System Logs ◽

System Maintenance

Enterprise systems typically produce a large number of logs to record runtime states and important events. Log anomaly detection is useful for business management and system maintenance. Most existing log-based anomaly detection methods use a log parser to get log event indexes or event templates and then apply machine learning methods to detect anomalies. However, these methods cannot handle unknown log types and do not take advantage of the semantic information in logs. In this article, we propose ConAnomaly, a log-based anomaly detection model composed of a log sequence encoder (log2vec) and a multi-layer Long Short-Term Memory (LSTM) network. We designed log2vec based on the Word2vec model: it first vectorizes the words in the log content, then deletes the invalid words through part-of-speech tagging, and finally obtains the sequence vector by the weighted average method. In this way, ConAnomaly not only captures semantic information in the log but also leverages log sequential relationships. We evaluate our proposed approach on two log datasets. Our experimental results show that ConAnomaly has good stability and can deal with unseen log types to a certain extent, and that it provides better performance than most log-based anomaly detection methods.
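The three log2vec steps described above (vectorize words, drop invalid words, take a weighted average) can be sketched as follows. This is only an illustration under stated assumptions: a plain set of tokens (`invalid_tokens`) stands in for the part-of-speech filter, the embeddings and weights are made-up toy values, and the function name `log_line_vector` is hypothetical. The abstract does not specify how the weights are computed.

```python
def log_line_vector(tokens, embeddings, weights, invalid_tokens=frozenset()):
    """Weighted average of word vectors for one log line.

    tokens:         words of the log line, in order.
    embeddings:     dict mapping word -> vector (e.g. from a Word2vec model).
    weights:        dict mapping word -> weight; missing words default to 1.0.
    invalid_tokens: words to drop (stand-in for a part-of-speech filter).
    Returns the averaged vector, or None if no usable word remains.
    """
    kept = [t for t in tokens if t in embeddings and t not in invalid_tokens]
    if not kept:
        return None

    total_weight = sum(weights.get(t, 1.0) for t in kept)
    if total_weight == 0.0:
        return None

    dimension = len(embeddings[kept[0]])
    vector = [0.0] * dimension
    for token in kept:
        weight = weights.get(token, 1.0)
        for d in range(dimension):
            vector[d] += weight * embeddings[token][d]
    return [component / total_weight for component in vector]


# Toy two-dimensional embeddings and weights (illustrative values only).
toy_embeddings = {"disk": [1.0, 0.0], "failure": [0.0, 1.0], "the": [5.0, 5.0]}
toy_weights = {"disk": 1.0, "failure": 3.0}
line_vector = log_line_vector(
    ["the", "disk", "failure"], toy_embeddings, toy_weights, invalid_tokens={"the"}
)
```

Words missing from the embedding table are skipped rather than raising an error, which is one simple way to tolerate unseen vocabulary; the resulting vectors for consecutive log lines would then form the sequence fed to the LSTM.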


MR-InpaintNet: Toward Deep Multi-Resolution Learning for Progressive Image Inpainting

10.36227/techrxiv.16641241 ◽

2021 ◽

Author(s):

Huan Zhang ◽

Zhao Zhang ◽

Haijun Zhang ◽

Yi Yang ◽

Shuicheng Yan ◽

...

Keyword(s):

Deep Learning ◽

High Resolution ◽

Semantic Information ◽

Feature Fusion ◽

Image Inpainting ◽

Feature Learning ◽

Low Resolution ◽

Resolution Image ◽

Texture Information ◽

Multiple Resolutions

Deep learning-based image inpainting methods have greatly improved performance owing to the powerful representation ability of deep learning. However, current deep inpainting methods still tend to produce unreasonable structure and blurry texture, implying that image inpainting remains a challenging topic due to the ill-posed nature of the task. To address these issues, we propose a novel deep multi-resolution learning-based progressive image inpainting method, termed MR-InpaintNet, which takes damaged images of different resolutions as input and then fuses the multi-resolution features to repair the damaged images. The idea is motivated by the fact that images of different resolutions can provide different levels of feature information. Specifically, the low-resolution image provides strong semantic information and the high-resolution image offers detailed texture information. The middle-resolution image can be used to reduce the gap between the low-resolution and high-resolution images, which can further refine the inpainting result. To fuse and improve the multi-resolution features, a novel multi-resolution feature learning (MRFL) process is designed, which consists of a multi-resolution feature fusion (MRFF) module, an adaptive feature enhancement (AFE) module and a memory enhanced mechanism (MEM) module for information preservation. The refined multi-resolution features then contain both rich semantic information and detailed texture information from multiple resolutions. We further pass the refined multi-resolution features through the decoder to obtain the recovered image. Extensive experiments on the Paris Street View, Places2 and CelebA-HQ datasets demonstrate that our proposed MR-InpaintNet can effectively recover textures and structures, and performs favorably against state-of-the-art methods.


A Soft-YoloV4 for High-Performance Head Detection and Counting

Mathematics ◽

10.3390/math9233096 ◽

2021 ◽

Vol 9 (23) ◽

pp. 3096

Author(s):

Zhen Zhang ◽

Shihao Xia ◽

Yuxing Cai ◽

Cuimei Yang ◽

Shaoning Zeng

Keyword(s):

High Performance ◽

Detection Rate ◽

Detection Method ◽

State Of The Art ◽

Counting Method ◽

People Counting ◽

Detection Model ◽

Detection And Counting ◽

Head Detection ◽

Ap Value

Occlusion of pedestrians causes inaccurate people counting, and people's heads are easily occluded by each other in crowded scenes. To reduce missed detections as much as possible and improve the capability of the detection model, this paper proposes a new people counting method, named Soft-YoloV4, which attenuates the scores of adjacent detection frames instead of discarding them, in order to prevent missed detections. The proposed Soft-YoloV4 improves the accuracy of people counting and reduces the incorrect elimination of detection frames when heads are occluded by each other. Compared with the state-of-the-art YoloV4, the AP value of the proposed head detection method increases from 88.52% to 90.54%. The Soft-YoloV4 model is more robust and has a lower missed detection rate for head detection, and it therefore substantially improves the accuracy of people counting.
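The idea of attenuating, rather than deleting, the scores of overlapping detection frames can be sketched with a generic Gaussian score decay in the style of Soft-NMS. Whether Soft-YoloV4 uses exactly this decay is an assumption; the parameters `sigma` and `score_threshold` and their default values are illustrative, not taken from the paper.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian score decay: overlapping boxes are down-weighted, not removed.

    Returns a list of (box, score) pairs in the order they were selected.
    """
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Select the highest-scoring remaining box.
        best_index = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best_index)
        kept.append((best_box, best_score))

        # Decay the scores of the remaining boxes according to their overlap.
        decayed = []
        for box, score in candidates:
            overlap = iou(best_box, box)
            new_score = score * math.exp(-(overlap ** 2) / sigma)
            if new_score >= score_threshold:
                decayed.append((box, new_score))
        candidates = decayed
    return kept


# Two heavily overlapping heads and one isolated head (toy coordinates).
head_boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
head_scores = [0.9, 0.8, 0.7]
detections = soft_nms(head_boxes, head_scores)
```

With hard non-maximum suppression at a typical overlap threshold, the second box (IoU of about 0.82 with the first) would be discarded outright; here it survives with a reduced score, which is the behavior the abstract credits with lowering the missed detection rate for occluded heads.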
