scholarly journals A Deeper Look at Sheet Music Composer Classification Using Self-Supervised Pretraining

2021 ◽  
Vol 11 (4) ◽  
pp. 1387
Author(s):  
Daniel Yang ◽  
Kevin Ji ◽  
TJ Tsai

This article studies a composer style classification task based on raw sheet music images. While previous works on composer recognition have relied exclusively on supervised learning, we explore the use of self-supervised pretraining methods that have been recently developed for natural language processing. We first convert sheet music images to sequences of musical words, train a language model on a large set of unlabeled musical “sentences”, initialize a classifier with the pretrained language model weights, and then finetune the classifier on a small set of labeled data. We conduct extensive experiments on International Music Score Library Project (IMSLP) piano data using a range of modern language model architectures. We show that pretraining substantially improves classification performance and that Transformer-based architectures perform best. We also introduce two data augmentation strategies and present evidence that the model learns generalizable and semantically meaningful information.

2021 ◽  
Vol 13 (3) ◽  
pp. 516
Author(s):  
Yakoub Bazi ◽  
Laila Bashmal ◽  
Mohamad M. Al Rahhal ◽  
Reham Al Dayil ◽  
Naif Al Ajlan

In this paper, we propose a remote-sensing scene-classification method based on vision transformers. These types of networks, which are now recognized as state-of-the-art models in natural language processing, do not rely on convolution layers as in standard convolutional neural networks (CNNs). Instead, they use multihead attention mechanisms as the main building block to derive long-range contextual relation between pixels in images. In a first step, the images under analysis are divided into patches, then converted to sequence by flattening and embedding. To keep information about the position, embedding position is added to these patches. Then, the resulting sequence is fed to several multihead attention layers for generating the final representation. At the classification stage, the first token sequence is fed to a softmax classification layer. To boost the classification performance, we explore several data augmentation strategies to generate additional data for training. Moreover, we show experimentally that we can compress the network by pruning half of the layers while keeping competing classification accuracies. Experimental results conducted on different remote-sensing image datasets demonstrate the promising capability of the model compared to state-of-the-art methods. Specifically, Vision Transformer obtains an average classification accuracy of 98.49%, 95.86%, 95.56% and 93.83% on Merced, AID, Optimal31 and NWPU datasets, respectively. While the compressed version obtained by removing half of the multihead attention layers yields 97.90%, 94.27%, 95.30% and 93.05%, respectively.


Symmetry ◽  
2019 ◽  
Vol 11 (11) ◽  
pp. 1393
Author(s):  
Dongju Park ◽  
Chang Wook Ahn

In this paper, we propose a novel data augmentation method with respect to the target context of the data via self-supervised learning. Instead of looking for the exact synonyms of masked words, the proposed method finds words that can replace the original words considering the context. For self-supervised learning, we can employ the masked language model (MLM), which masks a specific word within a sentence and obtains the original word. The MLM learns the context of a sentence through asymmetrical inputs and outputs. However, without using the existing MLM, we propose a label-masked language model (LMLM) that can include label information for the mask tokens used in the MLM to effectively use the MLM in data with label information. The augmentation method performs self-supervised learning using LMLM and then implements data augmentation through the trained model. We demonstrate that our proposed method improves the classification accuracy of recurrent neural networks and convolutional neural network-based classifiers through several experiments for text classification benchmark datasets, including the Stanford Sentiment Treebank-5 (SST5), the Stanford Sentiment Treebank-2 (SST2), the subjectivity (Subj), the Multi-Perspective Question Answering (MPQA), the Movie Reviews (MR), and the Text Retrieval Conference (TREC) datasets. In addition, since the proposed method does not use external data, it can eliminate the time spent collecting external data, or pre-training using external data.


2021 ◽  
Vol 12 (2) ◽  
pp. 1-24
Author(s):  
Md Abul Bashar ◽  
Richi Nayak

Language model (LM) has become a common method of transfer learning in Natural Language Processing (NLP) tasks when working with small labeled datasets. An LM is pretrained using an easily available large unlabelled text corpus and is fine-tuned with the labelled data to apply to the target (i.e., downstream) task. As an LM is designed to capture the linguistic aspects of semantics, it can be biased to linguistic features. We argue that exposing an LM model during fine-tuning to instances that capture diverse semantic aspects (e.g., topical, linguistic, semantic relations) present in the dataset will improve its performance on the underlying task. We propose a Mixed Aspect Sampling (MAS) framework to sample instances that capture different semantic aspects of the dataset and use the ensemble classifier to improve the classification performance. Experimental results show that MAS performs better than random sampling as well as the state-of-the-art active learning models to abuse detection tasks where it is hard to collect the labelled data for building an accurate classifier.


2020 ◽  
Vol 34 (09) ◽  
pp. 13389-13396 ◽  
Author(s):  
Jiaqi Lun ◽  
Jia Zhu ◽  
Yong Tang ◽  
Min Yang

Automatic short answer scoring (ASAS) is a research subject of intelligent education, which is a hot field of natural language understanding. Many experiments have confirmed that the ASAS system is not good enough, because its performance is limited by the training data. Focusing on the problem, we propose MDA-ASAS, multiple data augmentation strategies for improving performance on automatic short answer scoring. MDA-ASAS is designed to learn language representation enhanced by data augmentation strategies, which includes back-translation, correct answer as reference answer, and swap content. We argue that external knowledge has a profound impact on the ASAS process. Meanwhile, the Bidirectional Encoder Representations from Transformers (BERT) model has been shown to be effective for improving many natural language processing tasks, which acquires more semantic, grammatical and other features in large amounts of unsupervised data, and actually adds external knowledge. Combining with the latest BERT model, our experimental results on the ASAS dataset show that MDA-ASAS brings a significant gain over state-of-art. We also perform extensive ablation studies and suggest parameters for practical use.


2019 ◽  
Vol 9 (6) ◽  
pp. 1128 ◽  
Author(s):  
Yundong Li ◽  
Wei Hu ◽  
Han Dong ◽  
Xueyan Zhang

Using aerial cameras, satellite remote sensing or unmanned aerial vehicles (UAV) equipped with cameras can facilitate search and rescue tasks after disasters. The traditional manual interpretation of huge aerial images is inefficient and could be replaced by machine learning-based methods combined with image processing techniques. Given the development of machine learning, researchers find that convolutional neural networks can effectively extract features from images. Some target detection methods based on deep learning, such as the single-shot multibox detector (SSD) algorithm, can achieve better results than traditional methods. However, the impressive performance of machine learning-based methods results from the numerous labeled samples. Given the complexity of post-disaster scenarios, obtaining many samples in the aftermath of disasters is difficult. To address this issue, a damaged building assessment method using SSD with pretraining and data augmentation is proposed in the current study and highlights the following aspects. (1) Objects can be detected and classified into undamaged buildings, damaged buildings, and ruins. (2) A convolution auto-encoder (CAE) that consists of VGG16 is constructed and trained using unlabeled post-disaster images. As a transfer learning strategy, the weights of the SSD model are initialized using the weights of the CAE counterpart. (3) Data augmentation strategies, such as image mirroring, rotation, Gaussian blur, and Gaussian noise processing, are utilized to augment the training data set. As a case study, aerial images of Hurricane Sandy in 2012 were maximized to validate the proposed method’s effectiveness. Experiments show that the pretraining strategy can improve of 10% in terms of overall accuracy compared with the SSD trained from scratch. These experiments also demonstrate that using data augmentation strategies can improve mAP and mF1 by 72% and 20%, respectively. Finally, the experiment is further verified by another dataset of Hurricane Irma, and it is concluded that the paper method is feasible.


2021 ◽  
Vol 11 (1) ◽  
pp. 428
Author(s):  
Donghoon Oh ◽  
Jeong-Sik Park ◽  
Ji-Hwan Kim ◽  
Gil-Jin Jang

Speech recognition consists of converting input sound into a sequence of phonemes, then finding text for the input using language models. Therefore, phoneme classification performance is a critical factor for the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics is still a challenging problem even for state-of-the-art classification methods, and the classification errors are hard to be recovered in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method to exploit more suitable recognition models to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix from a baseline speech recognition model. Using automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. According to the results of a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% and 71.7% for the baseline and proposed hierarchical models, showing a 2.2% overall improvement.


Sensors ◽  
2021 ◽  
Vol 21 (13) ◽  
pp. 4365
Author(s):  
Kwangyong Jung ◽  
Jae-In Lee ◽  
Nammoon Kim ◽  
Sunjin Oh ◽  
Dong-Wook Seo

Radar target classification is an important task in the missile defense system. State-of-the-art studies using micro-doppler frequency have been conducted to classify the space object targets. However, existing studies rely highly on feature extraction methods. Therefore, the generalization performance of the classifier is limited and there is room for improvement. Recently, to improve the classification performance, the popular approaches are to build a convolutional neural network (CNN) architecture with the help of transfer learning and use the generative adversarial network (GAN) to increase the training datasets. However, these methods still have drawbacks. First, they use only one feature to train the network. Therefore, the existing methods cannot guarantee that the classifier learns more robust target characteristics. Second, it is difficult to obtain large amounts of data that accurately mimic real-world target features by performing data augmentation via GAN instead of simulation. To mitigate the above problem, we propose a transfer learning-based parallel network with the spectrogram and the cadence velocity diagram (CVD) as the inputs. In addition, we obtain an EM simulation-based dataset. The radar-received signal is simulated according to a variety of dynamics using the concept of shooting and bouncing rays with relative aspect angles rather than the scattering center reconstruction method. Our proposed model is evaluated on our generated dataset. The proposed method achieved about 0.01 to 0.39% higher accuracy than the pre-trained networks with a single input feature.


2021 ◽  
Vol 13 (4) ◽  
pp. 547
Author(s):  
Wenning Wang ◽  
Xuebin Liu ◽  
Xuanqin Mou

For both traditional classification and current popular deep learning methods, the limited sample classification problem is very challenging, and the lack of samples is an important factor affecting the classification performance. Our work includes two aspects. First, the unsupervised data augmentation for all hyperspectral samples not only improves the classification accuracy greatly with the newly added training samples, but also further improves the classification accuracy of the classifier by optimizing the augmented test samples. Second, an effective spectral structure extraction method is designed, and the effective spectral structure features have a better classification accuracy than the true spectral features.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Fahime Khozeimeh ◽  
Danial Sharifrazi ◽  
Navid Hoseini Izadi ◽  
Javad Hassannataj Joloudari ◽  
Afshin Shoeibi ◽  
...  

AbstractCOVID-19 has caused many deaths worldwide. The automation of the diagnosis of this virus is highly desired. Convolutional neural networks (CNNs) have shown outstanding classification performance on image datasets. To date, it appears that COVID computer-aided diagnosis systems based on CNNs and clinical information have not yet been analysed or explored. We propose a novel method, named the CNN-AE, to predict the survival chance of COVID-19 patients using a CNN trained with clinical information. Notably, the required resources to prepare CT images are expensive and limited compared to those required to collect clinical data, such as blood pressure, liver disease, etc. We evaluated our method using a publicly available clinical dataset that we collected. The dataset properties were carefully analysed to extract important features and compute the correlations of features. A data augmentation procedure based on autoencoders (AEs) was proposed to balance the dataset. The experimental results revealed that the average accuracy of the CNN-AE (96.05%) was higher than that of the CNN (92.49%). To demonstrate the generality of our augmentation method, we trained some existing mortality risk prediction methods on our dataset (with and without data augmentation) and compared their performances. We also evaluated our method using another dataset for further generality verification. To show that clinical data can be used for COVID-19 survival chance prediction, the CNN-AE was compared with multiple pre-trained deep models that were tuned based on CT images.


Sign in / Sign up

Export Citation Format

Share Document