ALBERT-based Self-ensemble Model with Semi-supervised Learning and Data Augmentation for Clinical Semantic Textual Similarity Calculation: Algorithm Validation Study (Preprint)

10.2196/preprints.23086 ◽

2020 ◽

Author(s):

Junyi Li ◽

Xuejie Zhang ◽

Xiaobing Zhou

Keyword(s):

Supervised Learning ◽

Semantic Similarity ◽

Data Augmentation ◽

Pearson Correlation ◽

Model Performance ◽

Small Data ◽

Calculation Algorithm ◽

Long Distance ◽

Similarity Calculation ◽

Semantic Textual Similarity

BACKGROUND In recent years, with the increase in the amount of information and the importance of information screening, increasing attention has been paid to the calculation of textual semantic similarity. In the medical field, with the rapid increase in electronic medical data, electronic medical records and medical research documents have become important data resources for medical clinical research. Medical textual semantic similarity calculation has become an urgent problem to be solved. The 2019 N2C2/OHNLP shared task Track on Clinical Semantic Textual Similarity is one of significant tasks for medical textual semantic similarity calculation. OBJECTIVE This research aims to solve two problems: 1) The size of medical datasets is small, which leads to the problem of insufficient learning with understanding of the models; 2) The data information will be lost in the process of long-distance propagation, which causes the models to be unable to grasp key information. METHODS This paper combines a text data augmentation method and a self-ensemble ALBERT model under semi-supervised learning to perform clinical textual semantic similarity calculation. RESULTS Compared with the competition methods the 2019 N2C2/OHNLP Track 1 ClinicalSTS, our method achieves state-of-the-art result with a value 0.92 of the Pearson correlation coefficient and surpasses the best result by 2 percentage point. CONCLUSIONS When the size of medical dataset is small, data augmentation and improved semi-supervised learning can increase the size of dataset and boost the learning efficiency of the model. Additionally, self-ensemble improves the model performance significantly. Through the results, we can know that our method has excellent performance and it has great potential to improve related medical problems. CLINICALTRIAL

Download Full-text

ClassMix: Segmentation-Based Data Augmentation for Semi-Supervised Learning

2021 IEEE Winter Conference on Applications of Computer Vision (WACV) ◽

10.1109/wacv48630.2021.00141 ◽

2021 ◽

Author(s):

Viktor Olsson ◽

Wilhelm Tranheden ◽

Juliano Pinto ◽

Lennart Svensson

Keyword(s):

Supervised Learning ◽

Data Augmentation

Download Full-text

Data augmentation and semi-supervised learning for deep neural networks-based text classifier

Proceedings of the 35th Annual ACM Symposium on Applied Computing ◽

10.1145/3341105.3373992 ◽

2020 ◽

Author(s):

Heereen Shim ◽

Stijn Luca ◽

Dietwig Lowet ◽

Bart Vanrumste

Keyword(s):

Neural Networks ◽

Supervised Learning ◽

Deep Neural Networks ◽

Data Augmentation

Download Full-text

Simplifying the Supervised Learning of Kerr Nonlinearity Compensation Algorithms by Data Augmentation

2020 European Conference on Optical Communications (ECOC) ◽

10.1109/ecoc48923.2020.9333417 ◽

2020 ◽

Author(s):

Vladislav Neskorniuk ◽

Pedro J. Freire ◽

Antonio Napoli ◽

Bernhard Spinnler ◽

Wolfgang Schairer ◽

...

Keyword(s):

Supervised Learning ◽

Data Augmentation ◽

Kerr Nonlinearity ◽

Nonlinearity Compensation

Download Full-text

A review: preprocessing techniques and data augmentation for sentiment analysis

Computational Social Networks ◽

10.1186/s40649-020-00080-x ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Huu-Thanh Duong ◽

Tram-Anh Nguyen-Thi

Keyword(s):

Machine Learning ◽

Sentiment Analysis ◽

Supervised Learning ◽

Data Augmentation ◽

Original Data ◽

Training Data ◽

Unseen Data ◽

Augmentation Techniques ◽

User Intervention

AbstractIn literature, the machine learning-based studies of sentiment analysis are usually supervised learning which must have pre-labeled datasets to be large enough in certain domains. Obviously, this task is tedious, expensive and time-consuming to build, and hard to handle unseen data. This paper has approached semi-supervised learning for Vietnamese sentiment analysis which has limited datasets. We have summarized many preprocessing techniques which were performed to clean and normalize data, negation handling, intensification handling to improve the performances. Moreover, data augmentation techniques, which generate new data from the original data to enrich training data without user intervention, have also been presented. In experiments, we have performed various aspects and obtained competitive results which may motivate the next propositions.

Download Full-text

A semi-supervised learning detection method for vision-based monitoring of construction sites by integrating teacher-student networks and data augmentation

Advanced Engineering Informatics ◽

10.1016/j.aei.2021.101372 ◽

2021 ◽

Vol 50 ◽

pp. 101372

Author(s):

Bo Xiao ◽

Yuxuan Zhang ◽

Yuan Chen ◽

Xianfei Yin

Keyword(s):

Supervised Learning ◽

Data Augmentation ◽

Detection Method ◽

Construction Sites ◽

Teacher Student

Download Full-text

Impact of data augmentation on supervised learning for a moving mid-frequency source

The Journal of the Acoustical Society of America ◽

10.1121/10.0007284 ◽

2021 ◽

Vol 150 (5) ◽

pp. 3914-3928

Author(s):

J. A. Castro-Correa ◽

M. Badiey ◽

T. B. Neilsen ◽

D. P. Knobles ◽

W. S. Hodgkiss

Keyword(s):

Supervised Learning ◽

Data Augmentation ◽

Frequency Source

Download Full-text

Text Mining Drug-Protein Interactions using an Ensemble of BERT, Sentence BERT and T5 models

10.1101/2021.10.26.465944 ◽

2021 ◽

Author(s):

Xin Sui ◽

Wanjing Wang ◽

Jinfeng Zhang

Keyword(s):

Protein Interactions ◽

Clustering Algorithm ◽

Data Augmentation ◽

Majority Vote ◽

Classification Model ◽

Ensemble Model ◽

K Nearest Neighbors ◽

Test Dataset ◽

Improved Performance ◽

Using Data

In this work, we trained an ensemble model for predicting drug-protein interactions within a sentence based on only its semantics. Our ensembled model was built using three separate models: 1) a classification model using a fine-tuned BERT model; 2) a fine-tuned sentence BERT model that embeds every sentence into a vector; and 3) another classification model using a fine-tuned T5 model. In all models, we further improved performance using data augmentation. For model 2, we predicted the label of a sentence using k-nearest neighbors with its embedded vector. We also explored ways to ensemble these 3 models: a) we used the majority vote method to ensemble these 3 models; and b) based on the HDBSCAN clustering algorithm, we trained another ensemble model using features from all the models to make decisions. Our best model achieved an F-1 score of 0.753 on the BioCreative VII Track 1 test dataset.

Download Full-text

A Semantic Textual Similarity Calculation Model Based on Pre-training Model

Knowledge Science, Engineering and Management - Lecture Notes in Computer Science ◽

10.1007/978-3-030-82147-0_1 ◽

2021 ◽

pp. 3-15

Author(s):

Zhaoyun Ding ◽

Kai Liu ◽

Wenhao Wang ◽

Bin Liu

Keyword(s):

Training Model ◽

Calculation Model ◽

Model Based ◽

Similarity Calculation ◽

Semantic Textual Similarity

Download Full-text

Self-Supervised Contextual Data Augmentation for Natural Language Processing

Symmetry ◽

10.3390/sym11111393 ◽

2019 ◽

Vol 11 (11) ◽

pp. 1393

Author(s):

Dongju Park ◽

Chang Wook Ahn

Keyword(s):

Supervised Learning ◽

Language Processing ◽

Recurrent Neural Networks ◽

Question Answering ◽

Data Augmentation ◽

Language Model ◽

Contextual Data ◽

External Data ◽

Label Information ◽

Benchmark Datasets

In this paper, we propose a novel data augmentation method with respect to the target context of the data via self-supervised learning. Instead of looking for the exact synonyms of masked words, the proposed method finds words that can replace the original words considering the context. For self-supervised learning, we can employ the masked language model (MLM), which masks a specific word within a sentence and obtains the original word. The MLM learns the context of a sentence through asymmetrical inputs and outputs. However, without using the existing MLM, we propose a label-masked language model (LMLM) that can include label information for the mask tokens used in the MLM to effectively use the MLM in data with label information. The augmentation method performs self-supervised learning using LMLM and then implements data augmentation through the trained model. We demonstrate that our proposed method improves the classification accuracy of recurrent neural networks and convolutional neural network-based classifiers through several experiments for text classification benchmark datasets, including the Stanford Sentiment Treebank-5 (SST5), the Stanford Sentiment Treebank-2 (SST2), the subjectivity (Subj), the Multi-Perspective Question Answering (MPQA), the Movie Reviews (MR), and the Text Retrieval Conference (TREC) datasets. In addition, since the proposed method does not use external data, it can eliminate the time spent collecting external data, or pre-training using external data.

Download Full-text