Using distant supervision to augment manually annotated data for relation extraction

2019 ◽  
Author(s):  
Peng Su ◽  
Gang Li ◽  
Cathy Wu ◽  
K. Vijay-Shanker

Abstract: Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require a large amount of annotated training data, while only small labeled datasets are available for many natural language processing tasks in biomedical literature. Building large datasets for deep learning is expensive because it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained using distant supervision. Because data obtained by distant supervision is often noisy, we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.
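As a rough illustration of the idea in this abstract, the sketch below labels sentences by distant supervision and then drops likely-noisy matches. The knowledge base `KNOWN_PAIRS`, the negation-cue filter, and the `max_gap` distance threshold are all hypothetical choices made for this example; the abstract does not say which heuristics the authors actually used.

```python
import re
from typing import NamedTuple


class Example(NamedTuple):
    sentence: str
    head: str
    tail: str
    label: str


# Hypothetical knowledge base of entity pairs known to be related (assumption).
KNOWN_PAIRS = {("aspirin", "warfarin"): "interacts_with"}

# Hypothetical noise heuristic: cue words suggesting the relation is not asserted.
NEGATION_CUES = re.compile(r"\b(no|not|without)\b", re.IGNORECASE)


def distant_label(sentences, known_pairs=KNOWN_PAIRS, max_gap=10):
    """Label a sentence with a relation when both entities of a known pair occur in it."""
    examples = []
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        for (head, tail), relation in known_pairs.items():
            # Distant supervision step: both entities must co-occur in the sentence.
            if head not in tokens or tail not in tokens:
                continue
            # Heuristic 1 (assumed): skip pairs whose first mentions are far apart.
            if abs(tokens.index(head) - tokens.index(tail)) > max_gap:
                continue
            # Heuristic 2 (assumed): skip sentences containing a negation cue.
            if NEGATION_CUES.search(sentence):
                continue
            examples.append(Example(sentence, head, tail, relation))
    return examples


if __name__ == "__main__":
    corpus = [
        "Aspirin increases the bleeding risk of warfarin.",
        "Aspirin does not interact with warfarin.",
        "Ibuprofen was given.",
    ]
    for example in distant_label(corpus):
        print(example)
```

The automatically labeled examples would then be combined with the manually annotated set; how the two are combined during training is not specified in the abstract and is left out of this sketch.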

2021 ◽  
Vol 45 (10) ◽  
Author(s):  
A. W. Olthof ◽  
P. M. A. van Ooijen ◽  
L. J. Cornelissen

Abstract: In radiology, natural language processing (NLP) allows the extraction of valuable information from radiology reports. It can be used for various downstream tasks such as quality improvement, epidemiological research, and monitoring guideline adherence. Class imbalance, variation in dataset size, variation in report complexity, and algorithm type all influence NLP performance but have not yet been systematically and interrelatedly evaluated. In this study, we investigate the effect of these factors on the performance of four types of deep learning-based NLP: a fully connected neural network (Dense), a long short-term memory recurrent neural network (LSTM), a convolutional neural network (CNN), and Bidirectional Encoder Representations from Transformers (BERT). Two datasets consisting of radiologist-annotated reports of trauma radiographs (n = 2469) and of chest radiographs and computed tomography (CT) studies (n = 2255) were split into training sets (80%) and testing sets (20%). The training data was used to train all four model types in 84 experiments (Fracture-data) and 45 experiments (Chest-data) with variation in size and prevalence. Performance was evaluated on sensitivity, specificity, positive predictive value, negative predictive value, area under the curve, and F score. All four model architectures demonstrated high performance on the radiology reports, with metrics reaching above 0.90. BERT outperformed CNN, LSTM, and Dense because its results remained stable despite variation in training size and prevalence. Awareness of variation in prevalence is warranted because it impacts sensitivity and specificity in opposite directions.
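For readers less familiar with the evaluation metrics listed in this abstract, the following sketch computes most of them from the four cells of a binary confusion matrix. It assumes that "F score" means the F1 score, which the abstract does not state, and it omits area under the curve because that requires per-report prediction scores rather than counts. The function name and the example counts are made up for illustration.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary classification metrics from confusion-matrix counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value (precision)
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),  # assumed meaning of "F score"
    }


if __name__ == "__main__":
    # Made-up counts for a test set of 100 reports.
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```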


2015 ◽  
Vol 5 (3) ◽  
pp. 19-38 ◽  
Author(s):  
María Herrero-Zazo ◽  
Isabel Segura-Bedmar ◽  
Janna Hastings ◽  
Paloma Martínez

Natural Language Processing (NLP) techniques can provide an interesting way to mine the growing biomedical literature and a promising approach for new knowledge discovery. However, the major bottleneck in this area is that these systems rely on specific resources providing the domain knowledge. Domain ontologies provide a contextual framework and a semantic representation of the domain, and they can contribute to better performance of current NLP systems. However, their contribution to information extraction (IE) has not yet been well studied. The aim of this paper is to provide insights into the potential role that domain ontologies can play in NLP. To do this, the authors apply the drug-drug interactions ontology (DINTO) to named entity recognition and relation extraction from pharmacological texts. The authors use the DDI corpus, a gold standard for the development and evaluation of IE systems in this domain, and evaluate their results in the framework of the SemEval-2013 DDI Extraction task.


Author(s):  
Tian Kang ◽  
Adler Perotte ◽  
Youlan Tang ◽  
Casey Ta ◽  
Chunhua Weng

Abstract

Objective: The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity.

Materials and Methods: We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT.

Results: UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain on micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82).

Conclusions: This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.
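To make the augmentation idea concrete, here is a minimal sketch of one plausible ingredient: replacing an annotated entity mention with a synonym while keeping the BIO tags aligned. This is not the authors' UMLS-EDA implementation. The `SYNONYMS` table is a hypothetical stand-in for a UMLS lookup, the function name is invented for this example, and the abstract does not describe which augmentation operations the method actually performs.

```python
import random

# Hypothetical synonym table standing in for a UMLS concept lookup (assumption).
SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "hypertension": ["high blood pressure"],
}


def replace_mentions(tokens, tags, synonyms=SYNONYMS, rng=None):
    """Swap each BIO-tagged mention for a synonym, re-deriving tags for the new tokens."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            entity_type = tags[i][2:]
            # Find the end of the mention: consecutive I- tags of the same type.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + entity_type:
                j += 1
            mention = " ".join(tokens[i:j]).lower()
            options = synonyms.get(mention)
            new_tokens = rng.choice(options).split() if options else tokens[i:j]
            out_tokens.extend(new_tokens)
            # The first token gets B-, the remaining tokens get I-.
            out_tags.extend(
                ["B-" + entity_type] + ["I-" + entity_type] * (len(new_tokens) - 1)
            )
            i = j
        else:
            # Tokens outside a mention are copied unchanged.
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags


if __name__ == "__main__":
    sentence = ["Patient", "has", "hypertension", "."]
    labels = ["O", "O", "B-Disease", "O"]
    print(replace_mentions(sentence, labels))
```

Each augmented sentence keeps one tag per token, so it can be appended to the NER training set without re-annotation.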


Author(s):  
K.G.C.M Kooragama ◽  
L.R.W.D. Jayashanka ◽  
J.A. Munasinghe ◽  
K.W. Jayawardana ◽  
Muditha Tissera ◽  
...  
