Clinical Term Normalization Using Learned Edit Patterns and Subconcept Matching (Preprint)

Mapping Intimacies ◽

10.2196/preprints.23104 ◽

2020 ◽

Author(s):

Rohit Kate

Keyword(s):

Test Data ◽

Edit Distance ◽

Standard Test ◽

Training Data ◽

Shared Task ◽

Clinical Text ◽

System A ◽

Semantic Types ◽

The Given ◽

Clinical Terminologies

BACKGROUND Clinical terms mentioned in clinical text are often not in their standardized forms as listed in clinical terminologies due to linguistic and stylistic variations. However, many downstream automated applications require clinical terms mapped to their corresponding concepts in clinical terminologies, thus necessitating the task of clinical term normalization. OBJECTIVE In this paper, a system for clinical term normalization is presented which utilizes edit patterns to convert clinical terms into their normalized forms. METHODS The edit patterns are automatically learned from UMLS as well as from the given training data. The edit patterns are generalized sequences of edits which are derived from edit distance computations. The edit patterns are both character-based and word-based and are learned separately for different semantic types. Besides these edit patterns, the system also normalizes clinical terms through the subconcepts mentioned in them. RESULTS The system was evaluated on the MCN corpus as part of the 2019 n2c2 Track 3 shared task of clinical term normalization. It obtained 80.79% accuracy on the standard test data. The paper includes ablation studies to evaluate contributions of different components of the system. A challenging part of the task was disambiguation when a clinical term could be normalized to multiple concepts. CONCLUSIONS The learned edit patterns led the system to perform well on the normalization task. Given that the system is based on patterns, it is human-interpretable and is also capable of giving insights about common variations of clinical terms mentioned in clinical text that are different from their standardized forms.
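
The abstract above says the edit patterns are derived from edit distance computations, at both the character and the word level. The following is a minimal sketch of that first step only: recovering a sequence of edits from a standard Levenshtein table. The function name `edit_ops` and the example terms are hypothetical, and the paper's generalization of edit sequences into patterns is not reproduced here.

```python
def edit_ops(source, target):
    """Return one minimal list of edits that turns source into target.

    Works on any sequence: a string gives character-based edits,
    a list of words gives word-based edits.
    (Hypothetical helper, not the system described in the paper.)
    """
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete source[i-1]
                dist[i][j - 1] + 1,         # insert target[j-1]
                dist[i - 1][j - 1] + cost,  # keep or substitute
            )

    # Walk back from the bottom-right cell to collect the edits.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1  # items match, no edit needed
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1]))
            i -= 1
        else:
            ops.append(("insert", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


# Character-based edits on a spelling variant (illustrative terms only).
print(edit_ops("tumour", "tumor"))
# Word-based edits on a plural variant.
print(edit_ops("chest pains".split(), "chest pain".split()))
```

Passing a string yields character-based edits and passing a list of words yields word-based edits, which mirrors the two granularities mentioned in the abstract.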

Evaluating the state of the art in disorder recognition and normalization of the clinical narrative

Journal of the American Medical Informatics Association ◽

10.1136/amiajnl-2013-002544 ◽

2014 ◽

Vol 22 (1) ◽

pp. 143-154 ◽

Cited By ~ 48

Author(s):

Sameer Pradhan ◽

Noémie Elhadad ◽

Brett R South ◽

David Martinez ◽

Lee Christensen ◽

...

Keyword(s):

Gold Standard ◽

State Of The Art ◽

Hybrid Approach ◽

Standard Test ◽

The State ◽

Machine Learning Algorithms ◽

Training Data ◽

Clinical Text ◽

Clinical Narrative ◽

Community Evaluation

Abstract Objective The ShARe/CLEF eHealth 2013 Evaluation Lab Task 1 was organized to evaluate the state of the art on the clinical text in (i) disorder mention identification/recognition based on Unified Medical Language System (UMLS) definition (Task 1a) and (ii) disorder mention normalization to an ontology (Task 1b). Such a community evaluation has not been previously executed. Task 1a included a total of 22 system submissions, and Task 1b included 17. Most of the systems employed a combination of rules and machine learners. Materials and methods We used a subset of the Shared Annotated Resources (ShARe) corpus of annotated clinical text—199 clinical notes for training and 99 for testing (roughly 180 K words in total). We provided the community with the annotated gold standard training documents to build systems to identify and normalize disorder mentions. The systems were tested on a held-out gold standard test set to measure their performance. Results For Task 1a, the best-performing system achieved an F1 score of 0.75 (0.80 precision; 0.71 recall). For Task 1b, another system performed best with an accuracy of 0.59. Discussion Most of the participating systems used a hybrid approach by supplementing machine-learning algorithms with features generated by rules and gazetteers created from the training data and from external resources. Conclusions The task of disorder normalization is more challenging than that of identification. The ShARe corpus is available to the community as a reference standard for future studies.
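
The Task 1a figures reported above can be checked with the standard definition of F1 as the harmonic mean of precision and recall. This is a small sketch; the function name `f1_score` is an illustrative choice, not something taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the best Task 1a system.
print(round(f1_score(0.80, 0.71), 2))  # 0.75
```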

Emotion Detection in Suicide Notes using Maximum Entropy Classification

Biomedical Informatics Insights ◽

10.4137/bii.s8972 ◽

2012 ◽

Vol 5s1 ◽

pp. BII.S8972 ◽

Cited By ~ 8

Author(s):

Richard Wicentowski ◽

Matthew R. Sydes

Keyword(s):

Maximum Entropy ◽

Test Data ◽

Training Data ◽

Emotion Detection ◽

Shared Task ◽

Training Set ◽

Suicide Notes ◽

Syntactic Features ◽

Formed Part

An ensemble of supervised maximum entropy classifiers can accurately detect and identify sentiments expressed in suicide notes. Using lexical and syntactic features extracted from a training set of externally annotated suicide notes, we trained separate classifiers for each of fifteen pre-specified emotions. This formed part of the 2011 i2b2 NLP Shared Task, Track 2. The precision and recall of these classifiers related strongly with the number of occurrences of each emotion in the training data. Evaluating on previously unseen test data, our best system achieved an F1 score of 0.534.

Unified Medical Language System resources improve sieve-based generation and Bidirectional Encoder Representations from Transformers (BERT)–based ranking for concept normalization

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocaa080 ◽

2020 ◽

Vol 27 (10) ◽

pp. 1510-1519

Author(s):

Dongfang Xu ◽

Manoj Gopale ◽

Jiacheng Zhang ◽

Kris Brown ◽

Edmon Begoli ◽

...

Keyword(s):

Neural Network ◽

Relation Extraction ◽

Training Data ◽

Shared Task ◽

Semantic Type ◽

Language System ◽

Unified Medical Language System ◽

Medical Language ◽

Rank System ◽

Semantic Types

Abstract Objective Concept normalization, the task of linking phrases in text to concepts in an ontology, is useful for many downstream tasks including relation extraction, information retrieval, etc. We present a generate-and-rank concept normalization system based on our participation in the 2019 National NLP Clinical Challenges Shared Task Track 3 Concept Normalization. Materials and Methods The shared task provided 13 609 concept mentions drawn from 100 discharge summaries. We first design a sieve-based system that uses Lucene indices over the training data, Unified Medical Language System (UMLS) preferred terms, and UMLS synonyms to generate a list of possible concepts for each mention. We then design a listwise classifier based on the BERT (Bidirectional Encoder Representations from Transformers) neural network to rank the candidate concepts, integrating UMLS semantic types through a regularizer. Results Our generate-and-rank system was third of 33 in the competition, outperforming the candidate generator alone (81.66% vs 79.44%) and the previous state of the art (76.35%). During postevaluation, the model’s accuracy was increased to 83.56% via improvements to how training data are generated from UMLS and incorporation of our UMLS semantic type regularizer. Discussion Analysis of the model shows that prioritizing UMLS preferred terms yields better performance, that the UMLS semantic type regularizer results in qualitatively better concept predictions, and that the model performs well even on concepts not seen during training. Conclusions Our generate-and-rank framework for UMLS concept normalization integrates key UMLS features like preferred terms and semantic types with a neural network–based ranking model to accurately link phrases in text to UMLS concepts.

New polyp image classification technique using transfer learning of network-in-network structure in endoscopic images

Scientific Reports ◽

10.1038/s41598-021-83199-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Young Jae Kim ◽

Jang Pyo Bae ◽

Jun-Won Chung ◽

Dong Kyun Park ◽

Kwang Gi Kim ◽

...

Keyword(s):

Colorectal Cancer ◽

Transfer Learning ◽

Test Data ◽

State Of The Art ◽

Early Stage ◽

Statistical Significance ◽

Recall Rate ◽

Training Data ◽

Fine Tuning ◽

Accuracy Evaluation

Abstract Colorectal cancer is known to occur in the gastrointestinal tract. It is the third most common of 27 major types of cancer in South Korea and worldwide. Colorectal polyps are known to increase the potential of developing colorectal cancer. Detected polyps need to be resected to reduce the risk of developing cancer. This research improved the performance of polyp classification through the fine-tuning of Network-in-Network (NIN) after applying a pre-trained model of the ImageNet database. Random shuffling is performed 20 times on 1000 colonoscopy images. Each set of data is divided into 800 images of training data and 200 images of test data. An accuracy evaluation is performed on 200 images of test data in 20 experiments. Three compared methods were constructed from AlexNet by transferring the weights trained by three different state-of-the-art databases. A normal AlexNet based method without transfer learning was also compared. The accuracy of the proposed method was higher, with statistical significance, than the accuracy of four other state-of-the-art methods, and showed an 18.9% improvement over the normal AlexNet based method. The area under the curve was approximately 0.930 ± 0.020, and the recall rate was 0.929 ± 0.029. An automatic algorithm can assist endoscopists in identifying polyps that are adenomatous by considering a high recall rate and accuracy. This system can enable the timely resection of polyps at an early stage.
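
The evaluation protocol described above (20 random shuffles of 1000 images, each split into 800 training and 200 test images) can be sketched as follows. The function name `repeated_splits`, the fixed seed, and the use of integer indices in place of images are assumptions made for illustration. The abstract does not say how the shuffling was implemented.

```python
import random


def repeated_splits(n_items=1000, n_train=800, n_repeats=20, seed=0):
    """Build n_repeats random train/test index splits.

    Each repeat shuffles all indices and cuts them into a training part
    of n_train items and a test part holding the rest.
    The seed is an assumption, used only to make the sketch repeatable.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    splits = []
    for _ in range(n_repeats):
        shuffled = indices[:]  # copy so every repeat starts from the same list
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


splits = repeated_splits()
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))  # 20 800 200
```

Reported figures such as 0.930 ± 0.020 would then be a mean and spread of a metric computed once per split.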

Improved Training for Machine Learning: The Additional Potential of Innovative Algorithmic Approaches.

10.5194/egusphere-egu21-4683 ◽

2021 ◽

Author(s):

Octavian Dumitru ◽

Gottfried Schwarz ◽

Mihai Datcu ◽

Dongyang Ao ◽

Zhongling Huang ◽

...

Keyword(s):

Machine Learning ◽

Remote Sensing ◽

Test Data ◽

Satellite Images ◽

Training Data ◽

Data Selection ◽

Generative Adversarial Networks ◽

Radar Images ◽

Basic Work ◽

Selection Of

In recent years, much progress has been made with machine learning algorithms. Among the typical application fields of machine learning are many technical and commercial applications as well as Earth science analyses, where most often indirect and distorted detector data have to be converted to well-calibrated scientific data that are a prerequisite for a correct understanding of the desired physical quantities and their relationships. However, the provision of sufficient calibrated data is not enough for the testing, training, and routine processing of most machine learning applications. In principle, one also needs a clear strategy for the selection of necessary and useful training data and an easily understandable quality control of the finally desired parameters. At first glance, one could guess that this problem could be solved by a careful selection of representative test data covering many typical cases as well as some counterexamples. Then these test data can be used for the training of the internal parameters of a machine learning application. At second glance, however, many researchers found out that a simple stacking up of plain examples is not the best choice for many scientific applications. To get improved machine learning results, we concentrated on the analysis of satellite images depicting the Earth's surface under various conditions such as the selected instrument type, spectral bands, and spatial resolution. In our case, such data are routinely provided by the freely accessible European Sentinel satellite products (e.g., Sentinel-1 and Sentinel-2). Our basic work then included investigations of how some additional processing steps, to be linked with the selected training data, can provide better machine learning results. To this end, we analysed and compared three different approaches to find machine learning strategies for the joint selection and processing of training data for our Earth observation images: (i) One can optimize the training data selection by adapting the data selection to the specific instrument, target, and application characteristics [1]. (ii) As an alternative, one can dynamically generate new training parameters by Generative Adversarial Networks. This is comparable to the role of a sparring partner in boxing [2]. (iii) One can also use a hybrid semi-supervised approach for Synthetic Aperture Radar images with limited labelled data. The method is split into: polarimetric scattering classification, topic modelling for scattering labels, unsupervised constraint learning, and supervised label prediction with constraints [3]. We applied these strategies in the ExtremeEarth sea-ice monitoring project (http://earthanalytics.eu/). As a result, we can demonstrate for which application cases these three strategies will provide a promising alternative to a simple conventional selection of available training data.
[1] C.O. Dumitru et al., "Understanding Satellite Images: A Data Mining Module for Sentinel Images", Big Earth Data, 2020, 4(4), pp. 367-408.
[2] D. Ao et al., "Dialectical GAN for SAR Image Translation: From Sentinel-1 to TerraSAR-X", Remote Sensing, 2018, 10(10), pp. 1-23.
[3] Z. Huang et al., "HDEC-TFA: An Unsupervised Learning Approach for Discovering Physical Scattering Properties of Single-Polarized SAR Images", IEEE Transactions on Geoscience and Remote Sensing, 2020, pp. 1-18.

Application of the C4.5 Algorithm to Predict the Types of Disease in Pigs Based on Android

JELIKU (Jurnal Elektronik Ilmu Komputer Udayana) ◽

10.24843/jlk.2021.v10.i01.p14 ◽

2021 ◽

Vol 10 (1) ◽

pp. 105

Author(s):

I Gusti Ayu Purnami Indryaswari ◽

Ida Bagus Made Mahendra

Keyword(s):

Programming Language ◽

Test Data ◽

Training Data ◽

Data Sets ◽

Android Application ◽

C4.5 Algorithm ◽

Sqlite Database

Many Indonesian people, especially in Bali, raise pigs as livestock. Pigs are susceptible to various types of diseases, and there have been many cases of pig deaths due to disease that cause losses to breeders. Therefore, the authors set out to create an Android-based application that can predict the type of disease in pigs by applying the C4.5 algorithm. The C4.5 algorithm classifies data in order to obtain rules that are used for prediction. In this study, 50 training data sets covering 8 types of disease in pigs and 31 symptoms of disease were used; these were input into the system and processed so that the Android application can predict the type of disease in pigs. Testing on 15 test data sets produced an accuracy of 86.7%. The application features, built using the Kotlin programming language and the SQLite database, ran as expected in testing.
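
C4.5 builds its decision rules by repeatedly choosing the attribute with the highest gain ratio (information gain divided by split information). The sketch below computes that criterion for one categorical attribute. The symptom and disease values are made up for illustration and are not taken from the study's data, and the function names are hypothetical. Python is used here for brevity, although the application itself was written in Kotlin.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(items).values())


def gain_ratio(attribute_values, labels):
    """Gain ratio of splitting labels by a categorical attribute."""
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy after the split, weighted by group size.
    remainder = sum(len(group) / total * entropy(group)
                    for group in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute_values)
    # An attribute with a single value cannot split the data.
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: one symptom (present or absent) and a diagnosis.
fever = ["yes", "yes", "no", "no"]
disease = ["A", "A", "B", "B"]
print(gain_ratio(fever, disease))  # 1.0, the symptom separates the classes perfectly
```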

Synthetic Sonic Log Generation With Machine Learning: A Contest Summary From Five Methods

Petrophysics – The SPWLA Journal of Formation Evaluation and Reservoir Description ◽

10.30632/pjv62n4-2021a4 ◽

2021 ◽

Vol 62 (4) ◽

pp. 393-406

Author(s):

Yanxiang Yu ◽

Chicheng Xu ◽

Siddharth Misra ◽

Weichang Li ◽

...

Keyword(s):

Machine Learning ◽

Test Data ◽

Short Term Memory ◽

Rock Physics ◽

Training Data ◽

Machine Learning Techniques ◽

Blind Test ◽

Data Set ◽

Benchmark Model ◽

Sonic Log

Compressional and shear sonic traveltime logs (DTC and DTS, respectively) are crucial for subsurface characterization and seismic-well tie. However, these two logs are often missing or incomplete in many oil and gas wells. Therefore, many petrophysical and geophysical workflows include sonic log synthetization or pseudo-log generation based on multivariate regression or rock physics relations. Started on March 1, 2020, and concluded on May 7, 2020, the SPWLA PDDA SIG hosted a contest aiming to predict the DTC and DTS logs from seven “easy-to-acquire” conventional logs using machine-learning methods (GitHub, 2020). In the contest, a total number of 20,525 data points with half-foot resolution from three wells was collected to train regression models using machine-learning techniques. Each data point had seven features, consisting of the conventional “easy-to-acquire” logs: caliper, neutron porosity, gamma ray (GR), deep resistivity, medium resistivity, photoelectric factor, and bulk density, respectively, as well as two sonic logs (DTC and DTS) as the target. The separate data set of 11,089 samples from a fourth well was then used as the blind test data set. The prediction performance of the model was evaluated using root mean square error (RMSE) as the metric, shown in the equation below:

$$\mathrm{RMSE} = \sqrt{\frac{1}{2}\cdot\frac{1}{m}\sum_{i=1}^{m}\left[\left(\mathrm{DTC}_{\mathrm{pred}}^{i}-\mathrm{DTC}_{\mathrm{true}}^{i}\right)^{2}+\left(\mathrm{DTS}_{\mathrm{pred}}^{i}-\mathrm{DTS}_{\mathrm{true}}^{i}\right)^{2}\right]}$$

In the benchmark model (Yu et al., 2020), we used a Random Forest regressor and conducted minimal preprocessing to the training data set; an RMSE score of 17.93 was achieved on the test data set. The top five models from the contest, on average, beat the performance of our benchmark model by 27% in the RMSE score. In the paper, we will review these five solutions, including preprocess techniques and different machine-learning models, including neural network, long short-term memory (LSTM), and ensemble trees. We found that data cleaning and clustering were critical for improving the performance in all models.
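
The contest metric described above pools the squared errors of both sonic logs and averages them over 2m values before taking the square root. A minimal sketch follows, assuming plain Python lists of equal length. The function name `contest_rmse` and the sample numbers are illustrative, not contest data.

```python
import math


def contest_rmse(dtc_pred, dtc_true, dts_pred, dts_true):
    """Joint RMSE over DTC and DTS: sqrt of the mean squared error
    pooled across both logs (2 * m values for m data points)."""
    m = len(dtc_true)
    if m == 0 or not (len(dtc_pred) == len(dts_pred) == len(dts_true) == m):
        raise ValueError("all four inputs must be non-empty and equally long")
    squared_error = sum(
        (dp - dt) ** 2 + (sp - st) ** 2
        for dp, dt, sp, st in zip(dtc_pred, dtc_true, dts_pred, dts_true)
    )
    return math.sqrt(squared_error / (2 * m))


# Illustrative values only: errors of 3 on DTC and 4 on DTS at one depth.
print(contest_rmse([103.0], [100.0], [204.0], [200.0]))  # sqrt(12.5), about 3.54
```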

Equivalent Winding Capacitance Network for Transformer Transient Analysis Based on Standard Test Data

IEEE Transactions on Power Delivery ◽

10.1109/tpwrd.2016.2578047 ◽

2017 ◽

Vol 32 (4) ◽

pp. 1899-1906 ◽

Cited By ~ 1

Author(s):

Afshin Rezaei-Zare

Keyword(s):

Test Data ◽

Transient Analysis ◽

Standard Test

Analysis of Pressure Pulsations in Reciprocating Compressor Piping Systems

Journal of Engineering for Industry ◽

10.1115/1.3670908 ◽

1966 ◽

Vol 88 (2) ◽

pp. 164-168 ◽

Cited By ~ 4

Author(s):

S. S. Grover

Keyword(s):

Field Test ◽

Test Data ◽

Piping System ◽

Reciprocating Compressor ◽

Practical Application ◽

Pressure Pulsations ◽

System A ◽

Piping Systems ◽

Limiting Condition

This paper deals with pulsations in pressure and flow in the reciprocating compressor and connected piping system. A model is presented that describes the excitation at the compressor and the propagation of the pulsations in the interconnected piping. It has been adapted to digital computations to predict the pulse magnitudes in reciprocating compressor piping systems and to assess measures for their control. Predicted results have been compared with field test data and with simplified limiting condition results. A discussion of its practical application is included.

Simulation of Stock Prediction System using Artificial Neural Networks

International Journal of Business Analytics ◽

10.4018/ijban.2016070102 ◽

2016 ◽

Vol 3 (3) ◽

pp. 25-44 ◽

Cited By ~ 1

Author(s):

Omisore Olatunji Mumini ◽

Fayemiwo Michael Adebisi ◽

Ofoegbu Osita Edward ◽

Adeniyi Shukurat Abidemi

Keyword(s):

Test Data ◽

Stock Prices ◽

Prediction Accuracy ◽

Stock Exchange ◽

Training Data ◽

Prediction System ◽

Stock Trading ◽

Closing Price ◽

Non Linear ◽

Predicted Values

Stock trading, which relies on predicting the direction of future stock prices, is a dynamic business primarily based on human intuition. This involves analyzing some non-linear fundamental and technical stock variables which are recorded periodically. This study presents the development of an ANN-based prediction model for forecasting closing price in the stock markets. The major steps taken are identification of technical variables used for prediction of stock prices, collection and pre-processing of stock data, and formulation of the ANN-based predictive model. Stock data for the period between 2010 and 2014 were collected from the Nigerian Stock Exchange (NSE) and stored in a database. The data collected were divided into training and test data, where the training data were used to learn non-linear patterns that exist in the dataset, and the test data were used to validate the prediction accuracy of the model. Evaluation results obtained from WEKA show that discrepancies between actual and predicted values are insignificant.
