Occupational Injury Surveillance Methods Using Free Text Data and Machine Learning: Creating a Gold Standard Data Set

2020 ◽  
Author(s):  
Liane Hirabayashi ◽  
Erika Scott ◽  
Paul Jenkins ◽  
Nicole Krupa


Author(s):  
Yanyi Chu ◽  
Xiaoqi Shan ◽  
Dennis R. Salahub ◽  
Yi Xiong ◽  
Dong-Qing Wei

Abstract Identifying drug-target interactions (DTIs) is an important step in drug discovery and drug repositioning. To reduce the heavy experimental cost, machine learning has been widely applied to this field, producing many computational methods, especially binary classification methods. However, there is still much room for improvement in the performance of current methods. Multi-label learning can avoid difficulties faced by binary classification learning while achieving high predictive performance, yet it has not been explored extensively. The key challenge it faces is the exponential-sized output space, which can be mitigated by modeling label correlations. Thus, we facilitate multi-label classification by introducing community detection methods for DTI prediction, in an approach named DTI-MLCD. In addition, we updated the gold standard data set proposed in 2008 and still in use today. DTI-MLCD was evaluated on the gold standard data set before and after the update, and it outperforms both classical machine learning methods and other proposed benchmark methods, confirming its effectiveness. The data and code for this study can be found at https://github.com/a96123155/DTI-MLCD.
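
The label-grouping idea above, exploiting label correlations by partitioning the target space with community detection, can be sketched as follows. This is a minimal stand-in for that step, assuming a thresholded label co-occurrence graph whose connected components serve as the "communities"; the paper's actual community-detection method may differ.

```python
# Sketch: grouping label columns of a multi-label matrix into communities,
# a stand-in for the label-partitioning idea in DTI-MLCD. The co-occurrence
# threshold and the union-find grouping are illustrative assumptions.

from collections import defaultdict

def label_communities(label_matrix, min_cooccur=1):
    """Group label columns that co-occur in at least `min_cooccur` samples.

    label_matrix: list of rows, each a list of 0/1 label indicators.
    Returns sorted label-index groups (connected components of the
    thresholded co-occurrence graph).
    """
    n_labels = len(label_matrix[0])
    # Count pairwise label co-occurrences across samples.
    cooccur = defaultdict(int)
    for row in label_matrix:
        active = [j for j, v in enumerate(row) if v]
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                cooccur[(active[a], active[b])] += 1
    # Union-find over labels joined by sufficiently frequent co-occurrence.
    parent = list(range(n_labels))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (a, b), count in cooccur.items():
        if count >= min_cooccur:
            parent[find(a)] = find(b)
    groups = defaultdict(list)
    for j in range(n_labels):
        groups[find(j)].append(j)
    return sorted(sorted(g) for g in groups.values())
```

A classifier can then be trained per community rather than per label, shrinking the effective output space.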


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Eyal Klang ◽  
Benjamin R. Kummer ◽  
Neha S. Dangayach ◽  
Amy Zhong ◽  
M. Arash Kia ◽  
...  

Abstract Early admission to the neurosciences intensive care unit (NSICU) is associated with improved patient outcomes. Natural language processing offers new possibilities for mining free text in electronic health record data. We sought to develop a machine learning model using both tabular and free text data to identify patients requiring NSICU admission shortly after arrival to the emergency department (ED). We conducted a single-center, retrospective cohort study of adult patients at the Mount Sinai Hospital, an academic medical center in New York City. All patients presenting to our institutional ED between January 2014 and December 2018 were included. Structured (tabular) demographic, clinical, and bed movement record data, along with free text data from triage notes, were extracted from our institutional data warehouse. A machine learning model was trained to predict the likelihood of NSICU admission at 30 min from arrival to the ED. We identified 412,858 patients presenting to the ED over the study period, of whom 1900 (0.5%) were admitted to the NSICU. The daily median number of ED presentations was 231 (IQR 200–256) and the median time from ED presentation to the decision for NSICU admission was 169 min (IQR 80–324). A model trained only with text data had an area under the receiver operating characteristic curve (AUC) of 0.90 (95% confidence interval (CI) 0.87–0.91). A structured data-only model had an AUC of 0.92 (95% CI 0.91–0.94). A combined model trained on structured and text data had an AUC of 0.93 (95% CI 0.92–0.95). At a false positive rate of 1:100 (99% specificity), the combined model was 58% sensitive for identifying NSICU admission. A machine learning model using structured and free text data can predict NSICU admission soon after ED arrival, which could improve ED and NSICU resource allocation. Further studies should validate our findings.
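
The feature-fusion step behind the combined model can be illustrated with a short sketch: structured (tabular) ED fields and bag-of-words counts from the free-text triage note are concatenated into one feature vector for a downstream classifier. The field names and vocabulary here are illustrative assumptions, not the study's actual feature set.

```python
# Minimal sketch of fusing structured fields with triage-note text
# features into a single vector. Vocabulary and field names are
# hypothetical; a real model would learn them from the data warehouse.

import re

def featurize(structured, note, fields, vocab):
    """Concatenate structured values (in a fixed field order) with
    bag-of-words counts over a fixed vocabulary."""
    tokens = re.findall(r"[a-z]+", note.lower())
    counts = {w: 0 for w in vocab}
    for t in tokens:
        if t in counts:
            counts[t] += 1
    return ([float(structured[f]) for f in fields]
            + [float(counts[w]) for w in vocab])
```

For example, with fields `["age", "arrived_by_ems"]` and vocabulary `["headache", "weakness", "seizure"]`, a note mentioning headache and weakness yields `[67.0, 1.0, 1.0, 1.0, 0.0]` for a 67-year-old EMS arrival.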


2020 ◽  
pp. 1-12
Author(s):  
Qinglong Ding ◽  
Zhenfeng Ding

The characteristics of sports competitions play an important role in judging the fairness of a game and in improving athletes' skills. At present, feature recognition in sports competition is hampered by the environmental background. To improve recognition performance, this study improves the TLD (tracking-learning-detection) algorithm and uses machine learning to build a feature recognition model for sports competition based on the improved algorithm. The study also applies the improved TLD algorithm to long-term pedestrian tracking with PTZ cameras, addressing shortcomings of the original algorithm. The improved TLD algorithm is experimentally analyzed and verified on a standard data set, and the experimental results are presented visually using mathematical statistics methods. The research shows that the proposed method is effective.


2011 ◽  
Vol 38 (3) ◽  
pp. 1491-1502 ◽  
Author(s):  
Christelle Gendrin ◽  
Primož Markelj ◽  
Supriyanto Ardjo Pawiro ◽  
Jakob Spoerk ◽  
Christoph Bloch ◽  
...  

2021 ◽  
Author(s):  
Qi Jia ◽  
Dezheng Zhang ◽  
Haifeng Xu ◽  
Yonghong Xie

BACKGROUND Traditional Chinese medicine (TCM) clinical records contain patients' symptoms, diagnoses, and the subsequent treatment given by doctors. These records are important resources for research into and analysis of TCM diagnostic knowledge. However, most TCM clinical records are unstructured text, so a method to automatically extract medical entities from them is indispensable. OBJECTIVE Training a medical entity extraction model needs a large annotated corpus. The cost of corpus annotation is very high, and there is a lack of gold-standard data sets for supervised learning methods. We therefore utilized distantly supervised named entity recognition (NER) to address this challenge. METHODS We propose a span-level distantly supervised NER approach to extract TCM medical entities. It uses a pretrained language model with a simple multilayer neural network as the classifier to detect and classify entities. We also designed a negative sampling strategy for the span-level model: it randomly selects negative samples in every epoch and periodically filters out probable false-negative samples, reducing their adverse influence on training. RESULTS We compare our method with baseline methods on a gold-standard data set to illustrate its effectiveness. The F1 score of our method is 77.34, remarkably outperforming the baselines. CONCLUSIONS We developed a distantly supervised NER approach to extract medical entities from TCM clinical records and evaluated it on a TCM clinical record data set. Our experimental results indicate that the proposed approach achieves better performance than the baselines.
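
The span-level negative-sampling strategy described above can be sketched as follows: every candidate span up to a maximum width is a potential sample, annotated entity spans are positives, and each epoch a fresh random subset of the remaining spans is drawn as negatives, with spans flagged as likely false negatives excluded. The maximum span width of 5 and the API shape are illustrative assumptions, not the paper's exact design.

```python
# Sketch of span enumeration plus epoch-wise negative sampling with
# false-negative filtering, for a span-level distantly supervised NER
# model. Parameters are hypothetical.

import random

def enumerate_spans(n_tokens, max_len=5):
    """All (start, end) token spans with 1 <= end - start <= max_len."""
    return [(i, j) for i in range(n_tokens)
            for j in range(i + 1, min(i + max_len, n_tokens) + 1)]

def sample_negatives(n_tokens, positives, k, rng, suspected=frozenset()):
    """Draw k negative spans for one epoch, excluding annotated positives
    and spans the model currently suspects are unlabeled entities
    (the periodically refreshed false-negative filter)."""
    pool = [s for s in enumerate_spans(n_tokens)
            if s not in positives and s not in suspected]
    return rng.sample(pool, min(k, len(pool)))
```

Redrawing negatives each epoch exposes the classifier to many non-entity spans over training without ever scoring the full exponential span set at once.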


2020 ◽  
Author(s):  
Tjardo D Maarseveen ◽  
Timo Meinderink ◽  
Marcel J T Reinders ◽  
Johannes Knitza ◽  
Tom W J Huizinga ◽  
...  

BACKGROUND Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries. OBJECTIVE The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records. METHODS Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), and the F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best performing methods were applied to the remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation. RESULTS For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55), and 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). 
Applying this support vector machine classifier to the test data resulted in similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94), and with this method, we could identify 2873 patients with rheumatoid arthritis in less than 7 seconds out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best in the training set (AUROC 0.94; AUPRC 0.85; F1 score 0.82) and, applied to the test data, once again yielded good results (F1 score 0.67; PPV 0.97). CONCLUSIONS We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations at limited cost. Our approach is language and center independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size covering different countries at low cost from data already available in electronic health record systems.


2020 ◽  
Vol 4 (Supplement_2) ◽  
pp. 1167-1167
Author(s):  
Keisuke Ejima ◽  
Roger Zoh ◽  
Carmen Tekwe ◽  
David Allison ◽  
Andrew Brown

Abstract Objectives A gold standard method to measure energy intake (EI) is doubly labeled water (DLW), but it is expensive and not feasible for large studies. EI from self-report (EISR) is prone to bias but is still widely used for convenience; however, estimated associations between EISR and outcomes are biased in many cases. Double sampling with multiple imputation (MI) involves obtaining gold standard measurements (e.g., EIDLW) on a random subsample and proxy data (e.g., EISR) on the whole sample, then recovering the missing gold standard information using MI. However, it is not known what proportion of missingness in EIDLW is acceptable for obtaining unbiased estimates of associations between EI and outcomes. Methods We used body weight as an example outcome from the CALERIE Study (N = 218). We performed two regressions on the complete dataset, with body weight (kg) as the outcome: one with EIDLW as the predictor to estimate the ‘true’ coefficient (denoted βDLW), and one with EISR as the predictor (βSR). Random subsets of EIDLW were deleted (10% to 90% of the full data in 10% increments) to simulate obtaining EIDLW data on only a subset of participants. Regressions were performed on the subset EIDLW data with two different approaches: complete case analysis of only the subset (βDLWsub) and MI informed by EISR on the full data set (βMI). Bias was estimated as the difference between βDLW and βSR, between βDLW and βDLWsub for each EIDLW subset, and between βDLW and βMI for each subset. Resampling was repeated 100 times to assess the uncertainty of the bias. Results Bias of EISR was substantial (∼50%). Bias of βDLWsub was not significantly different from zero for all proportions of missing EIDLW; 95% CIs widened as the proportion of missingness increased, as expected. Bias of βMI was not significantly different from zero for missingness of EIDLW up to 80%, but βMI was significantly negatively biased toward βSR when the proportion of missingness was 90%. 
95% CIs of βMI estimates were narrower than those of βDLWsub for all amounts of missingness. Conclusions Unbiased and more precise estimates of the association between EI and body weight were obtained using MI with missingness of EIDLW as high as 80%. Collecting gold standard data on subsets may make unbiased estimates feasible in larger samples that would otherwise rely on self-report data alone. Funding Sources NIH R25HL124208. JSPS KAKENHI 18K18146. Meiji Yasuda Foundation of Health and Welfare 2019.
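
The double-sampling-with-MI idea can be sketched in a simplified form: EI_DLW is observed only on a subsample while EI_SR is observed for everyone; missing EI_DLW values are imputed from a regression on EI_SR with residual noise added, and the weight-on-EI_DLW slope is pooled across imputations. This is a stand-in under stated assumptions, with hypothetical variable names; a full analysis would also pool variances via Rubin's rules.

```python
# Simplified sketch of double sampling + multiple imputation. The
# regression-with-noise imputation model and the pooling (mean of
# per-imputation slopes) are illustrative assumptions.

import random
import statistics

def ols_slope(x, y):
    """Simple-regression slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def mi_slope(ei_sr, ei_dlw, outcome, m=20, seed=0):
    """Pooled slope of outcome on EI_DLW; `ei_dlw` uses None for missing."""
    rng = random.Random(seed)
    observed = [(s, d) for s, d in zip(ei_sr, ei_dlw) if d is not None]
    xs, ys = zip(*observed)
    # Imputation model: EI_DLW ~ EI_SR, fit on the doubly sampled subset.
    b = ols_slope(xs, ys)
    a = statistics.fmean(ys) - b * statistics.fmean(xs)
    resid_sd = statistics.stdev([y - (a + b * x) for x, y in observed])
    slopes = []
    for _ in range(m):
        # Fill each missing EI_DLW with a prediction plus residual noise.
        imputed = [d if d is not None else a + b * s + rng.gauss(0, resid_sd)
                   for s, d in zip(ei_sr, ei_dlw)]
        slopes.append(ols_slope(imputed, outcome))
    return statistics.fmean(slopes)
```

Adding residual noise to each imputation (rather than filling in the bare regression prediction) is what keeps the pooled slope from being artificially over-precise.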


BMJ Open ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. e045250
Author(s):  
Mike Bracher ◽  
Banyana C Madi-Segwagwe ◽  
Emma Winstanley ◽  
Helen Gillan ◽  
Tracy Long-Sutehall

Objectives A long-standing undersupply of eye tissue exists both in the UK and globally, and the UK National Health Service Blood and Transplant Service (NHSBT) has called for further research exploring barriers to eye donation. This study aims to: (1) describe reported reasons for non-donation of eye tissue from solid organ donors in the UK between 1 April 2014 and 31 March 2017 and (2) discuss these findings with respect to existing theories relating to non-donation of eyes by family members. Design Secondary analysis of a national primary data set of recorded reasons for non-donation of eyes from 2790 potential solid organ donors. Data analysis included descriptive statistics and qualitative content analysis of free-text data for 126 recorded cases of family decline of eye donation. Setting National data set covering solid organ donation (secondary care). Participants 2790 potential organ donors were assessed for eye donation eligibility between 1 April 2014 and 31 March 2017. Results Reasons for non-retrieval of eyes were recorded as: family wishes (n=1339, 48% of total cases); medical reasons (n=841, 30%); and deceased wishes (n=180, 7%). In more than 50% of recorded cases, reasons for non-donation were based on the family’s knowledge of the deceased’s wishes, their perception of the deceased’s wishes, and specific concerns regarding the processes or effects of eye donation on the deceased’s body. Findings are discussed with respect to existing theoretical perspectives. Conclusion Eye donation involves distinct psychological and sociocultural factors for families and healthcare professionals (HCPs) that have not been fully explored in research or integrated into service design. 
We propose areas for future research and service development, including the potential of retrieving only corneal discs, as opposed to whole eyes, to reduce disfigurement concerns; public education regarding donation processes; exploration of how request processes may influence acceptance of eye donation; and procedures for assessing familial responses to information provided during consent conversations.


10.2196/23930 ◽  
2020 ◽  
Vol 8 (11) ◽  
pp. e23930
Author(s):  
Tjardo D Maarseveen ◽  
Timo Meinderink ◽  
Marcel J T Reinders ◽  
Johannes Knitza ◽  
Tom W J Huizinga ◽  
...  



2020 ◽  
pp. 383-391 ◽  
Author(s):  
Yalun Li ◽  
Yung-Hung Luo ◽  
Jason A. Wampfler ◽  
Samuel M. Rubinstein ◽  
Firat Tiryaki ◽  
...  

PURPOSE Electronic health records (EHRs) are created primarily for nonresearch purposes; thus, the amounts of data are enormous and the data are crude, heterogeneous, incomplete, and largely unstructured, presenting challenges to effective analysis for timely, reliable results. In particular, research dealing with clinical notes relevant to patient care and outcomes has seldom been conducted, owing to the complexity of data extraction and accurate annotation. RECIST is a set of widely accepted research criteria for evaluating tumor response in patients undergoing antineoplastic therapy. The aim of this study was to identify textual sources for RECIST information in EHRs and to develop a corpus of pharmacotherapy and response entities for the development of natural language processing tools. METHODS We focused on pharmacotherapies and patient responses, using 55,120 medical notes (n = 72 types) in Mayo Clinic’s EHRs from 622 randomly selected patients who signed authorization for research. Using the Multidocument Annotation Environment tool, we applied and evaluated predefined keywords, along with time-interval and note-type filters, to identify RECIST information and establish a gold standard data set for patient outcome research. RESULTS Keyword filtering reduced the clinical notes to 37,406, and restricting to four note types within 12 months postdiagnosis further reduced the number to 5,005 notes, which were manually annotated and covered 97.9% of all cases (n = 609 of 622). The resulting data set of 609 cases (n = 503 for training and n = 106 for validation) contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease. This resource is readily expandable to specific drugs, regimens, and most solid tumors. 
CONCLUSION We have established a gold standard data set to support the development of biomedical informatics tools that accelerate research into antineoplastic therapeutic response.
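
The note-reduction step described above, keeping only notes of selected types, written within a window after diagnosis, that mention at least one predefined keyword, can be sketched as a simple filter. The record field names, keyword list, and the 12-month window default are illustrative assumptions.

```python
# Sketch of keyword + note-type + time-window filtering to shrink a note
# collection before manual annotation. Field names are hypothetical.

from datetime import date

def filter_notes(notes, diagnosis_dates, keywords, note_types,
                 window_days=365):
    """notes: dicts with patient_id, type, date (datetime.date), text.
    Keeps notes of an allowed type, within `window_days` after the
    patient's diagnosis, containing at least one keyword."""
    kept = []
    for note in notes:
        days_since_dx = (note["date"]
                         - diagnosis_dates[note["patient_id"]]).days
        if (note["type"] in note_types
                and 0 <= days_since_dx <= window_days
                and any(k in note["text"].lower() for k in keywords)):
            kept.append(note)
    return kept
```

Applied to a full EHR export, a filter like this yields the much smaller, higher-yield note set that annotators then review by hand.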

