Occupational Injury Surveillance Methods Using Free Text Data and Machine Learning: Creating a Gold Standard Data Set

2020 ◽  
Author(s):  
Liane Hirabayashi ◽  
Erika Scott ◽  
Paul Jenkins ◽  
Nicole Krupa


Author(s):  
Yanyi Chu ◽  
Xiaoqi Shan ◽  
Dennis R. Salahub ◽  
Yi Xiong ◽  
Dong-Qing Wei

Abstract Identifying drug-target interactions (DTIs) is an important step in drug discovery and drug repositioning. To reduce the heavy experimental cost, machine learning has been widely applied to this field, producing many computational methods, especially binary classification methods. However, there is still much room for improvement in the performance of current methods. Multi-label learning can avoid difficulties faced by binary classification learning while achieving high predictive performance, yet it has not been explored extensively. The key challenge it faces is the exponential-sized output space, which can be mitigated by modeling label correlations. Thus, we facilitate multi-label classification by introducing community detection methods for DTI prediction, in an approach named DTI-MLCD. In addition, we updated the gold standard data set proposed in 2008 and still in use today. DTI-MLCD was evaluated on the gold standard data set before and after the update, and it outperforms both classical machine learning methods and other proposed benchmark methods, confirming its effectiveness. The data and code for this study can be found at https://github.com/a96123155/DTI-MLCD.
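
The label-grouping idea above, exploiting label correlations by partitioning the target space with community detection, can be sketched as follows. This is a minimal stand-in for that step, assuming a thresholded label co-occurrence graph whose connected components serve as the "communities"; the paper's actual community-detection method may differ.

```python
# Sketch: grouping label columns of a multi-label matrix into communities,
# a stand-in for the label-partitioning idea in DTI-MLCD. The co-occurrence
# threshold and the union-find grouping are illustrative assumptions.

from collections import defaultdict

def label_communities(label_matrix, min_cooccur=1):
    """Group label columns that co-occur in at least `min_cooccur` samples.

    label_matrix: list of rows, each a list of 0/1 label indicators.
    Returns sorted label-index groups (connected components of the
    thresholded co-occurrence graph).
    """
    n_labels = len(label_matrix[0])
    # Count pairwise label co-occurrences across samples.
    cooccur = defaultdict(int)
    for row in label_matrix:
        active = [j for j, v in enumerate(row) if v]
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                cooccur[(active[a], active[b])] += 1
    # Union-find over labels joined by sufficiently frequent co-occurrence.
    parent = list(range(n_labels))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for (a, b), count in cooccur.items():
        if count >= min_cooccur:
            parent[find(a)] = find(b)
    groups = defaultdict(list)
    for j in range(n_labels):
        groups[find(j)].append(j)
    return sorted(sorted(g) for g in groups.values())
```

A classifier can then be trained per community rather than per label, shrinking the effective output space.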


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Eyal Klang ◽  
Benjamin R. Kummer ◽  
Neha S. Dangayach ◽  
Amy Zhong ◽  
M. Arash Kia ◽  
...  

Abstract Early admission to the neurosciences intensive care unit (NSICU) is associated with improved patient outcomes. Natural language processing offers new possibilities for mining free text in electronic health record data. We sought to develop a machine learning model using both tabular and free text data to identify patients requiring NSICU admission shortly after arrival to the emergency department (ED). We conducted a single-center, retrospective cohort study of adult patients at the Mount Sinai Hospital, an academic medical center in New York City. All patients presenting to our institutional ED between January 2014 and December 2018 were included. Structured (tabular) demographic, clinical, and bed movement record data, along with free text data from triage notes, were extracted from our institutional data warehouse. A machine learning model was trained to predict the likelihood of NSICU admission at 30 min from arrival to the ED. We identified 412,858 patients presenting to the ED over the study period, of whom 1900 (0.5%) were admitted to the NSICU. The daily median number of ED presentations was 231 (IQR 200–256) and the median time from ED presentation to the decision for NSICU admission was 169 min (IQR 80–324). A model trained only with text data had an area under the receiver operating characteristic curve (AUC) of 0.90 (95% confidence interval (CI) 0.87–0.91). A structured data-only model had an AUC of 0.92 (95% CI 0.91–0.94). A combined model trained on structured and text data had an AUC of 0.93 (95% CI 0.92–0.95). At a false positive rate of 1:100 (99% specificity), the combined model was 58% sensitive for identifying NSICU admission. A machine learning model using structured and free text data can predict NSICU admission soon after ED arrival, which could improve ED and NSICU resource allocation. Further studies should validate our findings.
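
The feature-fusion step behind the combined model can be illustrated with a short sketch: structured (tabular) ED fields and bag-of-words counts from the free-text triage note are concatenated into one feature vector for a downstream classifier. The field names and vocabulary here are illustrative assumptions, not the study's actual feature set.

```python
# Minimal sketch of fusing structured fields with triage-note text
# features into a single vector. Vocabulary and field names are
# hypothetical; a real model would learn them from the data warehouse.

import re

def featurize(structured, note, fields, vocab):
    """Concatenate structured values (in a fixed field order) with
    bag-of-words counts over a fixed vocabulary."""
    tokens = re.findall(r"[a-z]+", note.lower())
    counts = {w: 0 for w in vocab}
    for t in tokens:
        if t in counts:
            counts[t] += 1
    return ([float(structured[f]) for f in fields]
            + [float(counts[w]) for w in vocab])
```

For example, with fields `["age", "arrived_by_ems"]` and vocabulary `["headache", "weakness", "seizure"]`, a note mentioning headache and weakness yields `[67.0, 1.0, 1.0, 1.0, 0.0]` for a 67-year-old EMS arrival.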


2020 ◽  
pp. 1-12
Author(s):  
Qinglong Ding ◽  
Zhenfeng Ding

The characteristics of sports competitions play an important role in judging the fairness of a game and in improving athletes' skills. At present, feature recognition in sports competition is hampered by the environmental background. To improve recognition performance, this study improves the TLD (tracking-learning-detection) algorithm and uses machine learning to build a feature recognition model for sports competition based on the improved algorithm. The study also applies the improved TLD algorithm to long-term pedestrian tracking with PTZ cameras, addressing shortcomings of the original algorithm. The improved TLD algorithm is experimentally analyzed and verified on a standard data set, and the experimental results are presented visually using mathematical statistics methods. The research shows that the proposed method is effective.


2011 ◽  
Vol 38 (3) ◽  
pp. 1491-1502 ◽  
Author(s):  
Christelle Gendrin ◽  
Primož Markelj ◽  
Supriyanto Ardjo Pawiro ◽  
Jakob Spoerk ◽  
Christoph Bloch ◽  
...  

2021 ◽  
Author(s):  
Qi Jia ◽  
Dezheng Zhang ◽  
Haifeng Xu ◽  
Yonghong Xie

BACKGROUND Traditional Chinese medicine (TCM) clinical records contain patients' symptoms, diagnoses, and the subsequent treatment given by doctors. These records are important resources for research into and analysis of TCM diagnostic knowledge. However, most TCM clinical records are unstructured text, so a method to automatically extract medical entities from them is indispensable. OBJECTIVE Training a medical entity extraction model needs a large annotated corpus. The cost of corpus annotation is very high, and there is a lack of gold-standard data sets for supervised learning methods. We therefore utilized distantly supervised named entity recognition (NER) to address this challenge. METHODS We propose a span-level distantly supervised NER approach to extract TCM medical entities. It uses a pretrained language model with a simple multilayer neural network as the classifier to detect and classify entities. We also designed a negative sampling strategy for the span-level model: it randomly selects negative samples in every epoch and periodically filters out probable false-negative samples, reducing their adverse influence on training. RESULTS We compare our method with baseline methods on a gold-standard data set to illustrate its effectiveness. The F1 score of our method is 77.34, remarkably outperforming the baselines. CONCLUSIONS We developed a distantly supervised NER approach to extract medical entities from TCM clinical records and evaluated it on a TCM clinical record data set. Our experimental results indicate that the proposed approach achieves better performance than the baselines.
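
The span-level negative-sampling strategy described above can be sketched as follows: every candidate span up to a maximum width is a potential sample, annotated entity spans are positives, and each epoch a fresh random subset of the remaining spans is drawn as negatives, with spans flagged as likely false negatives excluded. The maximum span width of 5 and the API shape are illustrative assumptions, not the paper's exact design.

```python
# Sketch of span enumeration plus epoch-wise negative sampling with
# false-negative filtering, for a span-level distantly supervised NER
# model. Parameters are hypothetical.

import random

def enumerate_spans(n_tokens, max_len=5):
    """All (start, end) token spans with 1 <= end - start <= max_len."""
    return [(i, j) for i in range(n_tokens)
            for j in range(i + 1, min(i + max_len, n_tokens) + 1)]

def sample_negatives(n_tokens, positives, k, rng, suspected=frozenset()):
    """Draw k negative spans for one epoch, excluding annotated positives
    and spans the model currently suspects are unlabeled entities
    (the periodically refreshed false-negative filter)."""
    pool = [s for s in enumerate_spans(n_tokens)
            if s not in positives and s not in suspected]
    return rng.sample(pool, min(k, len(pool)))
```

Redrawing negatives each epoch exposes the classifier to many non-entity spans over training without ever scoring the full exponential span set at once.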


2020 ◽  
Author(s):  
Tjardo D Maarseveen ◽  
Timo Meinderink ◽  
Marcel J T Reinders ◽  
Johannes Knitza ◽  
Tom W J Huizinga ◽  
...  

BACKGROUND Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center and language specific. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries. OBJECTIVE The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records. METHODS Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), and the F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best performing methods were applied to the remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation. RESULTS For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55), and 4 methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). 
Applying this support vector machine classifier to the test data resulted in similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94), and with this method, we could identify 2873 patients with rheumatoid arthritis in less than 7 seconds out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best in the training set (AUROC 0.94; AUPRC 0.85; F1 score 0.82) and, applied to the test data, once again yielded good results (F1 score 0.67; PPV 0.97). CONCLUSIONS We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations at limited cost. Our approach is language and center independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size covering different countries at low cost from data already available in electronic health record systems.


2020 ◽  
Vol 4 (Supplement_2) ◽  
pp. 1167-1167
Author(s):  
Keisuke Ejima ◽  
Roger Zoh ◽  
Carmen Tekwe ◽  
David Allison ◽  
Andrew Brown

Abstract Objectives A gold standard method to measure energy intake (EI) is doubly labeled water (DLW), but it is expensive and not feasible for large studies. EI from self-report (EISR) is prone to bias but is still widely used for convenience; however, estimated associations between EISR and outcomes are biased in many cases. Double sampling with multiple imputation (MI) involves obtaining gold standard measurements (e.g., EIDLW) on a random subsample and proxy data (e.g., EISR) on the whole sample, then recovering the missing gold standard information using MI. However, it is not known what proportion of missingness in EIDLW is acceptable for obtaining unbiased estimates of associations between EI and outcomes. Methods We used body weight as an example outcome from the CALERIE Study (N = 218). We performed two regressions on the complete dataset, with body weight (kg) as the outcome: one with EIDLW as the predictor to estimate the ‘true’ coefficient (denoted βDLW), and one with EISR as the predictor (βSR). Random subsets of EIDLW were deleted (10% to 90% of the full data in 10% increments) to simulate obtaining EIDLW data on only a subset of participants. Regressions were performed on the subset EIDLW data with two different approaches: complete case analysis of only the subset (βDLWsub) and MI informed by EISR on the full data set (βMI). Bias was estimated as the difference between βDLW and βSR, between βDLW and βDLWsub for each EIDLW subset, and between βDLW and βMI for each subset. Resampling was repeated 100 times to assess the uncertainty of the bias. Results Bias of EISR was substantial (∼50%). Bias of βDLWsub was not significantly different from zero for all proportions of missing EIDLW; 95% CIs widened as the proportion of missingness increased, as expected. Bias of βMI was not significantly different from zero for missingness of EIDLW up to 80%, but βMI was significantly negatively biased toward βSR when the proportion of missingness was 90%. 
95% CIs of βMI estimates were narrower than those of βDLWsub for all amounts of missingness. Conclusions Unbiased and more precise estimates of the association between EI and body weight were obtained using MI with missingness of EIDLW as high as 80%. Collecting gold standard data on subsets may make unbiased estimates feasible in larger samples that would otherwise rely on self-report data alone. Funding Sources NIH R25HL124208. JSPS KAKENHI 18K18146. Meiji Yasuda Foundation of Health and Welfare 2019.
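
The double-sampling-with-MI idea can be sketched in a simplified form: EI_DLW is observed only on a subsample while EI_SR is observed for everyone; missing EI_DLW values are imputed from a regression on EI_SR with residual noise added, and the weight-on-EI_DLW slope is pooled across imputations. This is a stand-in under stated assumptions, with hypothetical variable names; a full analysis would also pool variances via Rubin's rules.

```python
# Simplified sketch of double sampling + multiple imputation. The
# regression-with-noise imputation model and the pooling (mean of
# per-imputation slopes) are illustrative assumptions.

import random
import statistics

def ols_slope(x, y):
    """Simple-regression slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def mi_slope(ei_sr, ei_dlw, outcome, m=20, seed=0):
    """Pooled slope of outcome on EI_DLW; `ei_dlw` uses None for missing."""
    rng = random.Random(seed)
    observed = [(s, d) for s, d in zip(ei_sr, ei_dlw) if d is not None]
    xs, ys = zip(*observed)
    # Imputation model: EI_DLW ~ EI_SR, fit on the doubly sampled subset.
    b = ols_slope(xs, ys)
    a = statistics.fmean(ys) - b * statistics.fmean(xs)
    resid_sd = statistics.stdev([y - (a + b * x) for x, y in observed])
    slopes = []
    for _ in range(m):
        # Fill each missing EI_DLW with a prediction plus residual noise.
        imputed = [d if d is not None else a + b * s + rng.gauss(0, resid_sd)
                   for s, d in zip(ei_sr, ei_dlw)]
        slopes.append(ols_slope(imputed, outcome))
    return statistics.fmean(slopes)
```

Adding residual noise to each imputation (rather than filling in the bare regression prediction) is what keeps the pooled slope from being artificially over-precise.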


BMJ Open ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. e045250
Author(s):  
Mike Bracher ◽  
Banyana C Madi-Segwagwe ◽  
Emma Winstanley ◽  
Helen Gillan ◽  
Tracy Long-Sutehall

Objectives A long-standing undersupply of eye tissue exists both in the UK and globally, and the UK National Health Service Blood and Transplant Service (NHSBT) has called for further research exploring barriers to eye donation. This study aims to: (1) describe reported reasons for non-donation of eye tissue from solid organ donors in the UK between 1 April 2014 and 31 March 2017 and (2) discuss these findings with respect to existing theories relating to non-donation of eyes by family members. Design Secondary analysis of a national primary data set of recorded reasons for non-donation of eyes from 2790 potential solid organ donors. Data analysis included descriptive statistics and qualitative content analysis of free-text data for 126 recorded cases of family decline of eye donation. Setting National data set covering solid organ donation (secondary care). Participants 2790 potential organ donors were assessed for eye donation eligibility between 1 April 2014 and 31 March 2017. Results Reasons for non-retrieval of eyes were recorded as: family wishes (n=1339, 48% of total cases); medical reasons (n=841, 30%); and deceased wishes (n=180, 7%). In more than 50% of recorded cases, reasons for non-donation were based on the family’s knowledge of the deceased’s wishes, their perception of the deceased’s wishes, and specific concerns regarding the processes or effects of eye donation on the deceased’s body. Findings are discussed with respect to existing theoretical perspectives. Conclusion Eye donation involves distinct psychological and sociocultural factors for families and healthcare professionals (HCPs) that have not been fully explored in research or integrated into service design. 
We propose areas for future research and service development, including the potential of retrieving only corneal discs, as opposed to whole eyes, to reduce disfigurement concerns; public education regarding donation processes; exploration of how request processes may influence acceptance of eye donation; and procedures for assessing familial responses to information provided during consent conversations.


10.2196/23930 ◽  
2020 ◽  
Vol 8 (11) ◽  
pp. e23930
Author(s):  
Tjardo D Maarseveen ◽  
Timo Meinderink ◽  
Marcel J T Reinders ◽  
Johannes Knitza ◽  
Tom W J Huizinga ◽  
...  



2020 ◽  
pp. 383-391 ◽  
Author(s):  
Yalun Li ◽  
Yung-Hung Luo ◽  
Jason A. Wampfler ◽  
Samuel M. Rubinstein ◽  
Firat Tiryaki ◽  
...  

PURPOSE Electronic health records (EHRs) are created primarily for nonresearch purposes; thus, the amounts of data are enormous and the data are crude, heterogeneous, incomplete, and largely unstructured, presenting challenges to effective analysis for timely, reliable results. In particular, research dealing with clinical notes relevant to patient care and outcomes has seldom been conducted, owing to the complexity of data extraction and accurate annotation. RECIST is a set of widely accepted research criteria for evaluating tumor response in patients undergoing antineoplastic therapy. The aim of this study was to identify textual sources for RECIST information in EHRs and to develop a corpus of pharmacotherapy and response entities for the development of natural language processing tools. METHODS We focused on pharmacotherapies and patient responses, using 55,120 medical notes (n = 72 types) in Mayo Clinic’s EHRs from 622 randomly selected patients who signed authorization for research. Using the Multidocument Annotation Environment tool, we applied and evaluated predefined keywords, along with time-interval and note-type filters, to identify RECIST information and establish a gold standard data set for patient outcome research. RESULTS Keyword filtering reduced the clinical notes to 37,406, and restricting to four note types within 12 months postdiagnosis further reduced the number to 5,005 notes, which were manually annotated and covered 97.9% of all cases (n = 609 of 622). The resulting data set of 609 cases (n = 503 for training and n = 106 for validation) contains 736 fully annotated, deidentified clinical notes, with pharmacotherapies and four response end points: complete response, partial response, stable disease, and progressive disease. This resource is readily expandable to specific drugs, regimens, and most solid tumors. 
CONCLUSION We have established a gold standard data set to support the development of biomedical informatics tools that accelerate research into antineoplastic therapeutic response.
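
The note-reduction step described above, keeping only notes of selected types, written within a window after diagnosis, that mention at least one predefined keyword, can be sketched as a simple filter. The record field names, keyword list, and the 12-month window default are illustrative assumptions.

```python
# Sketch of keyword + note-type + time-window filtering to shrink a note
# collection before manual annotation. Field names are hypothetical.

from datetime import date

def filter_notes(notes, diagnosis_dates, keywords, note_types,
                 window_days=365):
    """notes: dicts with patient_id, type, date (datetime.date), text.
    Keeps notes of an allowed type, within `window_days` after the
    patient's diagnosis, containing at least one keyword."""
    kept = []
    for note in notes:
        days_since_dx = (note["date"]
                         - diagnosis_dates[note["patient_id"]]).days
        if (note["type"] in note_types
                and 0 <= days_since_dx <= window_days
                and any(k in note["text"].lower() for k in keywords)):
            kept.append(note)
    return kept
```

Applied to a full EHR export, a filter like this yields the much smaller, higher-yield note set that annotators then review by hand.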

