De-identification of patient notes with recurrent neural networks

2016 ◽  
Vol 24 (3) ◽  
pp. 596-606 ◽  
Author(s):  
Franck Dernoncourt ◽  
Ji Young Lee ◽  
Ozlem Uzuner ◽  
Peter Szolovits

Objective: Patient notes in electronic health records (EHRs) may contain critical information for medical investigations. However, the vast majority of medical investigators can only access de-identified notes, in order to protect the confidentiality of patients. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defines 18 types of protected health information that needs to be removed to de-identify patient notes. Manual de-identification is impractical given the size of electronic health record databases, the limited number of researchers with access to non-de-identified notes, and the frequent mistakes of human annotators. A reliable automated de-identification system would consequently be of high value. Materials and Methods: We introduce the first de-identification system based on artificial neural networks (ANNs), which requires no handcrafted features or rules, unlike existing systems. We compare the performance of the system with state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset, which is the largest publicly available de-identification dataset, and the MIMIC de-identification dataset, which we assembled and is twice as large as the i2b2 2014 dataset. Results: Our ANN model outperforms the state-of-the-art systems. It yields an F1-score of 97.85 on the i2b2 2014 dataset, with a recall of 97.38 and a precision of 98.32, and an F1-score of 99.23 on the MIMIC de-identification dataset, with a recall of 99.25 and a precision of 99.21. Conclusion: Our findings support the use of ANNs for de-identification of patient notes, as they show better performance than previously published systems while requiring no manual feature engineering.
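The reported F1-scores can be checked against the reported precision and recall, since F1 is the harmonic mean of the two. The following is a minimal sketch; the function name `f1_score` is chosen here for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (percent scale).
i2b2 = f1_score(precision=98.32, recall=97.38)
mimic = f1_score(precision=99.21, recall=99.25)

print(round(i2b2, 2), round(mimic, 2))
```

Rounded to two decimals, both values agree with the F1-scores stated in the abstract (97.85 and 99.23).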

Drug Safety ◽  
2019 ◽  
Vol 42 (1) ◽  
pp. 113-122 ◽  
Author(s):  
Susmitha Wunnava ◽  
Xiao Qin ◽  
Tabassum Kakar ◽  
Cansu Sen ◽  
Elke A. Rundensteiner ◽  
...  

2018 ◽  
Vol 26 (3) ◽  
pp. 262-268 ◽  
Author(s):  
Yifu Li ◽  
Ran Jin ◽  
Yuan Luo

Abstract We propose to use segment graph convolutional and recurrent neural networks (Seg-GCRNs), which use only word embedding and sentence syntactic dependencies, to classify relations from clinical notes without manual feature engineering. In this study, the relations between 2 medical concepts are classified by simultaneously learning representations of text segments in the context of sentence syntactic dependency: preceding, concept1, middle, concept2, and succeeding segments. Seg-GCRN was systematically evaluated on the i2b2/VA relation classification challenge datasets. Experiments show that Seg-GCRN attains state-of-the-art micro-averaged F-measure for all 3 relation categories: 0.692 for classifying medical treatment–problem relations, 0.827 for medical test–problem relations, and 0.741 for medical problem–medical problem relations. Comparison with the previous state-of-the-art segment convolutional neural network (Seg-CNN) suggests that adding syntactic dependency information helps refine medical word embedding and improves concept relation classification without manual feature engineering. Seg-GCRN can be trained efficiently for the i2b2/VA dataset on a GPU platform.
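For readers unfamiliar with the metric, a micro-averaged F-measure pools true positives, false positives, and false negatives across all classes before computing precision and recall. The sketch below illustrates that definition only; the class names and counts are hypothetical and are not results from the paper.

```python
from typing import Dict


def micro_f_measure(counts: Dict[str, Dict[str, int]]) -> float:
    """Micro-averaged F-measure from per-class tp/fp/fn counts."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two relation classes (illustration only).
example_counts = {
    "class_a": {"tp": 8, "fp": 2, "fn": 2},
    "class_b": {"tp": 2, "fp": 3, "fn": 3},
}
example_score = micro_f_measure(example_counts)
```

Because counts are pooled first, frequent classes weigh more heavily in the score than rare ones.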


2020 ◽  
Author(s):  
Dean Sumner ◽  
Jiazhen He ◽  
Amol Thakkar ◽  
Ola Engkvist ◽  
Esben Jannik Bjerrum

SMILES randomization, a form of data augmentation, has previously been shown to increase the performance of deep learning models compared to non-augmented baselines. Here, we propose a novel data augmentation method we call “Levenshtein augmentation”, which considers local SMILES sub-sequence similarity between reactants and their respective products when creating training pairs. The performance of Levenshtein augmentation was tested using two state-of-the-art models: a transformer and a sequence-to-sequence recurrent neural network with attention. Levenshtein augmentation demonstrated increased performance over both non-augmented data and conventionally SMILES-randomized data when used for training of baseline models. Furthermore, Levenshtein augmentation seemingly results in what we define as attentional gain: an enhancement in the pattern recognition capabilities of the underlying network to molecular motifs.
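To make the idea of string similarity between reactant and product SMILES concrete, the sketch below computes a whole-string Levenshtein (edit) distance and picks the closest reactant/product pair from lists of equivalent SMILES strings. This is an assumption-laden illustration, not the authors' procedure: the abstract describes local sub-sequence similarity, whereas this uses plain edit distance over full strings, and the helper name `closest_pair` and the hand-written SMILES variants are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_pair(reactant_variants: List[str],
                 product_variants: List[str]) -> Tuple[str, str]:
    """Return the (reactant, product) strings with the smallest edit distance."""
    return min(product(reactant_variants, product_variants),
               key=lambda pair: levenshtein(*pair))


# Hand-written equivalent SMILES for ethanol and acetaldehyde (illustration only).
pair = closest_pair(["CCO", "OCC"], ["CC=O", "O=CC"])
```

In practice the variants would come from a SMILES randomization tool rather than being written by hand.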


2020 ◽  
Vol 7 (Supplement_1) ◽  
pp. S819-S820
Author(s):  
Jonathan Todd ◽  
Jon Puro ◽  
Matthew Jones ◽  
Jee Oakley ◽  
Laura A Vonnahme ◽  
...  

Abstract Background Over 80% of tuberculosis (TB) cases in the United States are attributed to reactivation of latent TB infection (LTBI). Eliminating TB in the United States requires expanding identification and treatment of LTBI. Centralized electronic health records (EHRs) are an unexplored data source to identify persons with LTBI. We explored EHR data to evaluate TB and LTBI screening and diagnoses within OCHIN, Inc., a U.S. practice-based research network with a high proportion of Federally Qualified Health Centers. Methods From the EHRs of patients who had an encounter at an OCHIN member clinic between January 1, 2012 and December 31, 2016, we extracted demographic variables, TB risk factors, TB screening tests, International Classification of Diseases (ICD) 9 and 10 codes, and treatment regimens. Based on test results, ICD codes, and treatment regimens, we developed a novel algorithm to classify patient records into LTBI categories: definite, probable, or possible. We used multivariable logistic regression, with a referent group of all cohort patients not classified as having LTBI or TB, to identify associations between TB risk factors and LTBI. Results Among 2,190,686 patients, 6.9% (n=151,195) had a TB screening test; among those, 8% tested positive. Non-U.S.–born or non-English–speaking persons comprised 24% of our cohort; 11% were tested for TB infection, and 14% had a positive test. Risk factors in the multivariable model significantly associated with being classified as having LTBI included preferring a non-English language (adjusted odds ratio [aOR] 4.20, 95% confidence interval [CI] 4.09–4.32); non-Hispanic Asian (aOR 5.17, 95% CI 4.94–5.40), non-Hispanic black (aOR 3.02, 95% CI 2.91–3.13), or Native Hawaiian/other Pacific Islander (aOR 3.35, 95% CI 2.92–3.84) race; and HIV infection (aOR 3.09, 95% CI 2.84–3.35).
Conclusion This study demonstrates the utility of EHR data for understanding TB screening practices and as an important data source that can be used to enhance public health surveillance of LTBI prevalence. Increasing screening among high-risk populations remains an important step toward eliminating TB in the United States. These results underscore the importance of offering TB screening in non-U.S.–born populations. Disclosures All Authors: No reported disclosures


2018 ◽  
Vol 136 (2) ◽  
pp. 164 ◽  
Author(s):  
Michele C. Lim ◽  
Michael V. Boland ◽  
Colin A. McCannel ◽  
Arvind Saini ◽  
Michael F. Chiang ◽  
...  

2021 ◽  
Vol 12 (04) ◽  
pp. 816-825
Author(s):  
Yingcheng Sun ◽  
Alex Butler ◽  
Ibrahim Diallo ◽  
Jae Hyun Kim ◽  
Casey Ta ◽  
...  

Abstract Background Clinical trials are the gold standard for generating robust medical evidence, but clinical trial results often raise generalizability concerns, which can be attributed to a lack of population representativeness. Electronic health record (EHR) data are useful for estimating the population representativeness of clinical trial study populations. Objectives This research aims to systematically estimate the population representativeness of clinical trials using EHR data during the early design stage. Methods We present an end-to-end analytical framework for transforming free-text clinical trial eligibility criteria into executable database queries conformant with the Observational Medical Outcomes Partnership Common Data Model and for systematically quantifying the population representativeness of each clinical trial. Results Using this framework, we calculated the population representativeness of 782 novel coronavirus disease 2019 (COVID-19) trials and 3,827 type 2 diabetes mellitus (T2DM) trials in the United States. Owing to overly restrictive eligibility criteria, 85.7% of the COVID-19 trials and 30.1% of the T2DM trials had poor population representativeness. Conclusion This research demonstrates the potential of using EHR data to assess the population representativeness of clinical trials, providing data-driven metrics to inform the selection and optimization of eligibility criteria.


2017 ◽  
Vol 25 (1) ◽  
pp. 93-98 ◽  
Author(s):  
Yuan Luo ◽  
Yu Cheng ◽  
Özlem Uzuner ◽  
Peter Szolovits ◽  
Justin Starren

Abstract We propose Segment Convolutional Neural Networks (Seg-CNNs) for classifying relations from clinical notes. Seg-CNNs use only word-embedding features without manual feature engineering. Unlike typical CNN models, relations between 2 concepts are identified by simultaneously learning separate representations for text segments in a sentence: preceding, concept1, middle, concept2, and succeeding. We evaluate Seg-CNN on the i2b2/VA relation classification challenge dataset. We show that Seg-CNN achieves a state-of-the-art micro-average F-measure of 0.742 for overall evaluation, 0.686 for classifying medical problem–treatment relations, 0.820 for medical problem–test relations, and 0.702 for medical problem–medical problem relations. We demonstrate the benefits of learning segment-level representations. We show that medical domain word embeddings help improve relation classification. Seg-CNNs can be trained quickly for the i2b2/VA dataset on a graphics processing unit (GPU) platform. These results support the use of CNNs computed over segments of text for classifying medical relations, as they show state-of-the-art performance while requiring no manual feature engineering.
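The five-segment decomposition described above can be sketched as a simple slicing step over a tokenized sentence. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `split_segments`, the token-index span convention (end exclusive), and the example sentence are all hypothetical, and any context-window limits the paper may apply are not modeled.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def split_segments(tokens: List[str], c1: Span, c2: Span) -> Dict[str, List[str]]:
    """Split a sentence into the five segments around two concept spans."""
    if c1[0] > c2[0]:
        c1, c2 = c2, c1  # make sure concept1 comes first in the sentence
    if c1[1] > c2[0]:
        raise ValueError("concept spans must not overlap")
    return {
        "preceding": tokens[:c1[0]],
        "concept1": tokens[c1[0]:c1[1]],
        "middle": tokens[c1[1]:c2[0]],
        "concept2": tokens[c2[0]:c2[1]],
        "succeeding": tokens[c2[1]:],
    }


# Hypothetical sentence with a treatment concept and a problem concept.
tokens = "The patient was given aspirin for chest pain yesterday".split()
segments = split_segments(tokens, c1=(4, 5), c2=(6, 8))
```

Each segment would then be mapped to word embeddings and fed to its own representation learner before the relation is classified.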


2020 ◽  
Vol 159 (6) ◽  
pp. 2221-2225.e6 ◽  
Author(s):  
Shailendra Singh ◽  
Mohammad Bilal ◽  
Haig Pakhchanian ◽  
Rahul Raiker ◽  
Gursimran S. Kochhar ◽  
...  
