De-identification of patient notes with recurrent neural networks

Franck Dernoncourt; Ji Young Lee; Ozlem Uzuner; Peter Szolovits

doi:10.1093/jamia/ocw156

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocw156 ◽

2016 ◽

Vol 24 (3) ◽

pp. 596-606 ◽

Cited By ~ 52

Author(s):

Franck Dernoncourt ◽

Ji Young Lee ◽

Ozlem Uzuner ◽

Peter Szolovits

Keyword(s):

Neural Networks ◽

Recurrent Neural Networks ◽

State Of The Art ◽

The United States ◽

Identification System ◽

Feature Engineering ◽

Protected Health Information ◽

Ann Model ◽

Health Records ◽

Electronic Health

Objective: Patient notes in electronic health records (EHRs) may contain critical information for medical investigations. However, the vast majority of medical investigators can only access de-identified notes, in order to protect the confidentiality of patients. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defines 18 types of protected health information that needs to be removed to de-identify patient notes. Manual de-identification is impractical given the size of electronic health record databases, the limited number of researchers with access to non-de-identified notes, and the frequent mistakes of human annotators. A reliable automated de-identification system would consequently be of high value.

Materials and Methods: We introduce the first de-identification system based on artificial neural networks (ANNs), which requires no handcrafted features or rules, unlike existing systems. We compare the performance of the system with state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset, which is the largest publicly available de-identification dataset, and the MIMIC de-identification dataset, which we assembled and is twice as large as the i2b2 2014 dataset.

Results: Our ANN model outperforms the state-of-the-art systems. It yields an F1-score of 97.85 on the i2b2 2014 dataset, with a recall of 97.38 and a precision of 98.32, and an F1-score of 99.23 on the MIMIC de-identification dataset, with a recall of 99.25 and a precision of 99.21.

Conclusion: Our findings support the use of ANNs for de-identification of patient notes, as they show better performance than previously published systems while requiring no manual feature engineering.
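
The F1-scores quoted in the Results can be checked against the reported precision and recall, since F1 is the harmonic mean of the two. The following is a minimal sketch of that arithmetic; the function name `f1_score` is chosen here for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
i2b2_f1 = f1_score(precision=98.32, recall=97.38)
mimic_f1 = f1_score(precision=99.21, recall=99.25)

print(round(i2b2_f1, 2))   # 97.85
print(round(mimic_f1, 2))  # 99.23
```

Both values agree with the F1-scores stated in the abstract after rounding to two decimals.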

Adverse Drug Event Detection from Electronic Health Records Using Hierarchical Recurrent Neural Networks with Dual-Level Embedding

Drug Safety ◽

10.1007/s40264-018-0765-9 ◽

2019 ◽

Vol 42 (1) ◽

pp. 113-122 ◽

Cited By ~ 11

Author(s):

Susmitha Wunnava ◽

Xiao Qin ◽

Tabassum Kakar ◽

Cansu Sen ◽

Elke A. Rundensteiner ◽

...

Keyword(s):

Neural Networks ◽

Electronic Health Records ◽

Adverse Drug Event ◽

Event Detection ◽

Recurrent Neural Networks ◽

Drug Event ◽

Health Records ◽

Electronic Health

Classifying relations in clinical narratives using segment graph convolutional and recurrent neural networks (Seg-GCRNs)

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocy157 ◽

2018 ◽

Vol 26 (3) ◽

pp. 262-268 ◽

Cited By ~ 13

Author(s):

Yifu Li ◽

Ran Jin ◽

Yuan Luo

Keyword(s):

Neural Networks ◽

Recurrent Neural Networks ◽

State Of The Art ◽

Medical Problem ◽

Word Embedding ◽

Feature Engineering ◽

Medical Test ◽

Relation Classification ◽

Syntactic Dependencies ◽

Syntactic Dependency

Abstract We propose to use segment graph convolutional and recurrent neural networks (Seg-GCRNs), which use only word embedding and sentence syntactic dependencies, to classify relations from clinical notes without manual feature engineering. In this study, the relations between 2 medical concepts are classified by simultaneously learning representations of text segments in the context of sentence syntactic dependency: preceding, concept1, middle, concept2, and succeeding segments. Seg-GCRN was systematically evaluated on the i2b2/VA relation classification challenge datasets. Experiments show that Seg-GCRN attains state-of-the-art micro-averaged F-measure for all 3 relation categories: 0.692 for classifying medical treatment–problem relations, 0.827 for medical test–problem relations, and 0.741 for medical problem–medical problem relations. Comparison with the previous state-of-the-art segment convolutional neural network (Seg-CNN) suggests that adding syntactic dependency information helps refine medical word embedding and improves concept relation classification without manual feature engineering. Seg-GCRN can be trained efficiently for the i2b2/VA dataset on a GPU platform.
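
The five-segment view of a sentence described above (preceding, concept1, middle, concept2, succeeding) can be sketched as a simple token split. This is only an illustration of the segmentation idea, not the authors' preprocessing: the function `split_segments`, the span convention (token indices, end exclusive), and the example sentence are all assumptions made here.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def split_segments(
    tokens: List[str], concept1: Span, concept2: Span
) -> Dict[str, List[str]]:
    """Split a tokenized sentence into five segments around two concepts."""
    # Order the two spans by position so "concept1" is the earlier one.
    (s1, e1), (s2, e2) = sorted([concept1, concept2])
    if not (0 <= s1 < e1 <= s2 < e2 <= len(tokens)):
        raise ValueError("spans must be non-empty, non-overlapping, in range")
    return {
        "preceding": tokens[:s1],
        "concept1": tokens[s1:e1],
        "middle": tokens[e1:s2],
        "concept2": tokens[s2:e2],
        "succeeding": tokens[e2:],
    }


# Invented example sentence with a treatment and a problem concept.
tokens = "The patient was given aspirin for chest pain yesterday".split()
segments = split_segments(tokens, concept1=(4, 5), concept2=(6, 8))
for name, words in segments.items():
    print(name, words)
```

In a segment-based model, each of these five token lists would then be encoded separately before the representations are combined for relation classification.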

Levenshtein Augmentation Improves Performance of SMILES Based Deep-Learning Synthesis Prediction

10.26434/chemrxiv.12562121 ◽

2020 ◽

Author(s):

Dean Sumner ◽

Jiazhen He ◽

Amol Thakkar ◽

Ola Engkvist ◽

Esben Jannik Bjerrum

Keyword(s):

Neural Networks ◽

Pattern Recognition ◽

Deep Learning ◽

Recurrent Neural Networks ◽

Data Augmentation ◽

State Of The Art ◽

Sequence Similarity ◽

Learning Models ◽

Underlying Network

SMILES randomization, a form of data augmentation, has previously been shown to increase the performance of deep learning models compared to non-augmented baselines. Here, we propose a novel data augmentation method we call "Levenshtein augmentation", which considers local SMILES sub-sequence similarity between reactants and their respective products when creating training pairs. The performance of Levenshtein augmentation was tested using two state-of-the-art models: a transformer and a sequence-to-sequence recurrent neural network with attention. Levenshtein augmentation demonstrated increased performance over non-augmented data and over data augmented with conventional SMILES randomization when used for training of baseline models. Furthermore, Levenshtein augmentation seemingly results in what we define as attentional gain, an enhancement in the pattern recognition capabilities of the underlying network with respect to molecular motifs.
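
For readers unfamiliar with the metric behind the method's name, the following is a minimal sketch of the standard character-level Levenshtein (edit) distance applied to two SMILES strings. It shows only the underlying distance, not the authors' augmentation procedure or how they select training pairs; the function name and the example molecules are chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


# Ethanol vs. acetaldehyde as SMILES strings: one inserted "=".
print(levenshtein("CCO", "CC=O"))  # 1
```

A lower distance between a reactant string and a product string indicates that the two SMILES representations share more of their character sequence.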

Privacy preserving neural networks for electronic health records de-identification

Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics ◽

10.1145/3459930.3469555 ◽

2021 ◽

Author(s):

Tanbir Ahmed ◽

Md Momin Al Aziz ◽

Noman Mohammed ◽

Xiaoqian Jiang

Keyword(s):

Neural Networks ◽

Electronic Health Records ◽

Privacy Preserving ◽

Health Records ◽

Electronic Health

1665. Using Electronic Health Records to Describe TB in Community Health Settings: a Cohort Analysis in a Large Safety-Net Population

Open Forum Infectious Diseases ◽

10.1093/ofid/ofaa439.1843 ◽

2020 ◽

Vol 7 (Supplement_1) ◽

pp. S819-S820

Author(s):

Jonathan Todd ◽

Jon Puro ◽

Matthew Jones ◽

Jee Oakley ◽

Laura A Vonnahme ◽

...

Keyword(s):

United States ◽

Risk Factors ◽

Electronic Health Records ◽

Safety Net ◽

The United States ◽

Research Network ◽

Health Records ◽

Treatment Regimens ◽

Data Source ◽

Electronic Health

Abstract

Background: Over 80% of tuberculosis (TB) cases in the United States are attributed to reactivation of latent TB infection (LTBI). Eliminating TB in the United States requires expanding identification and treatment of LTBI. Centralized electronic health records (EHRs) are an unexplored data source to identify persons with LTBI. We explored EHR data to evaluate TB and LTBI screening and diagnoses within OCHIN, Inc., a U.S. practice-based research network with a high proportion of Federally Qualified Health Centers.

Methods: From the EHRs of patients who had an encounter at an OCHIN member clinic between January 1, 2012 and December 31, 2016, we extracted demographic variables, TB risk factors, TB screening tests, International Classification of Diseases (ICD) 9 and 10 codes, and treatment regimens. Based on test results, ICD codes, and treatment regimens, we developed a novel algorithm to classify patient records into LTBI categories: definite, probable, or possible. We used multivariable logistic regression, with a referent group of all cohort patients not classified as having LTBI or TB, to identify associations between TB risk factors and LTBI.

Results: Among 2,190,686 patients, 6.9% (n=151,195) had a TB screening test; among those, 8% tested positive. Non-U.S.–born or non-English–speaking persons comprised 24% of our cohort; 11% were tested for TB infection, and 14% had a positive test. Risk factors in the multivariable model significantly associated with being classified as having LTBI included preferring non-English language (adjusted odds ratio [aOR] 4.20, 95% confidence interval [CI] 4.09–4.32); non-Hispanic Asian (aOR 5.17, 95% CI 4.94–5.40), non-Hispanic black (aOR 3.02, 95% CI 2.91–3.13), or Native Hawaiian/other Pacific Islander (aOR 3.35, 95% CI 2.92–3.84) race; and HIV infection (aOR 3.09, 95% CI 2.84–3.35).

Conclusion: This study demonstrates the utility of EHR data for understanding TB screening practices and as an important data source that can be used to enhance public health surveillance of LTBI prevalence. Increasing screening among high-risk populations remains an important step toward eliminating TB in the United States. These results underscore the importance of offering TB screening in non-U.S.–born populations.

Disclosures: All Authors: No reported disclosures

Adoption of Electronic Health Records and Perceptions of Financial and Clinical Outcomes Among Ophthalmologists in the United States

JAMA Ophthalmology ◽

10.1001/jamaophthalmol.2017.5978 ◽

2018 ◽

Vol 136 (2) ◽

pp. 164 ◽

Cited By ~ 22

Author(s):

Michele C. Lim ◽

Michael V. Boland ◽

Colin A. McCannel ◽

Arvind Saini ◽

Michael F. Chiang ◽

...

Keyword(s):

United States ◽

Electronic Health Records ◽

Clinical Outcomes ◽

The United States ◽

Health Records ◽

Electronic Health

A Framework for Systematic Assessment of Clinical Trial Population Representativeness Using Electronic Health Records Data

Applied Clinical Informatics ◽

10.1055/s-0041-1733846 ◽

2021 ◽

Vol 12 (04) ◽

pp. 816-825

Author(s):

Yingcheng Sun ◽

Alex Butler ◽

Ibrahim Diallo ◽

Jae Hyun Kim ◽

Casey Ta ◽

...

Keyword(s):

Clinical Trial ◽

Clinical Trials ◽

Electronic Health Records ◽

The United States ◽

Design Stage ◽

Common Data Model ◽

Free Text ◽

Eligibility Criteria ◽

Health Records ◽

Electronic Health

Abstract

Background: Clinical trials are the gold standard for generating robust medical evidence, but clinical trial results often raise generalizability concerns, which can be attributed to a lack of population representativeness. Electronic health record (EHR) data are useful for estimating the population representativeness of clinical trial study populations.

Objectives: This research aims to estimate the population representativeness of clinical trials systematically using EHR data during the early design stage.

Methods: We present an end-to-end analytical framework for transforming free-text clinical trial eligibility criteria into executable database queries conformant with the Observational Medical Outcomes Partnership Common Data Model and for systematically quantifying the population representativeness of each clinical trial.

Results: Using this framework, we calculated the population representativeness of 782 novel coronavirus disease 2019 (COVID-19) trials and 3,827 type 2 diabetes mellitus (T2DM) trials in the United States. With the use of overly restrictive eligibility criteria, 85.7% of the COVID-19 trials and 30.1% of the T2DM trials had poor population representativeness.

Conclusion: This research demonstrates the potential of using EHR data to assess the population representativeness of clinical trials, providing data-driven metrics to inform the selection and optimization of eligibility criteria.

Economic Burden Of Locally Advanced Or Metastatic Merkel Cell Carcinoma In The United States: An Analysis Of Electronic Health Records

Value in Health ◽

10.1016/j.jval.2018.04.260 ◽

2018 ◽

Vol 21 ◽

pp. S45

Author(s):

M Kearney ◽

D Esposito ◽

J Penalvo ◽

L Russo ◽

R Yin ◽

...

Keyword(s):

United States ◽

Electronic Health Records ◽

Cell Carcinoma ◽

Merkel Cell Carcinoma ◽

Economic Burden ◽

Locally Advanced ◽

The United States ◽

Merkel Cell ◽

Health Records ◽

Electronic Health

Segment convolutional neural networks (Seg-CNNs) for classifying relations in clinical notes

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocx090 ◽

2017 ◽

Vol 25 (1) ◽

pp. 93-98 ◽

Cited By ~ 31

Author(s):

Yuan Luo ◽

Yu Cheng ◽

Özlem Uzuner ◽

Peter Szolovits ◽

Justin Starren

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

State Of The Art ◽

Graphics Processing Unit ◽

Medical Problem ◽

Feature Engineering ◽

Processing Unit ◽

Clinical Notes ◽

Overall Evaluation ◽

Relation Classification

Abstract We propose Segment Convolutional Neural Networks (Seg-CNNs) for classifying relations from clinical notes. Seg-CNNs use only word-embedding features without manual feature engineering. Unlike typical CNN models, relations between 2 concepts are identified by simultaneously learning separate representations for text segments in a sentence: preceding, concept1, middle, concept2, and succeeding. We evaluate Seg-CNN on the i2b2/VA relation classification challenge dataset. We show that Seg-CNN achieves a state-of-the-art micro-average F-measure of 0.742 for overall evaluation, 0.686 for classifying medical problem–treatment relations, 0.820 for medical problem–test relations, and 0.702 for medical problem–medical problem relations. We demonstrate the benefits of learning segment-level representations. We show that medical domain word embeddings help improve relation classification. Seg-CNNs can be trained quickly for the i2b2/VA dataset on a graphics processing unit (GPU) platform. These results support the use of CNNs computed over segments of text for classifying medical relations, as they show state-of-the-art performance while requiring no manual feature engineering.
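
The micro-averaged F-measure used for the overall evaluation pools true positives, false positives, and false negatives across all relation classes before computing precision and recall. The sketch below shows that pooling under stated assumptions: the function `micro_f1`, the class labels, and the counts are invented for illustration and are not results from the paper.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one class
Counts = Tuple[int, int, int]


def micro_f1(per_class: Dict[str, Counts]) -> float:
    """Micro-averaged F-measure: pool counts over classes, then compute F1."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy counts for two hypothetical relation classes.
toy_counts = {"class_a": (8, 2, 2), "class_b": (2, 1, 3)}
print(round(micro_f1(toy_counts), 3))  # 0.714
```

Because counts are pooled, classes with more instances weigh more heavily in the micro-average than they would in a per-class (macro) average.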

Impact of Obesity on Outcomes of Patients With Coronavirus Disease 2019 in the United States: A Multicenter Electronic Health Records Network Study

Gastroenterology ◽

10.1053/j.gastro.2020.08.028 ◽

2020 ◽

Vol 159 (6) ◽

pp. 2221-2225.e6 ◽

Cited By ~ 3

Author(s):

Shailendra Singh ◽

Mohammad Bilal ◽

Haig Pakhchanian ◽

Rahul Raiker ◽

Gursimran S. Kochhar ◽

...

Keyword(s):

United States ◽

Electronic Health Records ◽

The United States ◽

Health Records ◽

Electronic Health
