Automating classification of free-text electronic health records for epidemiological studies

Martijn J. Schuemie; Emine Sen; Geert W. 't Jong; Eva M. Soest; Miriam C. Sturkenboom; Jan A. Kors

doi:10.1002/pds.3205

Incorporating natural language processing to improve classification of axial spondyloarthritis using electronic health records

Rheumatology ◽

10.1093/rheumatology/kez375 ◽

2019 ◽

Vol 59 (5) ◽

pp. 1059-1065 ◽

Cited By ~ 1

Author(s):

Sizheng Steven Zhao ◽

Chuan Hong ◽

Tianrun Cai ◽

Chang Xu ◽

Jie Huang ◽

...

Keyword(s):

Electronic Health Records ◽

Predictive Value ◽

Area Under The Curve ◽

Free Text ◽

Text Data ◽

Health Records ◽

Disease Concepts ◽

Icd Codes ◽

Electronic Health

Abstract Objectives: To develop classification algorithms that accurately identify axial SpA (axSpA) patients in electronic health records, and compare the performance of algorithms incorporating free-text data against approaches using only International Classification of Diseases (ICD) codes. Methods: An enriched cohort of 7853 eligible patients was created from electronic health records of two large hospitals using automated searches (⩾1 ICD codes combined with simple text searches). Key disease concepts from free-text data were extracted using NLP and combined with ICD codes to develop algorithms. We created both supervised regression-based algorithms (on a training set of 127 axSpA cases and 423 non-cases) and unsupervised algorithms to identify patients with high probability of having axSpA from the enriched cohort. Their performance was compared against classifications using ICD codes only. Results: NLP extracted four disease concepts of high predictive value: ankylosing spondylitis, sacroiliitis, HLA-B27 and spondylitis. The unsupervised algorithm, incorporating both the NLP concept and ICD code for AS, identified the greatest number of patients. By setting the probability threshold to attain 80% positive predictive value, it identified 1509 axSpA patients (mean age 53 years, 71% male). Sensitivity was 0.78, specificity 0.94 and area under the curve 0.93. The two supervised algorithms performed similarly but identified fewer patients. All three outperformed traditional approaches using ICD codes alone (area under the curve 0.80–0.87). Conclusion: Algorithms incorporating free-text data can accurately identify axSpA patients in electronic health records. Large cohorts identified using these novel methods offer exciting opportunities for future clinical research.
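
The abstract above mentions setting a probability threshold to attain 80% positive predictive value. A minimal sketch of one way such a threshold could be chosen on a labelled validation set follows; the function name `threshold_for_ppv` and the toy scores are illustrative assumptions, not the authors' procedure.

```python
def threshold_for_ppv(scores, labels, target_ppv=0.80):
    """Return the lowest score threshold at which PPV >= target_ppv, or None.

    scores: predicted probabilities; labels: 1 for a true case, 0 otherwise.
    Patients with score >= threshold are classified as cases.
    Hypothetical helper for illustration only.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Consume every patient tied at this score before evaluating PPV.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / (tp + fp) >= target_ppv:
            best = current  # a lower threshold that still meets the target
    return best


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]  # made-up probabilities
    toy_labels = [1, 1, 0, 1, 0]            # made-up chart-review labels
    print(threshold_for_ppv(toy_scores, toy_labels, 0.80))
```

Taking the lowest qualifying threshold maximises the number of patients identified while still meeting the PPV target; other selection rules are equally plausible.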

A Framework for Systematic Assessment of Clinical Trial Population Representativeness Using Electronic Health Records Data

Applied Clinical Informatics ◽

10.1055/s-0041-1733846 ◽

2021 ◽

Vol 12 (04) ◽

pp. 816-825

Author(s):

Yingcheng Sun ◽

Alex Butler ◽

Ibrahim Diallo ◽

Jae Hyun Kim ◽

Casey Ta ◽

...

Keyword(s):

Clinical Trial ◽

Clinical Trials ◽

Electronic Health Records ◽

The United States ◽

Design Stage ◽

Common Data Model ◽

Free Text ◽

Eligibility Criteria ◽

Health Records ◽

Electronic Health

Abstract Background: Clinical trials are the gold standard for generating robust medical evidence, but clinical trial results often raise generalizability concerns, which can be attributed to the lack of population representativeness. Electronic health record (EHR) data are useful for estimating the population representativeness of a clinical trial's study population. Objectives: This research aims to estimate the population representativeness of clinical trials systematically using EHR data during the early design stage. Methods: We present an end-to-end analytical framework for transforming free-text clinical trial eligibility criteria into executable database queries conformant with the Observational Medical Outcomes Partnership Common Data Model and for systematically quantifying the population representativeness for each clinical trial. Results: We calculated the population representativeness of 782 novel coronavirus disease 2019 (COVID-19) trials and 3,827 type 2 diabetes mellitus (T2DM) trials in the United States, respectively, using this framework. With the use of overly restrictive eligibility criteria, 85.7% of the COVID-19 trials and 30.1% of T2DM trials had poor population representativeness. Conclusion: This research demonstrates the potential of using EHR data to assess the population representativeness of clinical trials, providing data-driven metrics to inform the selection and optimization of eligibility criteria.

Validity of acute cardiovascular outcome diagnoses in European electronic health records: a systematic review protocol

BMJ Open ◽

10.1136/bmjopen-2019-031373 ◽

2019 ◽

Vol 9 (10) ◽

pp. e031373 ◽

Cited By ~ 1

Author(s):

Jennifer Anne Davidson ◽

Amitava Banerjee ◽

Rutendo Muzambi ◽

Liam Smeeth ◽

Charlotte Warren-Gash

Keyword(s):

Systematic Review ◽

Electronic Health Records ◽

Predictive Value ◽

Grey Literature ◽

Cochrane Library ◽

Free Text ◽

Health Records ◽

Coronary Syndrome ◽

Validation Measure ◽

Electronic Health

Introduction: Cardiovascular diseases (CVDs) are among the leading causes of death globally. Electronic health records (EHRs) provide a rich data source for research on CVD risk factors, treatments and outcomes. Researchers must be confident in the validity of diagnoses in EHRs, particularly when diagnosis definitions and use of EHRs change over time. Our systematic review provides an up-to-date appraisal of the validity of stroke, acute coronary syndrome (ACS) and heart failure (HF) diagnoses in European primary and secondary care EHRs. Methods and analysis: We will systematically review the published and grey literature to identify studies validating diagnoses of stroke, ACS and HF in European EHRs. MEDLINE, EMBASE, SCOPUS, Web of Science, Cochrane Library, OpenGrey and EThOS will be searched from the dates of inception to April 2019. A prespecified search strategy of subject headings and free-text terms in the title and abstract will be used. Two reviewers will independently screen titles and abstracts to identify eligible studies, followed by full-text review. We require studies to compare clinical codes with a suitable reference standard. Additionally, at least one validation measure (sensitivity, specificity, positive predictive value or negative predictive value) or raw data, for the calculation of a validation measure, is necessary. We will then extract data from the eligible studies using standardised tables and assess risk of bias in individual studies using the Quality Assessment of Diagnostic Accuracy Studies 2 tool. Data will be synthesised into a narrative format and heterogeneity assessed. Meta-analysis will be considered when a sufficient number of homogeneous studies are available. The overall quality of evidence will be assessed using the Grading of Recommendations, Assessment, Development and Evaluation tool. Ethics and dissemination: This is a systematic review, so it does not require ethical approval. Our results will be submitted for peer-review publication. PROSPERO registration number: CRD42019123898
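
The four validation measures named in the protocol above (sensitivity, specificity, positive predictive value, negative predictive value) can all be derived from the raw counts of a 2×2 comparison between clinical codes and a reference standard. A minimal sketch, assuming hypothetical counts and a hypothetical helper name `validation_measures`:

```python
def validation_measures(tp, fp, fn, tn):
    """Compute validation measures from 2x2 counts.

    tp: coded and confirmed by the reference standard
    fp: coded but not confirmed
    fn: not coded but confirmed
    tn: not coded and not confirmed
    A measure is None when its denominator is zero.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


if __name__ == "__main__":
    # Made-up counts for illustration; not taken from any study listed here.
    for name, value in validation_measures(tp=90, fp=10, fn=30, tn=70).items():
        print(name, value)
```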

De-identifying Free Text of Japanese Dummy Electronic Health Records

10.18653/v1/w18-5608 ◽

2018 ◽

Author(s):

Kohei Kajiyama ◽

Hiromasa Horiguchi ◽

Takashi Okumura ◽

Mizuki Morita ◽

Yoshinobu Kano

Keyword(s):

Electronic Health Records ◽

Free Text ◽

Health Records ◽

Electronic Health

Learning longitudinal patterns and identifying subtypes of pediatric Crohn disease treated with infliximab via trajectory cluster analysis of electronic health records

10.1101/2021.04.14.21255354 ◽

2021 ◽

Author(s):

Andrew Chen ◽

Ronen Stein ◽

Robert N. Baldassano ◽

Jing Huang

Keyword(s):

Electronic Health Records ◽

Disease Activity ◽

Crohn Disease ◽

Cross Sectional ◽

Health Records ◽

Disease Evolution ◽

Current Classification ◽

Electronic Health ◽

Over Time

Abstract Background: The current classification of pediatric CD is mainly based on cross-sectional data. The objective of this study is to identify subgroups of pediatric CD through trajectory cluster analysis of disease activity using data from electronic health records. Methods: We conducted a retrospective study of pediatric CD patients who had been treated with infliximab. The evolution of disease over time was described using trajectory analysis of longitudinal data of C-Reactive Protein (CRP). Patterns of disease evolution were extracted through functional principal components analysis and subgroups were identified based on those patterns using the Gaussian mixture model. We compared patient characteristics, a biomarker for disease activity, received treatments, and long-term surgical outcomes across subgroups. Results: We identified four subgroups of pediatric CD patients with differential relapse-and-remission risk profiles. They had significantly different disease phenotype (p < 0.001), CRP (p < 0.001) and calprotectin (p = 0.037) at diagnosis, with increasing percentage of inflammatory phenotype and declining CRP and fecal calprotectin levels from Subgroup 1 through 4. The risk of colorectal surgery within 10 years after diagnosis was significantly different between groups (p < 0.001). We did not find statistical significance in gender or age at diagnosis across subgroups, but the BMI z-score was slightly smaller in subgroup 1 (p = 0.055). Conclusions: Readily available longitudinal data from electronic health records can be leveraged to provide a deeper characterization of pediatric Crohn disease. The identified subgroups captured novel forms of variation in pediatric Crohn disease that were not explained by baseline measurements and treatment information. Summary: The current classification of pediatric Crohn disease mainly relies on cross-sectional data, e.g., the Paris classification. However, the phenotypic classification may evolve over time after diagnosis. Our study utilized longitudinal measures from the electronic health records and stratified pediatric Crohn disease patients with differential relapse-and-remission risk profiles based on patterns of disease evolution. We found trajectories of well-maintained low disease activity were associated with less severe disease at baseline, early initiation of infliximab treatment, and lower risk of surgery within 10 years of diagnosis, but the difference was not fully explained by phenotype at diagnosis.

Abstract MP21: Feasibility of Electronic Health Records-based community surveillance of cardiovascular disease: Findings from the Atherosclerosis Risk in Communities Study.

Circulation ◽

10.1161/circ.137.suppl_1.mp21 ◽

2018 ◽

Vol 137 (suppl_1) ◽

Author(s):

Brittany M Bogle ◽

Wayne D Rosamond ◽

Aaron R Folsom ◽

Paul Sorlie ◽

Elsayed Z Soliman ◽

...

Keyword(s):

Cardiovascular Disease ◽

Electronic Health Records ◽

Cardiac Biomarkers ◽

Free Text ◽

Health Records ◽

Efficient System ◽

Atherosclerosis Risk In Communities ◽

Atherosclerosis Risk ◽

Electronic Health ◽

Aric Study

Background: Accurate community surveillance of cardiovascular disease requires hospital record abstraction, which is typically a manual process. The costly and time-intensive nature of manual abstraction precludes its use on a regional or national scale in the US. Whether an efficient system can accurately reproduce traditional community surveillance methods by processing electronic health records (EHRs) has not been established. Objective: We sought to develop and test an EHR-based system to reproduce abstraction and classification procedures for acute myocardial infarction (MI) as defined by the Atherosclerosis Risk in Communities (ARIC) Study. Methods: Records from hospitalizations in 2014 within ARIC community surveillance areas were sampled using a broad set of ICD discharge codes likely to harbor MI. These records were manually abstracted by ARIC study personnel and used to classify MI according to ARIC protocols. We requested EHRs in a unified data structure for the same hospitalizations at 6 hospitals and built programs to convert free text and structured data into the ARIC criteria elements necessary for MI classification. Per ARIC protocol, MI was classified based on cardiac biomarkers, cardiac pain, and Minnesota-coded electrocardiogram abnormalities. We compared MI classified from manually abstracted data to (1) EHR-based classification and (2) final ICD-9 coded discharge diagnoses (410-414). Results: These preliminary results are based on hospitalizations from 1 hospital. Of 684 hospitalizations, 355 qualified for full manual abstraction; 83 (23%) of these were classified as definite MI and 78 (22%) as probable MI. Our EHR-based abstraction is sensitive (>75%) and highly specific (>83%) in classifying ARIC-defined definite MI and definite or probable MI (Table). 
Conclusions: Our results support the potential of a process to extract comprehensive sets of data elements from EHR from different hospitals, with completeness and accuracy sufficient for a standardized definition of hospitalized MI.

Documentation of social determinants in electronic health records with and without standardized terminologies: A comparative study

Proceedings of Singapore Healthcare ◽

10.1177/2010105818785641 ◽

2018 ◽

Vol 28 (1) ◽

pp. 39-47 ◽

Cited By ~ 1

Author(s):

Karen A Monsen ◽

Joyce M Rudenick ◽

Nicole Kapinos ◽

Kathryn Warmbold ◽

Siobhan K McMahon ◽

...

Keyword(s):

Electronic Health Records ◽

Free Text ◽

Snomed Ct ◽

Health Records ◽

Behavioral Determinants ◽

Omaha System ◽

Standardized Terminology ◽

Electronic Health ◽

Data Elements ◽

Improve Health

Background: Electronic health records (EHRs) are a promising new source of population health data that may improve health outcomes. However, little is known about the extent to which social and behavioral determinants of health (SBDH) are currently documented in EHRs, including how SBDH are documented, and by whom. Standardized nursing terminologies have been developed to assess and document SBDH. Objective: We examined the documentation of SBDH in EHRs with and without standardized nursing terminologies. Methods: We carried out a review of the literature for SBDH phrases organized by topic, which were used for analyses. Key informant interviews were conducted regarding SBDH phrases. Results: In nine EHRs (six acute care, three community care) 107 SBDH phrases were documented using free text, structured text, and standardized terminologies in diverse screens and by multiple clinicians, admitting personnel, and other staff. SBDH phrases were documented using one of three standardized terminologies (N = average number of phrases per terminology per EHR): ICD-9/10 (N = 1); SNOMED CT (N = 1); Omaha System (N = 79). Most often, standardized terminology data were documented by nurses or other clinical staff versus receptionists or other non-clinical personnel. Documentation ‘unknown’ differed significantly between EHRs with and without the Omaha System (mean = 26.0 (standard deviation (SD) = 8.7) versus mean = 74.5 (SD = 16.5)) (p = .005). SBDH documentation in EHRs differed based on the presence of a nursing terminology. Conclusions: The Omaha System enabled a more comprehensive, holistic assessment and documentation of interoperable SBDH data. Further research is needed to determine SBDH data elements that are needed across settings, the uses of SBDH data in practice, and to examine patient perspectives related to SBDH assessments.

Are International Classification of Diseases Codes in Electronic Health Records Useful in Identifying Obesity as a Risk Factor When Evaluating Surgical Outcomes?

The Health Care Manager ◽

10.1097/hcm.0000000000000112 ◽

2016 ◽

Vol 35 (4) ◽

pp. 361-367 ◽

Cited By ~ 1

Author(s):

Victoria Goode ◽

Virginia Rovnyak ◽

Ivora Hinton ◽

Elayne Phillips ◽

Elizabeth Merwin

Keyword(s):

Risk Factor ◽

Electronic Health Records ◽

Surgical Outcomes ◽

International Classification Of Diseases ◽

International Classification ◽

Health Records ◽

Classification Of Diseases ◽

Electronic Health

Unlocking the Potential of Electronic Health Records for Health Research

International Journal for Population Data Science ◽

10.23889/ijpds.v5i1.1123 ◽

2020 ◽

Vol 5 (1) ◽

Cited By ~ 1

Author(s):

Seungwon Lee ◽

Yuan Xu ◽

Adam G D'Souza ◽

Elliot A Martin ◽

Chelsea Doktorchik ◽

...

Keyword(s):

Electronic Health Records ◽

Health Research ◽

Care Delivery ◽

Free Text ◽

Imaging Data ◽

Health Records ◽

Data Source ◽

Electronic Health ◽

Data Elements ◽

The City

Electronic health records (EHRs), originally designed to facilitate health care delivery, are becoming a valuable data source for health research. EHR systems have two components: the front end, where the data is entered by healthcare workers including physicians and nurses, and the back-end electronic data warehouse where the data is stored in a relational database. EHR data elements can be of many types, which can be categorized as structured, unstructured free-text, and imaging data. The Sunrise Clinical Manager (SCM) EHR is one example of an inpatient EHR system, which covers the city of Calgary (Alberta, Canada). This system, under the management of Alberta Health Services, is now being explored for research use. The purpose of the present paper is to describe the SCM EHR for research purposes, showing how this generalizes to EHRs in general. We further discuss advantages, challenges (e.g. potential bias and data quality issues), and analytical capacities and requirements associated with using EHRs.

Real-time clinician text feeds from electronic health records

10.1101/2020.10.02.20205617 ◽

2020 ◽

Author(s):

James Teo ◽

Vlad Dinu ◽

William Bernal ◽

Phil Davidson ◽

Vitaliy Oliynyk ◽

...

Keyword(s):

Social Media ◽

Electronic Health Records ◽

Real Time ◽

Capacity Planning ◽

Low Cost ◽

Free Text ◽

Record System ◽

Health Records ◽

Keywords And Phrases ◽

Electronic Health

Abstract Analyses of search engine and social media feeds have been attempted for infectious disease outbreaks [1], but have been found to be susceptible to artefactual distortions from health scares or keyword spamming in social media or the public internet [2–4]. We describe an approach using real-time aggregation of keywords and phrases of free text from real-time clinician-generated documentation in electronic health records to produce a customisable real-time viral pneumonia signal providing up to 2 days warning for secondary care capacity planning. This low-cost approach is open-source, is locally customisable, is not dependent on any specific electronic health record system and can be deployed at multiple organisational scales.
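
The keyword aggregation described above can be illustrated with a very small sketch that counts, per day, the notes mentioning any phrase of interest. This is an assumption-laden toy: the function name `daily_keyword_signal`, the phrase list and the sample notes are all hypothetical, and plain substring matching stands in for whatever text processing the authors' open-source pipeline actually uses (it ignores negation, misspellings and context).

```python
from collections import Counter
from datetime import date


def daily_keyword_signal(notes, phrases):
    """Count notes per day that mention at least one phrase.

    notes: iterable of (date, free_text) pairs.
    phrases: keywords or phrases, matched case-insensitively as substrings.
    Returns a dict mapping each day with at least one match to its count,
    ordered by date.
    """
    lowered = [phrase.lower() for phrase in phrases]
    counts = Counter()
    for day, text in notes:
        body = text.lower()
        if any(phrase in body for phrase in lowered):
            counts[day] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Invented notes for illustration only.
    sample_notes = [
        (date(2020, 3, 1), "Likely viral pneumonia, for CXR"),
        (date(2020, 3, 1), "Fractured wrist"),
        (date(2020, 3, 2), "Bilateral infiltrates; ?Viral Pneumonia"),
        (date(2020, 3, 2), "viral pneumonia suspected"),
    ]
    print(daily_keyword_signal(sample_notes, ["viral pneumonia"]))
```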
