Structured Approach for Evaluating Strategies for Cancer Ascertainment Using Large-Scale Electronic Health Record Data

2018 ◽  
pp. 1-12 ◽  
Author(s):  
Ashley Earles ◽  
Lin Liu ◽  
Ranier Bustamante ◽  
Pat Coke ◽  
Julie Lynch ◽  
...  

Purpose: Cancer ascertainment using large-scale electronic health records is a challenge. Our aim was to propose and apply a structured approach for evaluating multiple candidate strategies for cancer ascertainment, using colorectal cancer (CRC) ascertainment within the US Department of Veterans Affairs (VA) as a use case.

Methods: The proposed approach for evaluating cancer ascertainment strategies includes assessment of individual strategy performance, comparison of agreement across strategies, and review of discordant diagnoses. We applied this approach to compare three strategies for CRC ascertainment within the VA: administrative claims data consisting of International Classification of Diseases, Ninth Revision (ICD9) diagnosis codes; the VA Central Cancer Registry (VACCR); and the newly accessible Oncology Domain, consisting of cases abstracted by local cancer registrars. The study sample consisted of 1,839,043 veterans with index colonoscopy performed from 1999 to 2014. Strategy-specific performance was estimated based on manual record review of 100 candidate CRC cases and 100 colonoscopy controls. Strategies were further compared using Cohen’s κ and focused review of discordant CRC diagnoses.

Results: A total of 92,197 individuals met at least one CRC definition. All three strategies had high sensitivity and specificity for incident CRC. However, the ICD9-based strategy demonstrated poor positive predictive value (58%). VACCR and Oncology Domain had almost perfect agreement with each other (κ, 0.87) but only moderate agreement with ICD9-based diagnoses (κ, 0.51 and 0.57, respectively). Among discordant cases reviewed, 15% of ICD9-positive but VACCR- or Oncology Domain–negative cases had incident CRC.

Conclusion: Evaluating novel strategies for identifying cancer requires a structured approach, including validation against manual record review, agreement among candidate strategies, and focused review of discordant findings. Without careful assessment of ascertainment methods, analyses may be subject to bias and limited in clinical impact.
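The strategy-comparison step above relies on Cohen's κ to quantify chance-corrected agreement between two ascertainment strategies. A minimal sketch of that computation; the ten-patient CRC flags below are invented for illustration, not study data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two ascertainment strategies, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal proportions, summed per category
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical CRC flags (1 = case) from two strategies for ten patients
icd9     = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
registry = [1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
print(round(cohens_kappa(icd9, registry), 2))  # → 0.58
```

κ values near 0.87 (as between VACCR and the Oncology Domain) indicate almost perfect agreement; values in the 0.51–0.57 range, as against ICD9, are conventionally read as moderate.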

2018 ◽  
Vol 9 (1) ◽  
pp. 204589401881477 ◽  
Author(s):  
Simon Teal ◽  
William R. Auger ◽  
Rodney J. Hughes ◽  
Dena Rosen Ramey ◽  
Kelly S. Lewis ◽  
...  

This study aimed to validate an algorithm developed to identify chronic thromboembolic pulmonary hypertension (CTEPH) among patients with a history of pulmonary embolism. Validation was halted because too few patients had gold-standard evidence of CTEPH in the administrative claims/electronic health records database, suggesting that CTEPH is underdiagnosed.


2020 ◽  
Vol 16 (3) ◽  
pp. 531-540 ◽  
Author(s):  
Thomas H. McCoy ◽  
Larry Han ◽  
Amelia M. Pellegrini ◽  
Rudolph E. Tanzi ◽  
Sabina Berretta ◽  
...  

2021 ◽  
Author(s):  
Yuri Ahuja ◽  
Yuesong Zou ◽  
Aman Verma ◽  
David Buckeridge ◽  
Yue Li

Electronic Health Records (EHRs) contain rich clinical data collected at the point of care, and their increasing adoption offers exciting opportunities for clinical informatics, disease risk prediction, and personalized treatment recommendation. However, effective use of EHR data for research and clinical decision support is often hampered by a lack of reliable disease labels. To compile gold-standard labels, researchers often rely on clinical experts to develop rule-based phenotyping algorithms from billing codes and other surrogate features. This process is tedious and error-prone due to recall and observer biases in how codes and measures are selected, and some phenotypes are incompletely captured by a handful of surrogate features. To address this challenge, we present a novel automatic phenotyping model called MixEHR-Guided (MixEHR-G), a multimodal hierarchical Bayesian topic model that efficiently models the EHR generative process by identifying latent phenotype structure in the data. Unlike existing topic modeling algorithms, wherein the inferred topics are not identifiable, MixEHR-G uses prior information from informative surrogate features to align topics with known phenotypes. We applied MixEHR-G to an openly available EHR dataset of 38,597 intensive care patients (MIMIC-III) in Boston, USA and to administrative claims data for a population-based cohort (PopHR) of 1.3 million people in Quebec, Canada. Qualitatively, we demonstrate that MixEHR-G learns interpretable phenotypes and yields meaningful insights about phenotype similarities, comorbidities, and epidemiological associations. Quantitatively, MixEHR-G outperforms existing unsupervised phenotyping methods on a phenotype label annotation task, and it can accurately estimate relative phenotype prevalence functions without gold-standard phenotype information. Altogether, MixEHR-G is an important step towards building an interpretable and automated phenotyping system using EHR data.
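The rule-based phenotyping baseline that MixEHR-G improves on can be as simple as thresholding counts of surrogate billing codes per patient. A hypothetical sketch; the code lists and thresholds are illustrative, not drawn from any validated algorithm:

```python
# Minimal rule-based phenotyping: a patient is flagged for a phenotype when
# their record contains at least `min_hits` occurrences of its surrogate codes.
# Code sets and thresholds below are illustrative only.
RULES = {
    "type_2_diabetes": {"codes": {"250.00", "250.02", "E11.9"}, "min_hits": 2},
    "hypertension":    {"codes": {"401.1", "401.9", "I10"},     "min_hits": 1},
}

def phenotype(patient_codes, rules=RULES):
    """Return a phenotype -> bool map for one patient's billing-code list."""
    def hits(code_set):
        return sum(c in code_set for c in patient_codes)  # counts repeat codes
    return {name: hits(r["codes"]) >= r["min_hits"] for name, r in rules.items()}

record = ["250.00", "E11.9", "401.9", "786.50"]
print(phenotype(record))  # → {'type_2_diabetes': True, 'hypertension': True}
```

The recall and observer biases noted above enter exactly here: an expert who forgets a relevant code, or sets `min_hits` too aggressively, silently changes which patients are labeled.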


JAMIA Open ◽  
2019 ◽  
Vol 2 (4) ◽  
pp. 570-579 ◽  
Author(s):  
Na Hong ◽  
Andrew Wen ◽  
Feichen Shen ◽  
Sunghwan Sohn ◽  
Chen Wang ◽  
...  

Abstract

Objective: To design, develop, and evaluate a scalable clinical data normalization pipeline for standardizing unstructured electronic health record (EHR) data leveraging the HL7 Fast Healthcare Interoperability Resources (FHIR) specification.

Methods: We established an FHIR-based clinical data normalization pipeline known as NLP2FHIR that mainly comprises: (1) a module for a core natural language processing (NLP) engine with an FHIR-based type system; (2) a module for integrating structured data; and (3) a module for content normalization. We evaluated the FHIR modeling capability focusing on core clinical resources such as Condition, Procedure, MedicationStatement (including Medication), and FamilyMemberHistory using Mayo Clinic’s unstructured EHR data. We constructed a gold standard reusing annotation corpora from previous NLP projects.

Results: A total of 30 mapping rules, 62 normalization rules, and 11 NLP-specific FHIR extensions were created and implemented in the NLP2FHIR pipeline. The elements that need to integrate structured data from each clinical resource were identified. The performance of unstructured data modeling achieved F scores ranging from 0.69 to 0.99 for various FHIR element representations (0.69–0.99 for Condition; 0.75–0.84 for Procedure; 0.71–0.99 for MedicationStatement; and 0.75–0.95 for FamilyMemberHistory).

Conclusion: We demonstrated that the NLP2FHIR pipeline is feasible for modeling unstructured EHR data and integrating structured elements into the model. The outcomes of this work provide standards-based tools for clinical data normalization that are indispensable for enabling portable EHR-driven phenotyping and large-scale data analytics, as well as useful insights for future developments of the FHIR specifications with regard to handling unstructured clinical data.
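To illustrate the kind of mapping NLP2FHIR performs, the sketch below turns a hypothetical NLP concept mention into a minimal FHIR R4 Condition resource. The mention structure (`text`, `cui_code`, `patient_id`, `negated`) is invented for this example; the real pipeline emits far richer resources plus its NLP-specific extensions:

```python
import json

def condition_from_mention(mention):
    """Map a (hypothetical) NLP concept mention to a minimal FHIR R4 Condition.
    Only a few elements are shown; field names follow the FHIR Condition spec."""
    return {
        "resourceType": "Condition",
        "code": {
            "coding": [{
                "system": "http://snomed.info/sct",
                "code": mention["cui_code"],
                "display": mention["text"],
            }]
        },
        "subject": {"reference": f"Patient/{mention['patient_id']}"},
        # Negated mentions map to verificationStatus "refuted"
        "verificationStatus": {
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/condition-ver-status",
                "code": "refuted" if mention["negated"] else "confirmed",
            }]
        },
    }

m = {"text": "essential hypertension", "cui_code": "59621000",
     "patient_id": "123", "negated": False}
print(json.dumps(condition_from_mention(m), indent=2))
```

Normalization rules of the kind the paper counts (30 mapping rules, 62 normalization rules) would sit between the raw NLP output and a builder like this, standardizing code systems, units, and value representations.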


ACI Open ◽  
2019 ◽  
Vol 03 (01) ◽  
pp. e44-e62
Author(s):  
Fabrizio Pecoraro ◽  
Daniela Luzi ◽  
Fabrizio L. Ricci

Background: The growing availability of clinical and administrative data collected in electronic health records (EHRs) has led researchers and policy makers to implement data warehouses to improve the reuse of EHR data for secondary purposes. This approach can take advantage of a unique source of information that collects data from providers across multiple organizations. Moreover, the development of a data warehouse benefits from the standards adopted to exchange data provided by heterogeneous systems.

Objective: This article aims to design and implement a conceptual framework that semiautomatically extracts information collected in Health Level 7 Clinical Document Architecture (CDA) documents stored in an EHR and transforms it for loading into a target data warehouse.

Results: The solution adopted in this article supports the integration of the EHR as an operational data store in a data warehouse infrastructure. Moreover, the data structures of EHR clinical documents and data warehouse modeling schemas are analyzed to define a semiautomatic framework that maps the primitives of the CDA onto the concepts of the dimensional model. The case study successfully tests this approach.

Conclusion: The proposed solution guarantees data quality by using structured documents already integrated in a large-scale infrastructure, with a timely updated information flow. It ensures data integrity and consistency and has the advantage of being based on a sample size that covers a broad target population. Moreover, the use of CDAs simplifies the definition of extract, transform, and load (ETL) tools through the adoption of a conceptual framework that loads the information stored in the CDA into the data warehouse.
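The CDA-to-dimensional-model mapping can be sketched as a toy ETL step: the `<code>` element feeds a concept dimension, `<effectiveTime>` the time dimension, and `<value>` becomes the fact-table measure. The fragment below is heavily simplified (real CDA uses HL7 v3 namespaces and nested entry structures), and the dimension/fact layout is illustrative:

```python
import xml.etree.ElementTree as ET

# A heavily simplified CDA-like fragment; real CDA observations live inside
# namespaced <entry> structures within document sections.
CDA = """
<observation>
  <code code="8480-6" codeSystem="LOINC" displayName="Systolic BP"/>
  <effectiveTime value="20210315"/>
  <value unit="mmHg">142</value>
</observation>
"""

def extract_fact(xml_text, dim_concept):
    """ETL sketch mapping CDA primitives onto a dimensional model:
    look up (or create) the concept-dimension key, then emit a fact row."""
    obs = ET.fromstring(xml_text)
    code = obs.find("code").get("code")
    concept_key = dim_concept.setdefault(code, len(dim_concept) + 1)
    return {
        "concept_key": concept_key,
        "date_key": obs.find("effectiveTime").get("value"),
        "value": float(obs.find("value").text),
        "unit": obs.find("value").get("unit"),
    }

dims = {}
print(extract_fact(CDA, dims))
# → {'concept_key': 1, 'date_key': '20210315', 'value': 142.0, 'unit': 'mmHg'}
```

Because the CDA constrains where codes, times, and values appear, rules like this can be derived semiautomatically rather than written per source system, which is the framework's central point.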


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Rishi J. Desai ◽  
Michael E. Matheny ◽  
Kevin Johnson ◽  
Keith Marsolo ◽  
Lesley H. Curtis ◽  
...  

Abstract

The Sentinel System is a major component of the United States Food and Drug Administration’s (FDA) approach to active medical product safety surveillance. While Sentinel has historically relied on large quantities of health insurance claims data, leveraging longitudinal electronic health records (EHRs) that contain more detailed clinical information, as structured and unstructured features, may address some of the current gaps in capabilities. We identify key challenges when using EHR data to investigate medical product safety in a scalable and accelerated way, outline potential solutions, and describe the Sentinel Innovation Center’s initiatives to put solutions into practice by expanding and strengthening the existing system with a query-ready, large-scale data infrastructure of linked EHR and claims data. We describe our initiatives in four strategic priority areas: (1) data infrastructure, (2) feature engineering, (3) causal inference, and (4) detection analytics, with the goal of incorporating emerging data science innovations to maximize the utility of EHR data for medical product safety surveillance.


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0247203
Author(s):  
Ravi B. Parikh ◽  
Kristin A. Linn ◽  
Jiali Yan ◽  
Matthew L. Maciejewski ◽  
Ann-Marie Rosland ◽  
...  

Background: Identifying individuals at risk for future hospitalization or death has been a major priority of population health management strategies. High-risk individuals are a heterogeneous group, and existing studies describing heterogeneity in high-risk individuals have been limited by data focused on clinical comorbidities and not socioeconomic or behavioral factors. We used machine learning clustering methods and linked comorbidity-based, sociodemographic, and psychobehavioral data to identify subgroups of high-risk Veterans and study long-term outcomes, hypothesizing that factors other than comorbidities would characterize several subgroups.

Methods and findings: In this cross-sectional study, we used data from the VA Corporate Data Warehouse, a national repository of VA administrative claims and electronic health data. To identify high-risk Veterans, we used the Care Assessment Needs (CAN) score, a routinely used VA model that predicts a patient’s percentile risk of hospitalization or death at one year. Our study population consisted of 110,000 Veterans who were randomly sampled from 1,920,436 Veterans with a CAN score ≥ 75th percentile in 2014. We categorized patient-level data into 119 independent variables based on demographics, comorbidities, pharmacy, vital signs, laboratories, and prior utilization. We used a previously validated density-based clustering algorithm to identify 30 subgroups of high-risk Veterans ranging in size from 50 to 2,446 patients. Mean CAN score ranged from 72.4 to 90.3 among subgroups. Two-year mortality ranged from 0.9% to 45.6% and was highest in the home-based care and metastatic cancer subgroups. Mean inpatient days ranged from 1.4 to 30.5 and were highest in the post-surgery and blood loss anemia subgroups. Mean emergency room visits ranged from 1.0 to 4.3 and were highest in the chronic sedative use and polysubstance use with amphetamine predominance subgroups. Five subgroups were distinguished by psychobehavioral factors and four subgroups were distinguished by sociodemographic factors.

Conclusions: High-risk Veterans are a heterogeneous population consisting of multiple distinct subgroups, many of which are not defined by clinical comorbidities, with distinct utilization and outcome patterns. To our knowledge, this represents the largest application of machine learning clustering methods to subgroup a high-risk population. Further study is needed to determine whether distinct subgroups may benefit from individualized interventions.
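The study's previously validated density-based algorithm is not reproduced here, but the general idea behind density-based clustering can be illustrated with a minimal pure-Python DBSCAN over synthetic two-dimensional "patient" features (all data below is invented):

```python
import math

def dbscan(points, eps=1.0, min_pts=3):
    """Minimal DBSCAN: core points with >= min_pts neighbors within eps grow
    clusters; anything density-unreachable is labeled -1 (noise)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # noise (may later join a cluster's border)
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point adopted by this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # j is core: expand through it
    return labels

# Two dense groups of synthetic "patients" plus one outlier
pts = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 4.9), (4.9, 5.2), (10, 10)]
print(dbscan(pts, eps=0.6, min_pts=3))  # → [0, 0, 0, 1, 1, 1, -1]
```

Unlike k-means, a density-based method does not force every patient into a cluster and does not require the number of subgroups up front, which suits a heterogeneous high-risk population with genuine outliers.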


Circulation ◽  
2017 ◽  
Vol 135 (suppl_1) ◽  
Author(s):  
Randi Foraker ◽  
Sejal Patel ◽  
Yosef Khan ◽  
Mary Ann Bauman ◽  
Julie Bower

Background: Electronic health records (EHRs) are an increasingly valuable data source for monitoring population health. However, EHR data are rarely shared across health system borders, limiting their utility to researchers and policymakers. The Guideline Advantage™ (TGA) program, a joint initiative by the American Heart Association (AHA), American Cancer Society, and American Diabetes Association, brings together data from EHRs across the country to support disease prevention and management efforts in the outpatient setting.

Methods: We analyzed TGA EHR data from >70 clinics comprising 281,837 adult patients from 2010 to 2015. We used the first available measure per patient for each calendar year to characterize trends in the proportion of patients in “ideal”, “intermediate”, and “poor” cardiovascular health (CVH) categories for blood pressure (BP), body mass index (BMI), and smoking. Total cholesterol and fasting glucose values were not reported to TGA. Thus, we used low-density lipoprotein (LDL) and hemoglobin A1c (A1c) treatment guidelines to classify patients into CVH categories for the respective metrics.

Results: Patients were an average of 50 years old, and 57.4% were female. Of records with complete data on race, 70.9% of patients were white. Over 6 years of observation, we documented increases in the proportion of patients at ideal levels for BP, smoking, LDL, and A1c, but decreases in the proportion of patients at an ideal level for BMI (Figure).

Conclusions: TGA data provide a large-scale perspective of outpatient CVH, yet we acknowledge limitations associated with using EHR data to assess trends in CVH. Specifically, EHR data entry is clinically driven: BP and BMI values are likely to be updated at each visit for each patient, while smoking status, LDL, and A1c are not. Our analysis lays the groundwork for EHR analyses as these data become less siloed and more accessible to stakeholders.

Figure. Trends in CVH from 2010 to 2015: The Guideline Advantage™
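The guideline-based CVH categorization can be sketched for one metric, blood pressure. The thresholds below follow the commonly published AHA "Life's Simple 7" definitions; the abstract does not restate them, so treat this mapping as an assumption rather than the authors' exact rule:

```python
def bp_category(sbp, dbp, treated=False):
    """Classify blood pressure into "ideal" / "intermediate" / "poor"
    cardiovascular-health categories (assumed Life's Simple 7 thresholds)."""
    if sbp >= 140 or dbp >= 90:
        return "poor"
    if sbp >= 120 or dbp >= 80 or treated:
        return "intermediate"   # elevated-but-not-poor, or treated to goal
    return "ideal"

print(bp_category(118, 76))                # → ideal
print(bp_category(118, 76, treated=True))  # → intermediate
print(bp_category(150, 85))                # → poor
```

Applying such a function to the first available reading per patient per calendar year, as the abstract describes, yields the annual ideal/intermediate/poor proportions plotted in the figure.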

