Structured Approach for Evaluating Strategies for Cancer Ascertainment Using Large-Scale Electronic Health Record Data

2018 ◽  
pp. 1-12 ◽  
Author(s):  
Ashley Earles ◽  
Lin Liu ◽  
Ranier Bustamante ◽  
Pat Coke ◽  
Julie Lynch ◽  
...  

Purpose: Cancer ascertainment using large-scale electronic health records is a challenge. Our aim was to propose and apply a structured approach for evaluating multiple candidate strategies for cancer ascertainment, using colorectal cancer (CRC) ascertainment within the US Department of Veterans Affairs (VA) as a use case.

Methods: The proposed approach for evaluating cancer ascertainment strategies includes assessment of individual strategy performance, comparison of agreement across strategies, and review of discordant diagnoses. We applied this approach to compare three strategies for CRC ascertainment within the VA: administrative claims data consisting of International Classification of Diseases, Ninth Revision (ICD9) diagnosis codes; the VA Central Cancer Registry (VACCR); and the newly accessible Oncology Domain, consisting of cases abstracted by local cancer registrars. The study sample consisted of 1,839,043 veterans with index colonoscopy performed from 1999 to 2014. Strategy-specific performance was estimated based on manual record review of 100 candidate CRC cases and 100 colonoscopy controls. Strategies were further compared using Cohen’s κ and focused review of discordant CRC diagnoses.

Results: A total of 92,197 individuals met at least one CRC definition. All three strategies had high sensitivity and specificity for incident CRC. However, the ICD9-based strategy demonstrated poor positive predictive value (58%). VACCR and Oncology Domain had almost perfect agreement with each other (κ, 0.87) but only moderate agreement with ICD9-based diagnoses (κ, 0.51 and 0.57, respectively). Among discordant cases reviewed, 15% of ICD9-positive but VACCR- or Oncology Domain–negative cases had incident CRC.

Conclusion: Evaluating novel strategies for identifying cancer requires a structured approach, including validation against manual record review, agreement among candidate strategies, and focused review of discordant findings. Without careful assessment of ascertainment methods, analyses may be subject to bias and limited in clinical impact.
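The strategy-comparison step above relies on Cohen's κ to quantify chance-corrected agreement between two ascertainment strategies. A minimal sketch of that computation; the ten-patient CRC flags below are invented for illustration, not study data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two ascertainment strategies, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal proportions, summed per category
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical CRC flags (1 = case) from two strategies for ten patients
icd9     = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
registry = [1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
print(round(cohens_kappa(icd9, registry), 2))  # → 0.58
```

κ values near 0.87 (as between VACCR and the Oncology Domain) indicate almost perfect agreement; values in the 0.51–0.57 range, as against ICD9, are conventionally read as moderate.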

2018 ◽  
Vol 9 (1) ◽  
pp. 204589401881477 ◽  
Author(s):  
Simon Teal ◽  
William R. Auger ◽  
Rodney J. Hughes ◽  
Dena Rosen Ramey ◽  
Kelly S. Lewis ◽  
...  

This study aimed to validate an algorithm developed to identify chronic thromboembolic pulmonary hypertension (CTEPH) among patients with a history of pulmonary embolism. Validation was halted because too few patients had gold-standard evidence of CTEPH in the administrative claims/electronic health records database, suggesting that CTEPH is underdiagnosed.


2020 ◽  
Vol 16 (3) ◽  
pp. 531-540 ◽  
Author(s):  
Thomas H. McCoy ◽  
Larry Han ◽  
Amelia M. Pellegrini ◽  
Rudolph E. Tanzi ◽  
Sabina Berretta ◽  
...  

2021 ◽  
Author(s):  
Yuri Ahuja ◽  
Yuesong Zou ◽  
Aman Verma ◽  
David Buckeridge ◽  
Yue Li

Electronic Health Records (EHRs) contain rich clinical data collected at the point of care, and their increasing adoption offers exciting opportunities for clinical informatics, disease risk prediction, and personalized treatment recommendation. However, effective use of EHR data for research and clinical decision support is often hampered by a lack of reliable disease labels. To compile gold-standard labels, researchers often rely on clinical experts to develop rule-based phenotyping algorithms from billing codes and other surrogate features. This process is tedious and error-prone due to recall and observer biases in how codes and measures are selected, and some phenotypes are incompletely captured by a handful of surrogate features. To address this challenge, we present a novel automatic phenotyping model called MixEHR-Guided (MixEHR-G), a multimodal hierarchical Bayesian topic model that efficiently models the EHR generative process by identifying latent phenotype structure in the data. Unlike existing topic modeling algorithms, wherein the inferred topics are not identifiable, MixEHR-G uses prior information from informative surrogate features to align topics with known phenotypes. We applied MixEHR-G to an openly available EHR dataset of 38,597 intensive care patients (MIMIC-III) in Boston, USA and to administrative claims data for a population-based cohort (PopHR) of 1.3 million people in Quebec, Canada. Qualitatively, we demonstrate that MixEHR-G learns interpretable phenotypes and yields meaningful insights about phenotype similarities, comorbidities, and epidemiological associations. Quantitatively, MixEHR-G outperforms existing unsupervised phenotyping methods on a phenotype label annotation task, and it can accurately estimate relative phenotype prevalence functions without gold-standard phenotype information. Altogether, MixEHR-G is an important step towards building an interpretable and automated phenotyping system using EHR data.
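The rule-based phenotyping baseline that MixEHR-G improves on can be as simple as thresholding counts of surrogate billing codes per patient. A hypothetical sketch; the code lists and thresholds are illustrative, not drawn from any validated algorithm:

```python
# Minimal rule-based phenotyping: a patient is flagged for a phenotype when
# their record contains at least `min_hits` occurrences of its surrogate codes.
# Code sets and thresholds below are illustrative only.
RULES = {
    "type_2_diabetes": {"codes": {"250.00", "250.02", "E11.9"}, "min_hits": 2},
    "hypertension":    {"codes": {"401.1", "401.9", "I10"},     "min_hits": 1},
}

def phenotype(patient_codes, rules=RULES):
    """Return a phenotype -> bool map for one patient's billing-code list."""
    def hits(code_set):
        return sum(c in code_set for c in patient_codes)  # counts repeat codes
    return {name: hits(r["codes"]) >= r["min_hits"] for name, r in rules.items()}

record = ["250.00", "E11.9", "401.9", "786.50"]
print(phenotype(record))  # → {'type_2_diabetes': True, 'hypertension': True}
```

The recall and observer biases noted above enter exactly here: an expert who forgets a relevant code, or sets `min_hits` too aggressively, silently changes which patients are labeled.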


JAMIA Open ◽  
2019 ◽  
Vol 2 (4) ◽  
pp. 570-579 ◽  
Author(s):  
Na Hong ◽  
Andrew Wen ◽  
Feichen Shen ◽  
Sunghwan Sohn ◽  
Chen Wang ◽  
...  

Abstract

Objective: To design, develop, and evaluate a scalable clinical data normalization pipeline for standardizing unstructured electronic health record (EHR) data leveraging the HL7 Fast Healthcare Interoperability Resources (FHIR) specification.

Methods: We established an FHIR-based clinical data normalization pipeline known as NLP2FHIR that mainly comprises: (1) a module for a core natural language processing (NLP) engine with an FHIR-based type system; (2) a module for integrating structured data; and (3) a module for content normalization. We evaluated the FHIR modeling capability focusing on core clinical resources such as Condition, Procedure, MedicationStatement (including Medication), and FamilyMemberHistory using Mayo Clinic’s unstructured EHR data. We constructed a gold standard reusing annotation corpora from previous NLP projects.

Results: A total of 30 mapping rules, 62 normalization rules, and 11 NLP-specific FHIR extensions were created and implemented in the NLP2FHIR pipeline. The elements that need to integrate structured data from each clinical resource were identified. The performance of unstructured data modeling achieved F scores ranging from 0.69 to 0.99 for various FHIR element representations (0.69–0.99 for Condition; 0.75–0.84 for Procedure; 0.71–0.99 for MedicationStatement; and 0.75–0.95 for FamilyMemberHistory).

Conclusion: We demonstrated that the NLP2FHIR pipeline is feasible for modeling unstructured EHR data and integrating structured elements into the model. The outcomes of this work provide standards-based tools for clinical data normalization that are indispensable for enabling portable EHR-driven phenotyping and large-scale data analytics, as well as useful insights for future developments of the FHIR specifications with regard to handling unstructured clinical data.
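To illustrate the kind of mapping NLP2FHIR performs, the sketch below turns a hypothetical NLP concept mention into a minimal FHIR R4 Condition resource. The mention structure (`text`, `cui_code`, `patient_id`, `negated`) is invented for this example; the real pipeline emits far richer resources plus its NLP-specific extensions:

```python
import json

def condition_from_mention(mention):
    """Map a (hypothetical) NLP concept mention to a minimal FHIR R4 Condition.
    Only a few elements are shown; field names follow the FHIR Condition spec."""
    return {
        "resourceType": "Condition",
        "code": {
            "coding": [{
                "system": "http://snomed.info/sct",
                "code": mention["cui_code"],
                "display": mention["text"],
            }]
        },
        "subject": {"reference": f"Patient/{mention['patient_id']}"},
        # Negated mentions map to verificationStatus "refuted"
        "verificationStatus": {
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/condition-ver-status",
                "code": "refuted" if mention["negated"] else "confirmed",
            }]
        },
    }

m = {"text": "essential hypertension", "cui_code": "59621000",
     "patient_id": "123", "negated": False}
print(json.dumps(condition_from_mention(m), indent=2))
```

Normalization rules of the kind the paper counts (30 mapping rules, 62 normalization rules) would sit between the raw NLP output and a builder like this, standardizing code systems, units, and value representations.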


ACI Open ◽  
2019 ◽  
Vol 03 (01) ◽  
pp. e44-e62
Author(s):  
Fabrizio Pecoraro ◽  
Daniela Luzi ◽  
Fabrizio L. Ricci

Background: The growing availability of clinical and administrative data collected in electronic health records (EHRs) has led researchers and policy makers to implement data warehouses to improve the reuse of EHR data for secondary purposes. This approach can take advantage of a unique source of information that collects data from providers across multiple organizations. Moreover, the development of a data warehouse benefits from the standards adopted to exchange data provided by heterogeneous systems.

Objective: This article aims to design and implement a conceptual framework that semiautomatically extracts information collected in Health Level 7 Clinical Document Architecture (CDA) documents stored in an EHR and transforms it for loading into a target data warehouse.

Results: The solution adopted in this article supports the integration of the EHR as an operational data store in a data warehouse infrastructure. Moreover, the data structures of EHR clinical documents and data warehouse modeling schemas are analyzed to define a semiautomatic framework that maps the primitives of the CDA onto the concepts of the dimensional model. The case study successfully tests this approach.

Conclusion: The proposed solution guarantees data quality by using structured documents already integrated in a large-scale infrastructure, with a timely updated information flow. It ensures data integrity and consistency and has the advantage of being based on a sample size that covers a broad target population. Moreover, the use of CDAs simplifies the definition of extract, transform, and load (ETL) tools through the adoption of a conceptual framework that loads the information stored in the CDA into the data warehouse.
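The CDA-to-dimensional-model mapping can be sketched as a toy ETL step: the `<code>` element feeds a concept dimension, `<effectiveTime>` the time dimension, and `<value>` becomes the fact-table measure. The fragment below is heavily simplified (real CDA uses HL7 v3 namespaces and nested entry structures), and the dimension/fact layout is illustrative:

```python
import xml.etree.ElementTree as ET

# A heavily simplified CDA-like fragment; real CDA observations live inside
# namespaced <entry> structures within document sections.
CDA = """
<observation>
  <code code="8480-6" codeSystem="LOINC" displayName="Systolic BP"/>
  <effectiveTime value="20210315"/>
  <value unit="mmHg">142</value>
</observation>
"""

def extract_fact(xml_text, dim_concept):
    """ETL sketch mapping CDA primitives onto a dimensional model:
    look up (or create) the concept-dimension key, then emit a fact row."""
    obs = ET.fromstring(xml_text)
    code = obs.find("code").get("code")
    concept_key = dim_concept.setdefault(code, len(dim_concept) + 1)
    return {
        "concept_key": concept_key,
        "date_key": obs.find("effectiveTime").get("value"),
        "value": float(obs.find("value").text),
        "unit": obs.find("value").get("unit"),
    }

dims = {}
print(extract_fact(CDA, dims))
# → {'concept_key': 1, 'date_key': '20210315', 'value': 142.0, 'unit': 'mmHg'}
```

Because the CDA constrains where codes, times, and values appear, rules like this can be derived semiautomatically rather than written per source system, which is the framework's central point.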


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Rishi J. Desai ◽  
Michael E. Matheny ◽  
Kevin Johnson ◽  
Keith Marsolo ◽  
Lesley H. Curtis ◽  
...  

Abstract

The Sentinel System is a major component of the United States Food and Drug Administration’s (FDA) approach to active medical product safety surveillance. While Sentinel has historically relied on large quantities of health insurance claims data, leveraging longitudinal electronic health records (EHRs) that contain more detailed clinical information, as structured and unstructured features, may address some of the current gaps in capabilities. We identify key challenges when using EHR data to investigate medical product safety in a scalable and accelerated way, outline potential solutions, and describe the Sentinel Innovation Center’s initiatives to put solutions into practice by expanding and strengthening the existing system with a query-ready, large-scale data infrastructure of linked EHR and claims data. We describe our initiatives in four strategic priority areas: (1) data infrastructure, (2) feature engineering, (3) causal inference, and (4) detection analytics, with the goal of incorporating emerging data science innovations to maximize the utility of EHR data for medical product safety surveillance.


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0247203
Author(s):  
Ravi B. Parikh ◽  
Kristin A. Linn ◽  
Jiali Yan ◽  
Matthew L. Maciejewski ◽  
Ann-Marie Rosland ◽  
...  

Background: Identifying individuals at risk for future hospitalization or death has been a major priority of population health management strategies. High-risk individuals are a heterogeneous group, and existing studies describing heterogeneity in high-risk individuals have been limited by data focused on clinical comorbidities and not socioeconomic or behavioral factors. We used machine learning clustering methods and linked comorbidity-based, sociodemographic, and psychobehavioral data to identify subgroups of high-risk Veterans and study long-term outcomes, hypothesizing that factors other than comorbidities would characterize several subgroups.

Methods and findings: In this cross-sectional study, we used data from the VA Corporate Data Warehouse, a national repository of VA administrative claims and electronic health data. To identify high-risk Veterans, we used the Care Assessment Needs (CAN) score, a routinely used VA model that predicts a patient’s percentile risk of hospitalization or death at one year. Our study population consisted of 110,000 Veterans who were randomly sampled from 1,920,436 Veterans with a CAN score ≥ 75th percentile in 2014. We categorized patient-level data into 119 independent variables based on demographics, comorbidities, pharmacy, vital signs, laboratories, and prior utilization. We used a previously validated density-based clustering algorithm to identify 30 subgroups of high-risk Veterans ranging in size from 50 to 2,446 patients. Mean CAN score ranged from 72.4 to 90.3 among subgroups. Two-year mortality ranged from 0.9% to 45.6% and was highest in the home-based care and metastatic cancer subgroups. Mean inpatient days ranged from 1.4 to 30.5 and were highest in the post-surgery and blood loss anemia subgroups. Mean emergency room visits ranged from 1.0 to 4.3 and were highest in the chronic sedative use and polysubstance use with amphetamine predominance subgroups. Five subgroups were distinguished by psychobehavioral factors and four subgroups were distinguished by sociodemographic factors.

Conclusions: High-risk Veterans are a heterogeneous population consisting of multiple distinct subgroups, many of which are not defined by clinical comorbidities, with distinct utilization and outcome patterns. To our knowledge, this represents the largest application of machine learning clustering methods to subgroup a high-risk population. Further study is needed to determine whether distinct subgroups may benefit from individualized interventions.
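The study's previously validated density-based algorithm is not reproduced here, but the general idea behind density-based clustering can be illustrated with a minimal pure-Python DBSCAN over synthetic two-dimensional "patient" features (all data below is invented):

```python
import math

def dbscan(points, eps=1.0, min_pts=3):
    """Minimal DBSCAN: core points with >= min_pts neighbors within eps grow
    clusters; anything density-unreachable is labeled -1 (noise)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # noise (may later join a cluster's border)
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point adopted by this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # j is core: expand through it
    return labels

# Two dense groups of synthetic "patients" plus one outlier
pts = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 4.9), (4.9, 5.2), (10, 10)]
print(dbscan(pts, eps=0.6, min_pts=3))  # → [0, 0, 0, 1, 1, 1, -1]
```

Unlike k-means, a density-based method does not force every patient into a cluster and does not require the number of subgroups up front, which suits a heterogeneous high-risk population with genuine outliers.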


Circulation ◽  
2017 ◽  
Vol 135 (suppl_1) ◽  
Author(s):  
Randi Foraker ◽  
Sejal Patel ◽  
Yosef Khan ◽  
Mary Ann Bauman ◽  
Julie Bower

Background: Electronic health records (EHRs) are an increasingly valuable data source for monitoring population health. However, EHR data are rarely shared across health system borders, limiting their utility to researchers and policymakers. The Guideline Advantage™ (TGA) program, a joint initiative by the American Heart Association (AHA), American Cancer Society, and American Diabetes Association, brings together data from EHRs across the country to support disease prevention and management efforts in the outpatient setting.

Methods: We analyzed TGA EHR data from >70 clinics comprising 281,837 adult patients from 2010 to 2015. We used the first available measure per patient for each calendar year to characterize trends in the proportion of patients in “ideal”, “intermediate”, and “poor” cardiovascular health (CVH) categories for blood pressure (BP), body mass index (BMI), and smoking. Total cholesterol and fasting glucose values were not reported to TGA. Thus, we used low-density lipoprotein (LDL) and hemoglobin A1c (A1c) treatment guidelines to classify patients into CVH categories for the respective metrics.

Results: Patients were an average of 50 years old, and 57.4% were female. Of records with complete data on race, 70.9% of patients were white. Over 6 years of observation, we documented increases in the proportion of patients at ideal levels for BP, smoking, LDL, and A1c, but decreases in the proportion of patients at an ideal level for BMI (Figure).

Conclusions: TGA data provide a large-scale perspective of outpatient CVH, yet we acknowledge limitations associated with using EHR data to assess trends in CVH. Specifically, EHR data entry is clinically driven: BP and BMI values are likely to be updated at each visit for each patient, while smoking status, LDL, and A1c are not. Our analysis lays the groundwork for EHR analyses as these data become less siloed and more accessible to stakeholders.

Figure. Trends in CVH from 2010 to 2015: The Guideline Advantage™
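The guideline-based CVH categorization can be sketched for one metric, blood pressure. The thresholds below follow the commonly published AHA "Life's Simple 7" definitions; the abstract does not restate them, so treat this mapping as an assumption rather than the authors' exact rule:

```python
def bp_category(sbp, dbp, treated=False):
    """Classify blood pressure into "ideal" / "intermediate" / "poor"
    cardiovascular-health categories (assumed Life's Simple 7 thresholds)."""
    if sbp >= 140 or dbp >= 90:
        return "poor"
    if sbp >= 120 or dbp >= 80 or treated:
        return "intermediate"   # elevated-but-not-poor, or treated to goal
    return "ideal"

print(bp_category(118, 76))                # → ideal
print(bp_category(118, 76, treated=True))  # → intermediate
print(bp_category(150, 85))                # → poor
```

Applying such a function to the first available reading per patient per calendar year, as the abstract describes, yields the annual ideal/intermediate/poor proportions plotted in the figure.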

