Cox regression increases power to detect genotype-phenotype associations in genomic studies using the electronic health record

2019 ◽  
Author(s):  
Jacob J. Hughey ◽  
Seth D. Rhoades ◽  
Darwin Y. Fu ◽  
Lisa Bastarache ◽  
Joshua C. Denny ◽  
...  

Abstract Background The growth of DNA biobanks linked to data from electronic health records (EHRs) has enabled the discovery of numerous associations between genomic variants and clinical phenotypes. Nonetheless, although clinical data are generally longitudinal, standard approaches for detecting genotype-phenotype associations in such linked data, notably logistic regression, do not naturally account for the times at which events occur. Here we explored the advantages of quantifying associations using Cox proportional hazards regression, which can account for the age at which a patient first visited the healthcare system (left truncation) and the age at which a patient either last visited the healthcare system or acquired a particular phenotype (right censoring). Results Using simulated data, we found that, compared to logistic regression, Cox regression had greater power at equivalent Type I error. We then scanned for genotype-phenotype associations using logistic regression and Cox regression on 50 phenotypes derived from the electronic health records of 49 792 genotyped individuals. In terms of effect sizes, the hazard ratios estimated by Cox regression were nearly identical to the odds ratios estimated by logistic regression. Consistent with the findings from our simulations, Cox regression had approximately 10% greater relative sensitivity for detecting known associations from the NHGRI-EBI GWAS Catalog. Conclusions As longitudinal health-related data continue to grow, Cox regression may improve our ability to identify the genetic basis for a wide range of human phenotypes.

BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Jacob J. Hughey ◽  
Seth D. Rhoades ◽  
Darwin Y. Fu ◽  
Lisa Bastarache ◽  
Joshua C. Denny ◽  
...  

Abstract Background The growth of DNA biobanks linked to data from electronic health records (EHRs) has enabled the discovery of numerous associations between genomic variants and clinical phenotypes. Nonetheless, although clinical data are generally longitudinal, standard approaches for detecting genotype-phenotype associations in such linked data, notably logistic regression, do not naturally account for variation in the period of follow-up or the time at which an event occurs. Here we explored the advantages of quantifying associations using Cox proportional hazards regression, which can account for the age at which a patient first visited the healthcare system (left truncation) and the age at which a patient either last visited the healthcare system or acquired a particular phenotype (right censoring). Results In comprehensive simulations, we found that, compared to logistic regression, Cox regression had greater power at equivalent Type I error. We then scanned for genotype-phenotype associations using logistic regression and Cox regression on 50 phenotypes derived from the EHRs of 49,792 genotyped individuals. Consistent with the findings from our simulations, Cox regression had approximately 10% greater relative sensitivity for detecting known associations from the NHGRI-EBI GWAS Catalog. In terms of effect sizes, the hazard ratios estimated by Cox regression were strongly correlated with the odds ratios estimated by logistic regression. Conclusions As longitudinal health-related data continue to grow, Cox regression may improve our ability to identify the genetic basis for a wide range of human phenotypes.
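To make the role of left truncation and right censoring concrete, the following is a minimal sketch of a Cox partial log-likelihood for a single covariate. It is not the authors' code: the function names, the toy ages, and the single-covariate setup are assumptions for illustration. The key idea is that a patient contributes to the risk set at age t only if they entered observation before t and were still under observation at t.

```python
import math


def risk_set(t, entry, exit_):
    """Indices of subjects under observation at age t.

    A subject is at risk if they entered before t (left truncation)
    and had not yet left observation before t (right censoring or event).
    """
    return [j for j in range(len(exit_)) if entry[j] < t <= exit_[j]]


def cox_partial_loglik(beta, entry, exit_, event, x):
    """Cox partial log-likelihood for one covariate x with delayed entry.

    entry:  age at first visit (left truncation time)
    exit_:  age at last visit or at first occurrence of the phenotype
    event:  1 if the phenotype occurred at exit_, 0 if censored
    Tied event times are handled as in the Breslow approximation.
    """
    ll = 0.0
    for i in range(len(exit_)):
        if not event[i]:
            continue  # censored subjects only appear in risk sets
        at_risk = risk_set(exit_[i], entry, exit_)
        denom = sum(math.exp(beta * x[j]) for j in at_risk)
        ll += beta * x[i] - math.log(denom)
    return ll


# Hypothetical toy data: ages in years, x = allele count.
entry = [30, 40, 50, 35]
exit_ = [60, 55, 70, 45]
event = [1, 0, 1, 1]
x = [1, 0, 2, 1]

print(risk_set(45, entry, exit_))
print(cox_partial_loglik(0.0, entry, exit_, event, x))
```

In practice one would fit such a model with an established survival library that supports delayed entry, rather than maximizing this function by hand.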


2020 ◽  
Vol 59 (14) ◽  
pp. 1274-1281
Author(s):  
Christine B. San Giovanni ◽  
Myla Ebeling ◽  
Robert A. Davis ◽  
C. Shaun Wagner ◽  
William T. Basco

Objective. This study tested the sensitivity of obesity diagnosis in electronic health records (EHRs) using body mass index (BMI) classification and identified variables associated with obesity diagnosis. Methods. Eligible children aged 2 to 18 years had a calculable BMI in 2017 and had at least 1 visit in 2016 and 2017. Sensitivity of clinical obesity diagnosis compared with children’s BMI percentile was calculated. Logistic regression was performed to determine variables associated with obesity diagnosis. Results. Analyses included 31 059 children with BMI at or above 95th percentile. Sensitivity of clinical obesity diagnosis was 35.81%. Clinical obesity diagnosis was more likely if the child had a well visit, had Medicaid insurance, was female, Hispanic or Black, had a chronic disease diagnosis, and saw a provider in a practice in an urban area or with academic affiliation. Conclusion. Sensitivity of clinical obesity diagnosis in EHR is low. Clinical obesity diagnosis is associated with nonmodifiable child-specific factors but also modifiable practice-specific factors.
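For readers unfamiliar with the metric, sensitivity here is the share of children meeting the BMI criterion who also carry a clinical obesity diagnosis. The sketch below shows the calculation on hypothetical toy records; the function name and the data are illustrative assumptions and do not reproduce the study's counts.

```python
def sensitivity(meets_bmi_criterion, has_diagnosis):
    """Sensitivity = true positives / (true positives + false negatives).

    meets_bmi_criterion: booleans, True if BMI is at or above the 95th percentile
    has_diagnosis:       booleans, True if a clinical obesity diagnosis is recorded
    """
    true_pos = sum(1 for b, d in zip(meets_bmi_criterion, has_diagnosis) if b and d)
    false_neg = sum(1 for b, d in zip(meets_bmi_criterion, has_diagnosis) if b and not d)
    if true_pos + false_neg == 0:
        raise ValueError("no children meet the BMI criterion")
    return true_pos / (true_pos + false_neg)


# Hypothetical toy records, not study data.
bmi_flag = [True, True, True, True, False]
dx_flag = [True, False, False, True, False]
print(sensitivity(bmi_flag, dx_flag))
```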


BMC Medicine ◽  
2019 ◽  
Vol 17 (1) ◽  
Author(s):  
B. D. Nicholson ◽  
P. Aveyard ◽  
C. R. Bankhead ◽  
W. Hamilton ◽  
F. D. R. Hobbs ◽  
...  

Abstract Background Excess weight and unexpected weight loss are associated with multiple disease states and increased morbidity and mortality, but weight measurement is not routine in many primary care settings. The aim of this study was to characterise who has had their weight recorded in UK primary care, how frequently, by whom and in relation to which clinical events, symptoms and diagnoses. Methods A longitudinal analysis of UK primary care electronic health records (EHR) data from 2000 to 2017. Descriptive statistics were used to summarise weight recording in terms of patient sociodemographic characteristics, health professional encounters, clinical events, symptoms and diagnoses. Negative binomial regression was used to model the likelihood of having a weight record each year, and Cox regression to model the likelihood of repeated weight recording. Results A total of 14,049,871 weight records were identified in the EHR of 4,918,746 patients during the study period, representing 26,998,591 person-years of observation. Around a third of patients had a weight record each year. Forty-nine percent of weight records were repeated within a year with an average time to a repeat weight record of 1.92 years. Weight records were most often taken by nursing staff (38–42%) and GPs (37–39%) as part of routine clinical care, such as chronic disease reviews (16%), medication reviews (6–8%) and health checks (6–7%), or were associated with consultations for contraception (5–8%), respiratory disease (5%) and obesity (1%). Patient characteristics independently associated with an increased likelihood of weight recording were as follows: female sex, younger and older adults, non-drinkers, ex-smokers, low or high BMI, being more deprived, diagnosed with a greater number of comorbidities and consulting more frequently. The effect of policy-level incentives to record weight did not appear to be sustained after they were removed.
Conclusion Weight recording is not a routine activity in UK primary care. It is recorded for around a third of patients each year and is repeated on average every 2 years for these patients. It is more common in females with higher BMI and in those with comorbidity. Incentive payments and their removal appear to be associated with increases and decreases in weight recording.


Allergy ◽  
2016 ◽  
Vol 71 (9) ◽  
pp. 1305-1313 ◽  
Author(s):  
L. Zhou ◽  
N. Dhopeshwarkar ◽  
K. G. Blumenthal ◽  
F. Goss ◽  
M. Topaz ◽  
...  

10.2196/21798 ◽  
2020 ◽  
Vol 8 (10) ◽  
pp. e21798 ◽  
Author(s):  
Feng Xie ◽  
Bibhas Chakraborty ◽  
Marcus Eng Hock Ong ◽  
Benjamin Alan Goldstein ◽  
Nan Liu

Background Risk scores can be useful in clinical risk stratification and accurate allocations of medical resources, helping health providers improve patient care. Point-based scores are more understandable and explainable than other complex models and are now widely used in clinical decision making. However, the development of the risk scoring model is nontrivial and has not yet been systematically presented, with few studies investigating methods of clinical score generation using electronic health records. Objective This study aims to propose AutoScore, a machine learning–based automatic clinical score generator consisting of 6 modules for developing interpretable point-based scores. Future users can employ the AutoScore framework to create clinical scores effortlessly in various clinical applications. Methods We proposed the AutoScore framework comprising 6 modules that included variable ranking, variable transformation, score derivation, model selection, score fine-tuning, and model evaluation. To demonstrate the performance of AutoScore, we used data from the Beth Israel Deaconess Medical Center to build a scoring model for mortality prediction and then compared it with other baseline models using the receiver operating characteristic analysis. A software package in R 3.5.3 (R Foundation) was also developed to demonstrate the implementation of AutoScore. Results Implemented on the data set with 44,918 individual admission episodes of intensive care, the AutoScore-created scoring models performed comparably to other standard methods (ie, logistic regression, stepwise regression, least absolute shrinkage and selection operator, and random forest) in terms of predictive accuracy and model calibration but required fewer predictors and presented high interpretability and accessibility.
The nine-variable, AutoScore-created, point-based scoring model achieved an area under the curve (AUC) of 0.780 (95% CI 0.764-0.798), whereas the model of logistic regression with 24 variables had an AUC of 0.778 (95% CI 0.760-0.795). Moreover, the AutoScore framework also drives the clinical research continuum and automation with its integration of all necessary modules. Conclusions We developed an easy-to-use, machine learning–based automatic clinical score generator, AutoScore; systematically presented its structure; and demonstrated its superiority (predictive performance and interpretability) over other conventional methods using a benchmark database. AutoScore will emerge as a potential scoring tool in various medical applications.


2016 ◽  
Author(s):  
Fernanda Polubriaginof ◽  
Rami Vanguri ◽  
Kayla Quinnies ◽  
Gillian M. Belbin ◽  
Alexandre Yahi ◽  
...  

Abstract Heritability is essential for understanding the biological causes of disease, but requires laborious patient recruitment and phenotype ascertainment. Electronic health records (EHR) passively capture a wide range of clinically relevant data and provide a novel resource for studying the heritability of traits that are not typically accessible. EHRs contain next-of-kin information collected via patient emergency contact forms, but until now, these data have gone unused in research. We mined emergency contact data at three academic medical centers and identified millions of familial relationships while maintaining patient privacy. Identified relationships were consistent with genetically-derived relatedness. We used EHR data to compute heritability estimates for 500 disease phenotypes. Overall, estimates were consistent with literature and between sites. Inconsistencies were indicative of limitations and opportunities unique to EHR research. These analyses provide a novel validation of the use of EHRs for genetics and disease research. One Sentence Summary We demonstrate that next-of-kin information can be used to identify familial relationships in the EHR, providing unique opportunities for precision medicine studies.


Informatics ◽  
2020 ◽  
Vol 7 (2) ◽  
pp. 17 ◽  
Author(s):  
Sheikh S. Abdullah ◽  
Neda Rostamzadeh ◽  
Kamran Sedig ◽  
Amit X. Garg ◽  
Eric McArthur

Recent advances in EHR-based (Electronic Health Record) systems have resulted in data being produced at an unprecedented rate. The complex, growing, and high-dimensional data available in EHRs create great opportunities for machine learning techniques such as clustering. Cluster analysis often requires dimension reduction to achieve efficient processing time and mitigate the curse of dimensionality. Given a wide range of techniques for dimension reduction and cluster analysis, it is not straightforward to identify which combination of techniques from both families leads to the desired result. The ability to derive useful and precise insights from EHRs requires a deeper understanding of the data, intermediary results, configuration parameters, and analysis processes. Although these tasks are often tackled separately in existing studies, we present a visual analytics (VA) system, called Visual Analytics for Cluster Analysis and Dimension Reduction of High Dimensional Electronic Health Records (VALENCIA), to address the challenges of high-dimensional EHRs in a single system. VALENCIA brings together a wide range of cluster analysis and dimension reduction techniques, integrates them seamlessly, and makes them accessible to users through interactive visualizations. It offers a balanced distribution of processing load between users and the system to facilitate the performance of high-level cognitive tasks in such a way that would be difficult without the aid of a VA system. Through a real case study, we have demonstrated how VALENCIA can be used to analyze the healthcare administrative dataset stored at ICES. This research also highlights what needs to be considered in the future when developing VA systems that are designed to derive deep and novel insights into EHRs.


2021 ◽  
Author(s):  
Halie M. Rando ◽  
Tellen D. Bennett ◽  
James Brian Byrd ◽  
Carolyn Bramante ◽  
Tiffany J. Callahan ◽  
...  

Since late 2019, the novel coronavirus SARS-CoV-2 has introduced a wide array of health challenges globally. In addition to a complex acute presentation that can affect multiple organ systems, increasing evidence points to long-term sequelae being common and impactful. As the worldwide scientific community forges ahead with efforts to characterize a wide range of outcomes associated with SARS-CoV-2 infection, the proliferation of available data has made it clear that formal definitions are needed in order to design robust and consistent studies of Long COVID that consistently capture variation in long-term outcomes. In the present study, we investigate the definitions used in the literature published to date and compare them against data available from electronic health records and patient-reported information collected via surveys. Long COVID holds the potential to produce a second public health crisis on the heels of the pandemic. Proactive efforts to identify the characteristics of this heterogeneous condition are imperative for a rigorous scientific effort to investigate and mitigate this threat.


2020 ◽  
Author(s):  
Carlos Sáez ◽  
Alba Gutiérrez-Sacristán ◽  
Isaac Kohane ◽  
Juan M García-Gómez ◽  
Paul Avillach

Abstract Background Temporal variability in healthcare processes or protocols is intrinsic to medicine. Such variability can potentially introduce dataset shifts, a data quality issue when reusing electronic health records (EHRs) for secondary purposes. Temporal dataset shifts can present as trends, abrupt or seasonal changes in the statistical distributions of data over time, being particularly complex to address in multi-modal and highly coded data. These changes, if not delineated, can harm population and data-driven research, such as machine learning. Given that biomedical research repositories are increasingly being populated with large historical data from EHRs, there is a need for specific software methods to help delineate temporal dataset shifts to ensure reliable data reuse. Findings EHRtemporalVariability is an Open Source R-package and Shiny-app designed to explore and identify temporal dataset shifts. EHRtemporalVariability estimates the statistical distributions of coded and numerical data over time, projects their temporal-evolution through non-parametric Information Geometric Temporal plots, and enables the exploration of changes in variables through Data Temporal Heatmaps. We demonstrate the capability of EHRtemporalVariability to delineate dataset shifts in three impact case studies, one of them available for reproducibility. Conclusions EHRtemporalVariability enables the exploration and identification of dataset shifts, contributing to the broad examination and repurposing of large, longitudinal datasets. Our goal is to help ensure reliable data reuse for a wide range of biomedical data users.
EHRtemporalVariability is suited to technical users programmatically using the R-package and to those users not familiar with programming using the Shiny user interface. Availability https://github.com/hms-dbmi/EHRtemporalVariability/ Reproducible vignette: https://cran.r-project.org/web/packages/EHRtemporalVariability/vignettes/EHRtemporalVariability.html On-line demo: http://ehrtemporalvariability.upv.es/


2019 ◽  
Author(s):  
Lauren J. Beesley ◽  
Bhramar Mukherjee

Abstract Health research using electronic health records (EHR) has gained popularity, but misclassification of EHR-derived disease status and lack of representativeness of the study sample can result in substantial bias in effect estimates and can impact power and type I error. In this paper, we develop new strategies for handling disease status misclassification and selection bias in EHR-based association studies. We first focus on each type of bias separately. For misclassification, we propose three novel likelihood-based bias correction strategies. A distinguishing feature of the EHR setting is that misclassification may be related to patient-specific factors, and the proposed methods leverage data in the EHR to estimate misclassification rates without gold standard labels. For addressing selection bias, we describe how calibration and inverse probability weighting methods from the survey sampling literature can be extended and applied to the EHR setting. Addressing misclassification and selection biases simultaneously is a more challenging problem than dealing with each on its own, and we propose several new strategies to address this situation. For all methods proposed, we derive valid standard errors and provide software for implementation. We provide a new suite of statistical estimation and inference strategies for addressing misclassification and selection bias simultaneously that is tailored to problems arising in EHR data analysis. We apply these methods to data from The Michigan Genomics Initiative (MGI), a longitudinal EHR-linked biorepository.
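As a generic illustration of inverse probability weighting (the textbook survey-sampling idea, not the specific estimators developed in this paper), each sampled patient can be weighted by the reciprocal of their probability of being included in the EHR sample. The function name, the toy values, and the assumption that selection probabilities are known are all hypothetical.

```python
def ipw_mean(values, selection_probs):
    """Weighted mean with weights 1 / P(selected).

    values:          outcome for each sampled patient (e.g. 1 = has disease)
    selection_probs: assumed-known probability that each patient entered the sample
    Patients who were unlikely to be sampled receive larger weights, so that
    the sample better resembles the target population.
    """
    if any(p <= 0 or p > 1 for p in selection_probs):
        raise ValueError("selection probabilities must lie in (0, 1]")
    weights = [1.0 / p for p in selection_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical toy sample: two cases with 50% selection probability,
# one control with 25% selection probability.
disease = [1, 0, 1]
p_selected = [0.5, 0.25, 0.5]
print(ipw_mean(disease, p_selected))
```

The unweighted prevalence in this toy sample would be 2/3; weighting the under-sampled control more heavily pulls the estimate down.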

