Statistical inference for association studies using electronic health records: handling both selection bias and outcome misclassification

10.1101/2019.12.26.19015859 ◽

2019 ◽

Author(s):

Lauren J. Beesley ◽

Bhramar Mukherjee

Keyword(s):

Electronic Health Records ◽

Selection Bias ◽

Type I Error ◽

Association Studies ◽

Disease Status ◽

Patient Specific ◽

Type I ◽

Health Records ◽

Electronic Health ◽

New Strategies

Abstract: Health research using electronic health records (EHR) has gained popularity, but misclassification of EHR-derived disease status and lack of representativeness of the study sample can result in substantial bias in effect estimates and can impact power and type I error. In this paper, we develop new strategies for handling disease status misclassification and selection bias in EHR-based association studies. We first focus on each type of bias separately. For misclassification, we propose three novel likelihood-based bias correction strategies. A distinguishing feature of the EHR setting is that misclassification may be related to patient-specific factors, and the proposed methods leverage data in the EHR to estimate misclassification rates without gold standard labels. For addressing selection bias, we describe how calibration and inverse probability weighting methods from the survey sampling literature can be extended and applied to the EHR setting. Addressing misclassification and selection biases simultaneously is a more challenging problem than dealing with each on its own, and we propose several new strategies to address this situation. For all methods proposed, we derive valid standard errors and provide software for implementation. We provide a new suite of statistical estimation and inference strategies for addressing misclassification and selection bias simultaneously that is tailored to problems arising in EHR data analysis. We apply these methods to data from The Michigan Genomics Initiative (MGI), a longitudinal EHR-linked biorepository.

Clinical Research Informatics: Contributions from 2017

Yearbook of Medical Informatics ◽

10.1055/s-0038-1641220 ◽

2018 ◽

Vol 27 (01) ◽

pp. 177-183 ◽

Cited By ~ 1

Author(s):

Christel Daniel ◽

Dipak Kalra ◽

Keyword(s):

Electronic Health Records ◽

Clinical Research ◽

Association Studies ◽

Bias Reduction ◽

Lessons Learned ◽

Editorial Team ◽

Private Industry ◽

Health Records ◽

Clinical Research Informatics ◽

Electronic Health

Objectives: To summarize key contributions to current research in the field of Clinical Research Informatics (CRI) and to select best papers published in 2017. Method: A bibliographic search using a combination of MeSH descriptors and free terms on CRI was performed using PubMed, followed by a double-blind review in order to select a list of candidate best papers to be then peer-reviewed by external reviewers. A consensus meeting between the two section editors and the editorial team was organized to finally conclude on the selection of best papers. Results: Among the 741 returned papers published in 2017 in the various areas of CRI, the full review process selected five best papers. The first best paper reports on the implementation of consent management considering patient preferences for the use of de-identified data of electronic health records for research. The second best paper describes an approach using natural language processing to extract symptoms of severe mental illness from clinical text. The authors of the third best paper describe the challenges and lessons learned when leveraging the EHR4CR platform to support patient inclusion in academic studies in the context of an important collaboration between private industry and public health institutions. The fourth best paper describes a method and an interactive tool for case-crossover analyses of electronic medical records for patient safety. The last best paper proposes a new method for bias reduction in association studies using electronic health records data. Conclusions: Research in the CRI field continues to accelerate and to mature, leading to tools and platforms deployed at national or international scales with encouraging results. Beyond securing these new platforms for exploiting large-scale health data, another major challenge is the limitation of biases related to the use of “real-world” data. Controlling these biases is a prerequisite for the development of learning health systems.

Adjusting for selection bias due to missing data in electronic health records-based research

Statistical Methods in Medical Research ◽

10.1177/09622802211027601 ◽

2021 ◽

Vol 30 (10) ◽

pp. 2221-2238

Author(s):

Sarah B Peskoe ◽

David Arterburn ◽

Karen J Coleman ◽

Lisa J Herrinton ◽

Michael J Daniels ◽

...

Keyword(s):

Missing Data ◽

Electronic Health Records ◽

Selection Bias ◽

Inverse Probability Weighting ◽

Small Sample ◽

Data Provenance ◽

Probability Weighting ◽

Health Records ◽

Inverse Probability ◽

Electronic Health

While electronic health records data provide unique opportunities for research, numerous methodological issues must be considered. Among these, selection bias due to incomplete/missing data has received far less attention than other issues. Unfortunately, standard missing data approaches (e.g. inverse-probability weighting and multiple imputation) generally fail to acknowledge the complex interplay of heterogeneous decisions made by patients, providers, and health systems that govern whether specific data elements in the electronic health records are observed. This, in turn, renders the missing-at-random assumption difficult to believe in standard approaches. In the clinical literature, the collection of decisions that gives rise to the observed data is referred to as the data provenance. Building on a recently-proposed framework for modularizing the data provenance, we develop a general and scalable framework for estimation and inference with respect to regression models based on inverse-probability weighting that allows for a hierarchy of missingness mechanisms to better align with the complex nature of electronic health records data. We show that the proposed estimator is consistent and asymptotically Normal, derive the form of the asymptotic variance, and propose two consistent estimators. Simulations show that naïve application of standard methods may yield biased point estimates, that the proposed estimators have good small-sample properties, and that researchers may have to contend with a bias-variance trade-off as they consider how to handle missing data. The proposed methods are motivated by an on-going, electronic health records-based study of bariatric surgery.
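To make the inverse-probability weighting idea mentioned in this abstract concrete, here is a minimal Python sketch of the standard single-mechanism version. It is not the hierarchical estimator the authors propose. The function name `ipw_mean`, the toy data, and the assumption that each subject's observation probability is known (rather than estimated) are all illustrative.

```python
def ipw_mean(values, observed, obs_prob):
    """Normalized inverse-probability-weighted mean of an outcome.

    values   : outcome per subject (ignored where the outcome is not observed)
    observed : True if the subject's outcome is recorded
    obs_prob : assumed-known probability that the outcome is recorded
    """
    num = 0.0
    den = 0.0
    for y, r, p in zip(values, observed, obs_prob):
        if r:
            w = 1.0 / p  # each observed subject stands in for 1/p subjects
            num += w * y
            den += w
    return num / den


# Hypothetical toy data: subjects with outcome 1 are recorded half the time,
# subjects with outcome 0 are always recorded. The true mean is 0.5.
values = [1, 1, 1, 1, 0, 0, 0, 0]
observed = [True, True, False, False, True, True, True, True]
obs_prob = [0.5] * 4 + [1.0] * 4

# Complete-case mean: biased downward because outcome-1 subjects are under-observed.
naive_mean = sum(y for y, r in zip(values, observed) if r) / sum(observed)
weighted_mean = ipw_mean(values, observed, obs_prob)
```

In this toy example the complete-case mean is 1/3, while the weighted mean recovers 0.5. In practice the observation probabilities must be modeled, which is where the missingness assumptions discussed in the abstract come in.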

Importance of quality control in ‘big data’: implications for statistical inference of electronic health records in clinical cardiology

Cardiovascular Research ◽

10.1093/cvr/cvy290 ◽

2019 ◽

Vol 115 (6) ◽

pp. e63-e65

Author(s):

Glen P Martin ◽

Mamas A Mamas

Keyword(s):

Quality Control ◽

Big Data ◽

Electronic Health Records ◽

Statistical Inference ◽

Health Records ◽

Clinical Cardiology ◽

Electronic Health

Phenotype validation in electronic health records based genetic association studies

Genetic Epidemiology ◽

10.1002/gepi.22080 ◽

2017 ◽

Vol 41 (8) ◽

pp. 790-800 ◽

Cited By ~ 6

Author(s):

Lu Wang ◽

Scott M. Damrauer ◽

Hong Zhang ◽

Alan X. Zhang ◽

Rui Xiao ◽

...

Keyword(s):

Electronic Health Records ◽

Genetic Association ◽

Association Studies ◽

Genetic Association Studies ◽

Health Records ◽

Electronic Health

Unlocking the Potential of Electronic Health Records for Translational Research

Yearbook of Medical Informatics ◽

10.1055/s-0038-1639444 ◽

2012 ◽

Vol 21 (01) ◽

pp. 135-138 ◽

Cited By ~ 1

Author(s):

Y. L. Yip ◽

Keyword(s):

Electronic Health Records ◽

Translational Research ◽

Large Scale ◽

Association Studies ◽

System Level ◽

Biomedical Knowledge ◽

Health Records ◽

Excellent Research ◽

Electronic Health ◽

Translational Informatics

Summary: To review current excellent research and trends in the field of bioinformatics and translational informatics with direct application in the medical domain. Synopsis of the articles selected for the IMIA Yearbook 2012. Six excellent articles were selected in this Yearbook’s section on Bioinformatics and Translational Informatics. They exemplify current key advances in the use of patient information for translational research and health surveillance. First, two proof-of-concept studies demonstrated the cross-institutional and cross-geographic use of Electronic Health Records (EHR) for identifying clinical trial subjects and detecting drug safety signals. These reports pave the way for global large-scale population monitoring. Second, there is further evidence of the importance of coupling phenotypic information in EHR with genotypic information (either in biobanks or in gene association studies) for new biomedical knowledge discovery. Third, patient data gathered via social media and self-reporting was found to be comparable to existing data and less labor-intensive to collect. This alternative means could potentially overcome data collection challenges in cohort and prospective studies. Finally, metagenomic studies are gaining momentum in bioinformatics, and system-level analysis of the human microbiome sheds important light on certain human diseases. The current literature showed that traditional bench-to-bedside translational research is increasingly being complemented by the reverse approach, in which bedside information is used to provide novel biomedical insights.

A Modeling Framework for Exploring Sampling and Observation Process Biases in Genome and Phenome-wide Association Studies using Electronic Health Records

10.1101/499392 ◽

2018 ◽

Cited By ~ 2

Author(s):

Lauren J. Beesley ◽

Lars G. Fritsche ◽

Bhramar Mukherjee

Keyword(s):

Sensitivity Analysis ◽

Electronic Health Records ◽

Large Scale ◽

Association Studies ◽

Modeling Framework ◽

Health Records ◽

Association Analyses ◽

Observation Process ◽

Special Cases ◽

Electronic Health

Abstract: Large-scale agnostic association analyses based on existing observational health care databases such as electronic health records have been a topic of increasing interest in the scientific community. However, particular challenges of non-probability sampling and phenotype misclassification associated with the use of these data sources are often ignored in standard analyses. In general, the extent of the bias that may be introduced by ignoring these factors is unknown. In this paper, we develop a statistical framework for characterizing the degree of bias expected in association studies based on electronic health records when disease status misclassification and the sampling mechanism are ignored. Through a sensitivity analysis type approach, this framework can be used to obtain plausible values for parameters of interest given results obtained from standard naive analysis methods under varying degrees of misclassification and sampling biases. We develop an online tool for performing this sensitivity analysis in some special cases that occur frequently. Simulations demonstrate promising properties of the proposed way of characterizing biases. We apply our approach to study bias in genetic association studies using data from the Michigan Genomics Initiative, a longitudinal biorepository effort within Michigan Medicine.
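As a rough illustration of what a sensitivity analysis for outcome misclassification can look like, the Python sketch below applies the textbook Rogan-Gladen correction to observed disease proportions over a grid of assumed sensitivities. It is not the authors' framework or their online tool. It assumes non-differential misclassification (the same sensitivity and specificity in both exposure groups) and ignores selection bias entirely. All function names and numbers are hypothetical.

```python
def corrected_prevalence(p_obs, sens, spec):
    """Rogan-Gladen correction: recover true prevalence from observed prevalence.

    Uses p_obs = sens * p + (1 - spec) * (1 - p), solved for p.
    """
    denom = sens + spec - 1.0
    if denom <= 0:
        raise ValueError("sens + spec must exceed 1")
    p = (p_obs - (1.0 - spec)) / denom
    if not 0.0 < p < 1.0:
        raise ValueError("corrected prevalence falls outside (0, 1)")
    return p


def odds_ratio(p1, p0):
    """Odds ratio comparing disease proportions p1 (exposed) and p0 (unexposed)."""
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))


def sensitivity_grid(p_obs_exposed, p_obs_unexposed, sens_values, spec=1.0):
    """Corrected odds ratio for each assumed sensitivity (illustrative only)."""
    return {
        s: odds_ratio(
            corrected_prevalence(p_obs_exposed, s, spec),
            corrected_prevalence(p_obs_unexposed, s, spec),
        )
        for s in sens_values
    }


# Hypothetical observed proportions of EHR-derived cases by exposure group.
naive_or = odds_ratio(0.10, 0.05)
grid = sensitivity_grid(0.10, 0.05, sens_values=[1.0, 0.75, 0.5])
```

With perfect specificity, lowering the assumed sensitivity moves the corrected odds ratio away from the naive estimate, which is the kind of range of plausible values a sensitivity analysis reports.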

An analytic framework for exploring sampling and observation process biases in genome and phenome‐wide association studies using electronic health records

Statistics in Medicine ◽

10.1002/sim.8524 ◽

2020 ◽

Vol 39 (14) ◽

pp. 1965-1979 ◽

Cited By ~ 1

Author(s):

Lauren J. Beesley ◽

Lars G. Fritsche ◽

Bhramar Mukherjee

Keyword(s):

Electronic Health Records ◽

Association Studies ◽

Analytic Framework ◽

Health Records ◽

Observation Process ◽

Electronic Health

Genetic validation of bipolar disorder identified by automated phenotyping using electronic health records

10.1101/193011 ◽

2017 ◽

Author(s):

Chia-Yen Chen ◽

Phil H. Lee ◽

Victor M. Castro ◽

Jessica Minnier ◽

Alexander W. Charney ◽

...

Keyword(s):

Bipolar Disorder ◽

Electronic Health Records ◽

Genetic Correlation ◽

High Throughput ◽

Association Studies ◽

European Ancestry ◽

Health Records ◽

Total N ◽

High Throughput Phenotyping ◽

Electronic Health

Abstract: Bipolar disorder (BD) is a heritable mood disorder characterized by episodes of mania and depression. Although genomewide association studies (GWAS) have successfully identified genetic loci contributing to BD risk, sample size has become a rate-limiting obstacle to genetic discovery. Electronic health records (EHRs) represent a vast but relatively untapped resource for high-throughput phenotyping. As part of the International Cohort Collection for Bipolar Disorder (ICCBD), we previously validated automated EHR-based phenotyping algorithms for BD against in-person diagnostic interviews (Castro et al. 2015). Here, we establish the genetic validity of these phenotypes by determining their genetic correlation with traditionally-ascertained samples. Case and control algorithms were derived from structured and narrative text in the Partners Healthcare system comprising more than 4.6 million patients over 20 years. Genomewide genotype data for 3,330 BD cases and 3,952 controls of European ancestry were used to estimate SNP-based heritability (h2g) and genetic correlation (rg) between EHR-based phenotype definitions and traditionally-ascertained BD cases in GWAS by the ICCBD and Psychiatric Genomics Consortium (PGC) using LD score regression. We evaluated BD cases identified using 4 EHR-based algorithms: an NLP-based algorithm (95-NLP) and 3 rule-based algorithms using codified EHR with decreasing levels of stringency - “coded-strict”, “coded-broad”, and “coded-broad based on a single clinical encounter” (coded-broad-SV). The analytic sample comprised 862 95-NLP, 1,968 coded-strict, 2,581 coded-broad, 408 coded-broad-SV BD cases, and 3,952 controls. The estimated h2g were 0.24 (p=0.015), 0.09 (p=0.064), 0.13 (p=0.003), 0.00 (p=0.591) for 95-NLP, coded-strict, coded-broad and coded-broad-SV BD, respectively. The h2g for all EHR-based cases combined except coded-broad-SV (excluded due to 0 h2g) was 0.12 (p=0.004). These h2g were lower than or similar to the h2g observed by the ICCBD+PGCBD (0.23, p=3.17E-80, total N=33,181). However, the rg between ICCBD+PGCBD and the EHR-based cases were high for 95-NLP (0.66, p=3.69×10^-5), coded-strict (1.00, p=2.40×10^-4), and coded-broad (0.74, p=8.11×10^-7). The rg between EHR-based BDs ranged from 0.90 to 0.98. These results provide the first genetic validation of automated EHR-based phenotyping for BD and suggest that this approach identifies cases that are highly genetically correlated with those ascertained through conventional methods. High throughput phenotyping using the large data resources available in EHRs represents a viable method for accelerating psychiatric genetic research.

The Role of Electronic Health Records in Advancing Genomic Medicine

Annual Review of Genomics and Human Genetics ◽

10.1146/annurev-genom-121120-125204 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jodell E. Linder ◽

Lisa Bastarache ◽

Jacob J. Hughey ◽

Josh F. Peterson

Keyword(s):

Electronic Health Records ◽

Association Studies ◽

Clinical Care ◽

Genomic Medicine ◽

Risk Scores ◽

Annual Review ◽

Publication Date ◽

Health Records ◽

Electronic Health

Recent advances in genomic technology and widespread adoption of electronic health records (EHRs) have accelerated the development of genomic medicine, bringing promising research findings from genome science into clinical practice. Genomic and phenomic data, accrued across large populations through biobanks linked to EHRs, have enabled the study of genetic variation at a phenome-wide scale. Through new quantitative techniques, pleiotropy can be explored with phenome-wide association studies, the occurrence of common complex diseases can be predicted using the cumulative influence of many genetic variants (polygenic risk scores), and undiagnosed Mendelian syndromes can be identified using EHR-based phenotypic signatures (phenotype risk scores). In this review, we trace the role of EHRs from the development of genome-wide analytic techniques to translational efforts to test these new interventions to the clinic. Throughout, we describe the challenges that remain when combining EHRs with genetics to improve clinical care. Expected final online publication date for the Annual Review of Genomics and Human Genetics, Volume 22 is August 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
