A Modeling Framework for Exploring Sampling and Observation Process Biases in Genome and Phenome-wide Association Studies using Electronic Health Records
Abstract
Large-scale agnostic association analyses based on existing observational health care databases such as electronic health records have been a topic of increasing interest in the scientific community. However, the particular challenges of non-probability sampling and phenotype misclassification associated with the use of these data sources are often ignored in standard analyses. In general, the extent of the bias that may be introduced by ignoring these factors is unknown. In this paper, we develop a statistical framework for characterizing the degree of bias expected in association studies based on electronic health records when disease status misclassification and the sampling mechanism are ignored. Through a sensitivity-analysis-type approach, this framework can be used to obtain plausible values for parameters of interest, given results obtained from standard naive analysis methods, under varying degrees of misclassification and sampling bias. We develop an online tool for performing this sensitivity analysis in some commonly occurring special cases. Simulations demonstrate promising properties of the proposed approach to characterizing bias. We apply our approach to study bias in genetic association studies using data from the Michigan Genomics Initiative, a longitudinal biorepository effort within Michigan Medicine.