Accurate modeling of replication rates in genome-wide association studies by accounting for winner’s curse and study-specific heterogeneity

2019 ◽  
Author(s):  
Jennifer Zou ◽  
Jinjing Zhou ◽  
Sarah Faller ◽  
Robert Brown ◽  
Eleazar Eskin

Abstract: Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with complex human traits, but only a fraction of variants identified in discovery studies achieve significance in replication studies. Replication in GWAS has been well studied in the context of winner’s curse, which is the inflation of effect size estimates for significant variants in a study. Multiple methods have been proposed to correct for the effects of winner’s curse. However, winner’s curse is often not sufficient to explain lack of replication. Another reason why studies fail to replicate is that there are fundamental differences between the discovery and replication studies. A confounding factor can create the appearance of a significant finding that is actually an artifact and will not replicate in future studies. We propose a statistical framework that utilizes GWAS replication studies to model winner’s curse and study-specific heterogeneity due to confounders and correct for these effects. We show through simulations and application to 100 human GWAS data sets that modeling both winner’s curse and study-specific heterogeneity explains observed patterns of replication in GWAS better than modeling winner’s curse alone.
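The inflation described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' framework: it assumes a single variant whose estimated effect is normally distributed around a fixed true effect, and the function name, effect size, standard error, and number of draws are all hypothetical values chosen for illustration. It only shows that averaging the estimates that pass a significance threshold overstates the true effect.

```python
import random
import statistics
from statistics import NormalDist


def simulate_winners_curse(true_effect=0.08, std_error=0.02, alpha=5e-8,
                           n_draws=100_000, seed=0):
    """Return (mean of all estimates, mean of the significant estimates).

    Assumption: each estimate is drawn from Normal(true_effect, std_error).
    """
    rng = random.Random(seed)
    # Two-sided critical z-value for the chosen significance threshold.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_draws)]
    # Keep only the draws that would have been declared significant.
    significant = [b for b in estimates if abs(b / std_error) > z_crit]
    return statistics.fmean(estimates), statistics.fmean(significant)


mean_all, mean_significant = simulate_winners_curse()
print(f"mean of all estimates:         {mean_all:.4f}")
print(f"mean of significant estimates: {mean_significant:.4f}")
```

Under these assumptions the mean over all draws stays close to the true effect, while the mean over significant draws must exceed the threshold value `z_crit * std_error` and is therefore biased upward.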

2017 ◽  
Author(s):  
Cameron Palmer ◽  
Itsik Pe’er

Abstract: Genome-wide association studies (GWAS) have identified hundreds of SNPs responsible for variation in human quantitative traits. However, genome-wide-significant associations often fail to replicate across independent cohorts, seemingly at odds with their strong effects in discovery cohorts. This limited success of replication raises pervasive questions about the utility of the GWAS field. We identify all 332 studies of quantitative traits from the NHGRI-EBI GWAS Database with attempted replication. We find that the majority of studies provide insufficient data to evaluate replication rates. The remaining papers replicate significantly worse than expected (p < 10⁻¹⁴), even when adjusting for regression-to-the-mean of effect size between discovery and replication cohorts, termed the Winner’s Curse (p < 10⁻¹⁶). We show this is due in part to misreporting replication cohort size as a maximum number, rather than a per-locus one. In 39 studies accurately reporting per-locus cohort size for attempted replication of 707 loci in samples with similar ancestry, replication rate matched expectation (predicted 458, observed 457, p = 0.94). In contrast, ancestry differences between replication and discovery (13 studies, 385 loci) cause the most highly powered decile of loci to replicate worse than expected, due to differences in linkage disequilibrium.

Author Summary: The majority of associations between common genetic variation and human traits come from genome-wide association studies, which have analyzed millions of single-nucleotide polymorphisms in millions of samples. These kinds of studies pose serious statistical challenges to discovering new associations. Finite resources restrict the number of candidate associations that can be brought forward into validation samples, introducing the need for a significance threshold. This threshold creates a phenomenon called the Winner’s Curse, in which candidate associations close to the discovery threshold are more likely to have biased overestimates of the variant’s true association in the sampled population. We survey all human quantitative trait association studies that validated at least one signal. We find the majority of these studies do not publish sufficient information to actually support their claims of replication. For studies that did, we computationally correct the Winner’s Curse and evaluate replication performance. While all variants combined replicate significantly less than expected, we find that the subset of studies that (1) perform both discovery and replication in samples of the same ancestry; and (2) report accurate per-variant sample sizes, replicate as expected. This study provides strong, rigorous evidence for the broad reliability of genome-wide association studies. We furthermore provide a model for more efficient selection of variants as candidates for replication, as selecting variants using cursed discovery data enriches for variants with little real evidence for trait association.
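A "predicted" number of replications, such as the one compared with the observed count above, can be thought of as a sum of per-locus replication power. The following is a minimal sketch under stated assumptions, not the paper's procedure: it assumes a one-sided z-test in the direction of the discovery effect, effect estimates that have already been corrected for the Winner’s Curse, and a per-locus replication standard error. The function names and the example loci are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def replication_power(effect, se_replication, alpha=0.05):
    """Power of a one-sided z-test in the direction of the discovery effect.

    Assumption: `effect` is the (corrected) effect size and
    `se_replication` is its standard error in the replication sample.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha)
    return _STD_NORMAL.cdf(abs(effect) / se_replication - z_crit)


def expected_replications(loci, alpha=0.05):
    """Sum per-locus power over (effect, se_replication) pairs."""
    return sum(replication_power(b, se, alpha) for b, se in loci)


# Hypothetical loci: (corrected effect, replication standard error).
example_loci = [(0.10, 0.02), (0.05, 0.02), (0.03, 0.03)]
print(f"expected replications: {expected_replications(example_loci):.2f}")
```

Because the standard error shrinks with the per-locus sample size, plugging in a study-wide maximum sample size instead of the per-locus one makes the predicted count too high, which is consistent with the misreporting issue the abstract describes.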


2017 ◽  
Vol 27 (9) ◽  
pp. 2795-2808 ◽  
Author(s):  
Wei Jiang ◽  
Weichuan Yu

In genome-wide association studies, we normally discover associations between genetic variants and diseases/traits in primary studies, and validate the findings in replication studies. We consider the associations identified in both primary and replication studies as true findings. An important question under this two-stage setting is how to determine significance levels in both studies. In traditional methods, significance levels of the primary and replication studies are determined separately. We argue that the separate determination strategy reduces the power in the overall two-stage study. Therefore, we propose a novel method to determine significance levels jointly. Our method is a reanalysis method that needs summary statistics from both studies. We find the most powerful significance levels when controlling the false discovery rate in the two-stage study. To benefit from the power improvement of the joint determination method, we need to select single nucleotide polymorphisms for replication at a less stringent significance level. This is a common practice in studies designed for discovery purposes. We suggest this practice is also suitable in studies designed for validation purposes in order to identify more true findings. Simulation experiments show that our method can provide more power than traditional methods and that the false discovery rate is well-controlled. Empirical experiments on datasets of five diseases/traits demonstrate that our method can help identify more associations. The R-package is available at: http://bioinformatics.ust.hk/RFdr.html .
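The effect of the primary-stage significance level on overall power can be sketched with a simple calculation. This is not the joint-determination method of the paper and it ignores false discovery rate control entirely: it assumes two independent samples, two-sided z-tests, and known non-centrality parameters, all of which are illustrative assumptions. The function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def z_test_power(ncp, alpha):
    """Two-sided z-test power for a given non-centrality parameter."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(ncp - z_crit) + _STD_NORMAL.cdf(-ncp - z_crit)


def two_stage_power(ncp_primary, ncp_replication,
                    alpha_primary, alpha_replication):
    """Probability of passing both stages, assuming independent samples."""
    return (z_test_power(ncp_primary, alpha_primary)
            * z_test_power(ncp_replication, alpha_replication))


# Same hypothetical variant, two choices of primary-stage threshold.
strict = two_stage_power(4.0, 4.0, alpha_primary=5e-8, alpha_replication=0.05)
relaxed = two_stage_power(4.0, 4.0, alpha_primary=1e-5, alpha_replication=0.05)
print(f"strict primary threshold:  {strict:.3f}")
print(f"relaxed primary threshold: {relaxed:.3f}")
```

In this toy setting a less stringent primary threshold raises the chance that a true association reaches and passes replication. Whether that gain is acceptable depends on the error rate of the whole two-stage procedure, which this sketch does not model.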


2019 ◽  
Vol 39 (10) ◽  
pp. 1925-1937 ◽  
Author(s):  
Ruth McPherson

Recent studies have led to a broader understanding of the genetic architecture of coronary artery disease and demonstrate that it largely derives from the cumulative effect of multiple common risk alleles individually of small effect size rather than rare variants with large effects on coronary artery disease risk. The tools applied include genome-wide association studies encompassing over 200 000 individuals complemented by bioinformatic approaches including imputation from whole-genome data sets, expression quantitative trait loci analyses, and interrogation of ENCODE (Encyclopedia of DNA Elements), Roadmap Epigenetic Project, and other data sets. Over 160 genome-wide significant loci associated with coronary artery disease risk have been identified using the genome-wide association studies approach, 90% of which are situated in intergenic regions. Here, I will describe, in part, our research over the last decade performed in collaboration with a series of bright trainees and an extensive number of groups and individuals around the world as it applies to our understanding of the genetic basis of this complex disease. These studies include computational approaches to better understand missing heritability and identify causal pathways, experimental approaches, and progress in understanding at the molecular level the function of the multiple risk loci identified and potential applications of these genomic data in clinical medicine and drug discovery.


2019 ◽  
Author(s):  
Jianan Zhan ◽  
Dan E. Arking ◽  
Joel S. Bader

Abstract: Biological experiments often involve hypothesis testing at the scale of thousands to millions of tests. Alleviating the multiple testing burden has been a goal of many methods designed to boost test power by focusing tests on the alternative hypotheses most likely to be true. Very often, these methods either explicitly or implicitly make use of prior probabilities that bias significance for favored sets thought to be enriched for significant findings. Nevertheless, most genomics experiments, and in particular genome-wide association studies (GWAS), still use traditional univariate tests rather than more sophisticated approaches. Here we use GWAS to demonstrate why unbiased tests remain in favor. We calculate test power assuming perfect knowledge of a prior distribution and then derive the population size increase required to provide the same boost without a prior. We show that population size is exponentially more important than prior, providing a rigorous explanation for the observed avoidance of prior-based methods.

Author summary: Biological experiments often test thousands to millions of hypotheses. Gene-based tests for human RNA-Seq data, for example, involve approximately 20,000; genome-wide association studies (GWAS) involve about 1 million effective tests. The conventional approach is to perform individual tests and then apply a Bonferroni correction to account for multiple testing. This approach implies a single-test p-value of 2.5 × 10⁻⁶ for RNA-Seq experiments, and a p-value of 5 × 10⁻⁸ for GWAS, to control the false-positive rate at a conventional value of 0.05. Many methods have been proposed to alleviate the multiple-testing burden by incorporating a prior probability that boosts the significance for a subset of candidate genes or variants. At the extreme limit, only the candidate set is tested, corresponding to a decreased multiple testing burden. Despite decades of methods development, prior-based tests have not been generally used. Here we compare the power increase possible with a prior with the increase possible with a much simpler strategy of increasing the study size. We show that increasing the population size is exponentially more valuable than increasing the strength of the prior, even when the true prior is known exactly. These results provide a rigorous explanation for the continued use of simple, robust methods rather than more sophisticated approaches.
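The two single-test thresholds quoted in the author summary follow directly from the Bonferroni correction, which divides the overall false-positive rate by the number of tests. A short worked example, using only the numbers given in the text (the function name is illustrative):

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test p-value threshold under a Bonferroni correction."""
    return family_alpha / n_tests


# About 20,000 gene-based tests for RNA-Seq, about 1 million effective GWAS tests.
rna_seq_threshold = bonferroni_threshold(0.05, 20_000)
gwas_threshold = bonferroni_threshold(0.05, 1_000_000)

print(f"RNA-Seq: {rna_seq_threshold:.1e}")  # 2.5e-06
print(f"GWAS:    {gwas_threshold:.1e}")     # 5.0e-08
```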


2021 ◽  
Author(s):  
Tonya M Brunetti ◽  
Nikita Pozdeyev ◽  
Michelle Daya ◽  
Kathleen C Barnes ◽  
Nicholas Rafaels ◽  
...  

SAIGE-Biobank Re-Usable SAIGE Helper (SAIGE-BRUSH) allows users with little computational expertise to utilize SAIGE for GWAS with parallelization and data collection on biobank data sets. This implementation requires no installation and has additional features not programmed within the original SAIGE framework, such as concurrency, reproducibility, reusability, scalability, filtering of association analysis results, and output plots. All of this is achieved without the user writing any code. This implementation is currently being utilized by the Biobank at the Colorado Center for Personalized Medicine (CCPM) on Google Cloud but is flexible enough for a number of architectures available to genetic analysts. Availability: This open source implementation is freely available at https://github.com/tbrunetti/SAIGE-BRUSH and is licensed under the MIT License. Contact: Chris Gignoux at [email protected] & Nick Rafaels at [email protected] Supplemental Material: For detailed user documentation, please visit https://saige-brush.readthedocs.io/en/latest/


2020 ◽  
Vol 65 (12) ◽  
pp. 874-884
Author(s):  
Kezhi Liu ◽  
Ling Zhu ◽  
Minglan Yu ◽  
Xuemei Liang ◽  
Jin Zhang ◽  
...  

Aims: Previous studies have inferred that there is a strong genetic component in insomnia. However, the etiology of insomnia is still unclear. This study systematically analyzed multiple genome-wide association study (GWAS) data sets with core human pathways and functional networks to detect potential gene pathways and networks associated with insomnia. Methods: We used a novel method, multitrait analysis of genome-wide association studies (MTAG), to combine 3 large GWASs of insomnia symptoms/complaints and sleep duration. The i-Gsea4GwasV2 and Reactome FI programs were used to analyze data from the result of MTAG analysis and the nominally significant pathways, respectively. Results: Through analyzing data sets using the MTAG program, our sample size increased from 113,006 subjects to 163,188 subjects. A total of 17 of 1,816 Reactome pathways were identified and shown to be associated with insomnia. We further revealed 11 interconnected functional and topologically interacting clusters (Clusters 0 to 10) that were associated with insomnia. Based on the brain transcriptome data, it was found that the genes in Cluster 4 were enriched for the transcriptional coexpression profile in the prenatal dorsolateral prefrontal cortex (P = 7 × 10⁻⁵), inferolateral temporal cortex (P = 0.02), medial prefrontal cortex (P < 1 × 10⁻⁵), and amygdala (P < 1 × 10⁻⁵), and RPA2, ORC6, PIAS3, and PRIM2 were detected as core nodes in these 4 brain regions. Conclusions: The findings provided new genes, pathways, and brain regions to understand the pathology of insomnia.

