Annotation-Informed Causal Mixture Modeling (AI-MiXeR) reveals phenotype-specific differences in polygenicity and effect size distribution across functional annotation categories

Mapping Intimacies ◽

10.1101/772202 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alexey A. Shadrin ◽

Oleksandr Frei ◽

Olav B. Smeland ◽

Francesco Bettella ◽

Kevin S. O’Connell ◽

...

Keyword(s):

Functional Annotation ◽

Association Studies ◽

Mixture Modeling ◽

Effect Sizes ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Complex Phenotypes ◽

Genome Wide ◽

Complex Human Traits ◽

Effect Size Distribution

Abstract Determining the contribution of functional genetic categories is fundamental to understanding the genetic etiology of complex human traits and diseases. Here we present Annotation Informed MiXeR: a likelihood-based method to estimate the number of variants influencing a phenotype and their effect sizes across different functional annotation categories of the genome using summary statistics from genome-wide association studies. Applying the model to 11 complex phenotypes suggests diverse patterns of functional category-specific genetic architectures across human diseases and traits.

Estimating Effect Sizes and Expected Replication Probabilities from GWAS Summary Statistics

10.1101/032474 ◽

2015 ◽

Author(s):

Dominic Holland ◽

Yunpeng Wang ◽

Wesley K Thompson ◽

Andrew Schork ◽

Chi-Hua Chen ◽

...

Keyword(s):

Association Studies ◽

Significant Snps ◽

Effect Sizes ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Sample Sizes ◽

Genetic Components ◽

Complex Phenotypes ◽

Genome Wide ◽

Z Scores

Genome-wide Association Studies (GWAS) result in millions of summary statistics ("z-scores") for single nucleotide polymorphism (SNP) associations with phenotypes. These rich datasets afford deep insights into the nature and extent of genetic contributions to complex phenotypes such as psychiatric disorders, which are understood to have substantial genetic components that arise from very large numbers of SNPs. The complexity of the datasets, however, poses a significant challenge to maximizing their utility. This is reflected in a need to better understand the landscape of z-scores, as such knowledge would enhance causal SNP and gene discovery, help elucidate mechanistic pathways, and inform future study design. Here we present a parsimonious methodology for modeling effect sizes and replication probabilities that does not require raw genotype data, relying only on summary statistics from GWAS substudies, and a scheme allowing for direct empirical validation. We show that modeling z-scores as a mixture of Gaussians is conceptually appropriate, in particular taking into account ubiquitous non-null effects that are likely in the datasets due to weak linkage disequilibrium with causal SNPs. The four-parameter model allows for estimating the degree of polygenicity of the phenotype (the proportion of SNPs, after uniform pruning so that large LD blocks are not over-represented, likely to be in strong LD with causal/mechanistically associated SNPs) and predicting the proportion of chip heritability explainable by genome-wide significant SNPs in future studies with larger sample sizes. We apply the model to recent GWAS of schizophrenia (N=82,315) and additionally, for purposes of illustration, putamen volume (N=12,596), with approximately 9.3 million SNP z-scores in both cases.
We show that, over a broad range of z-scores and sample sizes, the model accurately predicts expectation estimates of true effect sizes and replication probabilities in multistage GWAS designs. We estimate the degree to which effect sizes are over-estimated when based on linear regression association coefficients. We estimate the polygenicity of schizophrenia to be 0.037 and that of putamen volume to be 0.001, while the respective sample sizes required to approach fully explaining the chip heritability are 10^6 and 10^5. The model can be extended to incorporate prior knowledge such as pleiotropy and SNP annotation. The current findings suggest that the model is applicable to a broad array of complex phenotypes and will enhance understanding of their genetic architectures.
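
To illustrate the mixture idea described in this abstract, the following minimal Python sketch evaluates a two-component, zero-mean Gaussian mixture for z-scores and the posterior probability that a given z-score belongs to the non-null component. It is not the paper's four-parameter model: the function names and the parameters `pi1`, `sigma0` and `sigma1` are hypothetical and chosen only for this example.

```python
import math


def normal_pdf(z, sd):
    """Density of a zero-mean normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (z / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(z, pi1, sigma0, sigma1):
    """Two-component mixture density for a z-score (illustrative assumption).

    Null z-scores have variance sigma0^2; non-null z-scores have the
    inflated variance sigma0^2 + sigma1^2. pi1 is the non-null proportion.
    """
    sd_nonnull = math.sqrt(sigma0 ** 2 + sigma1 ** 2)
    return (1.0 - pi1) * normal_pdf(z, sigma0) + pi1 * normal_pdf(z, sd_nonnull)


def posterior_nonnull(z, pi1, sigma0, sigma1):
    """Posterior probability that z was drawn from the non-null component."""
    sd_nonnull = math.sqrt(sigma0 ** 2 + sigma1 ** 2)
    return pi1 * normal_pdf(z, sd_nonnull) / mixture_pdf(z, pi1, sigma0, sigma1)


def log_likelihood(zs, pi1, sigma0, sigma1):
    """Log-likelihood of independent z-scores under the mixture."""
    return sum(math.log(mixture_pdf(z, pi1, sigma0, sigma1)) for z in zs)
```

In this sketch a large |z| receives a posterior close to 1 and a z-score near 0 a posterior close to 0; fitting `pi1`, `sigma0` and `sigma1` would amount to maximizing `log_likelihood`, and the independence assumption ignores LD between SNPs.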

Phenotype-specific differences in polygenicity and effect size distribution across functional annotation categories revealed by AI-MiXeR

Bioinformatics ◽

10.1093/bioinformatics/btaa568 ◽

2020 ◽

Vol 36 (18) ◽

pp. 4749-4756 ◽

Cited By ~ 2

Author(s):

Alexey A Shadrin ◽

Oleksandr Frei ◽

Olav B Smeland ◽

Francesco Bettella ◽

Kevin S O'Connell ◽

...

Keyword(s):

Functional Annotation ◽

Association Studies ◽

Effect Sizes ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Protein Coding ◽

Genome Wide ◽

Whole Exome ◽

Causal Variants ◽

Complex Human Traits

Abstract Motivation Determining the relative contributions of functional genetic categories is fundamental to understanding the genetic etiology of complex human traits and diseases. Here, we present Annotation Informed-MiXeR, a likelihood-based method for estimating the number of variants influencing a phenotype and their effect sizes across different functional annotation categories of the genome using summary statistics from genome-wide association studies. Results Extensive simulations demonstrate that the model is valid for a broad range of genetic architectures. The model suggests that complex human phenotypes substantially differ in the number of causal variants, their localization in the genome and their effect sizes. Specifically, the exons of protein-coding genes harbor more than 90% of variants influencing type 2 diabetes and inflammatory bowel disease, making them good candidates for whole-exome studies. In contrast, <10% of the causal variants for schizophrenia, bipolar disorder and attention-deficit/hyperactivity disorder are located in protein-coding exons, indicating a more substantial role of regulatory mechanisms in the pathogenesis of these disorders. Availability and implementation The software is available at: https://github.com/precimed/mixer. Supplementary information Supplementary data are available at Bioinformatics online.

Across-cohort QC analyses of genome-wide association study summary statistics from complex traits

10.1101/033787 ◽

2015 ◽

Author(s):

Guo-Bo Chen ◽

Sang Hong Lee ◽

Matthew R Robinson ◽

Maciej Trzaskowski ◽

Zhi-Xiang Zhu ◽

...

Keyword(s):

Complex Traits ◽

Statistical Power ◽

Association Studies ◽

False Negative ◽

Genome Wide Association ◽

Effect Sizes ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Unknown Sample ◽

Genome Wide

Genome-wide association studies (GWASs) have been successful in discovering replicable SNP-trait associations for many quantitative traits and common diseases in humans. Typically the effect sizes of SNP alleles are very small, and this has led to large genome-wide association meta-analyses (GWAMA) to maximize statistical power. A trend towards ever-larger GWAMA is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study we propose a new set of metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose a pair of methods for examining the concordance between demographic information and summary statistics. In method I, we use the population genetics Fst statistic to verify the genetic origin of each cohort and its geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. In method II, we conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. In addition, we propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. Finally, to quantify unknown sample overlap across all pairs of cohorts, we propose a method that uses randomly generated genetic predictors, does not require the sharing of individual-level genotype data and does not breach individual privacy.
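
As a rough illustration of how an Fst value can be computed from reported allele frequencies alone, the Python sketch below uses a simple heterozygosity-based, ratio-of-sums estimator for two cohorts. This is an assumption made for the example, not necessarily the estimator used in the paper: it weights both cohorts equally and applies no sample-size correction, and the function name `fst_two_cohorts` is hypothetical.

```python
def fst_two_cohorts(freqs_a, freqs_b):
    """Heterozygosity-based Fst between two cohorts (illustrative estimator).

    freqs_a and freqs_b hold the reported frequency of the same reference
    allele at the same SNPs, in the same order. The result is
    (H_T - H_S) / H_T with both terms summed over SNPs before dividing.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both cohorts must report the same SNPs")
    h_total = 0.0   # expected heterozygosity of the pooled population
    h_within = 0.0  # mean expected heterozygosity within cohorts
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = 0.5 * (p_a + p_b)
        h_total += 2.0 * p_bar * (1.0 - p_bar)
        h_within += 0.5 * (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b))
    if h_total == 0.0:
        return 0.0  # all SNPs monomorphic in the pooled sample
    return (h_total - h_within) / h_total
```

Cohorts with identical allele frequencies give a value of 0 and cohorts fixed for opposite alleles give 1; computing this for every pair of cohorts yields a distance-like matrix in which outlier cohorts could be looked for.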

Evaluating and improving heritability models using summary statistics

10.1101/736496 ◽

2019 ◽

Cited By ~ 1

Author(s):

Doug Speed ◽

John Holmes ◽

David J Balding

Keyword(s):

Association Studies ◽

College Education ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Promoter Regions ◽

Statistical Framework ◽

Genome Wide ◽

Using Data ◽

Complex Human Traits ◽

The Impact

Abstract There is currently much debate regarding the best way to model how heritability varies across the genome. The authors of GCTA recommend the GCTA-LDMS-I Model, the authors of LD Score Regression recommend the Baseline LD Model, while we have instead recommended the LDAK Model. Here we provide a statistical framework for assessing heritability models using summary statistics from genome-wide association studies. Using data from studies of 31 complex human traits (average sample size 136,000), we show that the Baseline LD Model is the most realistic of the existing heritability models, but that it can be improved by incorporating features from the LDAK Model. Our framework also provides a method for estimating the selection-related parameter α from summary statistics. We find strong evidence (P<1e-6) of negative genome-wide selection for traits including height, systolic blood pressure and college education, and that the impact of selection is stronger inside functional categories such as coding SNPs and promoter regions.

A comparative study of data integration methods, integrating genetic association and functional annotation summary statistics

10.1101/2020.11.25.396721 ◽

2020 ◽

Author(s):

Jianhui Gao ◽

Lei Sun

Keyword(s):

Data Integration ◽

Sample Size ◽

Complex Traits ◽

Functional Annotation ◽

Association Studies ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Uk Biobank ◽

Integration Methods ◽

Genome Wide

Abstract Power of many genome-wide association studies (GWAS) remains low despite increasing sample size, because the genetic effects for complex traits are small, the case sample size may not be large, and the variants analyzed may be rare. One direction is to integrate available functional annotation meta-scores such as CADD and Eigen to increase the power of a GWAS. Here we examine four data-integration methods, including meta-analysis, Fisher’s method, weighted p-value, and stratified FDR control, all based on summary statistics only. We focus on a robustness study, considering settings where the functional meta-score may or may not be informative, or may even be misleading. In addition to extensive simulation studies, we also apply the four methods to 945 binary outcomes in the UK Biobank data, including all 633 traits with ICD-10 codes, 28 self-reported cancers and 284 self-reported non-cancer diseases, integrating publicly available GWAS summary statistics (http://www.nealelab.is/uk-biobank/) with CADD or Eigen scores. While the observed trade-off between power and robustness is expected, our application shows some but limited utility of current functional meta-scores in terms of leading to new genome-wide significant association findings.
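
Of the four methods named in this abstract, Fisher's method is the simplest to write down. The Python sketch below combines independent p-values with the statistic -2 Σ ln p, which follows a chi-square distribution with 2k degrees of freedom under the null. How a functional meta-score such as CADD or Eigen would be turned into a p-value is not covered here, and the function names are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Survival function of a chi-square distribution with even df.

    For df = 2k the closed form is exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(statistic, 2 * len(p_values))
```

For example, `fisher_combine([gwas_p, annotation_p])` would merge a GWAS p-value with an annotation-derived one; the result is only valid if the two are independent under the null, which is one reason an uninformative or misleading score can hurt rather than help.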

T21 ESTIMATING POLYGENICITY AND EFFECT-SIZE DISTRIBUTION IN FUNCTIONAL CATEGORIES OF THE GENOME USING SUMMARY STATISTICS DATA FROM GENOME-WIDE ASSOCIATION STUDIES

European Neuropsychopharmacology ◽

10.1016/j.euroneuro.2019.08.220 ◽

2019 ◽

Vol 29 ◽

pp. S229

Author(s):

Alexey Shadrin ◽

Oleksandr Frei ◽

Olav Smeland ◽

Francesco Bettella ◽

Kevin O'Connell ◽

...

Keyword(s):

Size Distribution ◽

Effect Size ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Functional Categories ◽

Genome Wide ◽

Effect Size Distribution

CAUSALdb: a database for disease/trait causal variants identified using summary statistics of genome-wide association studies

Nucleic Acids Research ◽

10.1093/nar/gkz1026 ◽

2019 ◽

Cited By ~ 2

Author(s):

Jianhua Wang ◽

Dandan Huang ◽

Yao Zhou ◽

Hongcheng Yao ◽

Huanhuan Liu ◽

...

Keyword(s):

Fine Mapping ◽

Genetic Variants ◽

Association Studies ◽

Complex Trait ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Genome Wide ◽

Credible Sets ◽

Causal Variants

Abstract Genome-wide association studies (GWASs) have revolutionized the field of complex trait genetics over the past decade, yet for most of the significant genotype-phenotype associations the true causal variants remain unknown. Identifying and interpreting how causal genetic variants confer disease susceptibility is still a big challenge. Herein we introduce a new database, CAUSALdb, to integrate the most comprehensive GWAS summary statistics to date and identify credible sets of potential causal variants using uniformly processed fine-mapping. The database has six major features: it (i) curates 3052 high-quality, fine-mappable GWAS summary statistics across five human super-populations and 2629 unique traits; (ii) estimates causal probabilities of all genetic variants in GWAS significant loci using three state-of-the-art fine-mapping tools; (iii) maps the reported traits to a powerful ontology MeSH, making it simple for users to browse studies on the trait tree; (iv) incorporates highly interactive Manhattan and LocusZoom-like plots to allow visualization of credible sets in a single web page more efficiently; (v) enables online comparison of causal relations on variant-, gene- and trait-levels among studies with different sample sizes or populations and (vi) offers comprehensive variant annotations by integrating massive base-wise and allele-specific functional annotations. CAUSALdb is freely available at http://mulinlab.org/causaldb.

Functional annotation of risk loci identified through genome-wide association studies for prostate cancer

The Prostate ◽

10.1002/pros.21311 ◽

2010 ◽

Vol 71 (9) ◽

pp. 955-963 ◽

Cited By ~ 22

Author(s):

Yizhen Lu ◽

Zheng Zhang ◽

Hongjie Yu ◽

S. Lily Zheng ◽

William B. Isaacs ◽

...

Keyword(s):

Prostate Cancer ◽

Functional Annotation ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Challenges of Adjusting Single-Nucleotide Polymorphism Effect Sizes for Linkage Disequilibrium

Human Heredity ◽

10.1159/000513303 ◽

2021 ◽

pp. 1-11

Author(s):

Valentina Escott-Price ◽

Karl Michael Schmidt

Keyword(s):

Linkage Disequilibrium ◽

Association Studies ◽

Statistical Significance ◽

Ordinary Least Squares ◽

Effect Sizes ◽

Risk Scores ◽

Genome Wide Association Studies ◽

Single Nucleotide ◽

Genome Wide ◽

Tikhonov Regularisation

Background: Genome-wide association studies (GWAS) were successful in identifying SNPs showing association with disease, but their individual effect sizes are small and require large sample sizes to achieve statistical significance. Methods of post-GWAS analysis, including gene-based, gene-set and polygenic risk scores, combine the SNP effect sizes in an attempt to boost the power of the analyses. To avoid giving undue weight to SNPs in linkage disequilibrium (LD), the LD needs to be taken into account in these analyses. Objectives: We review methods that attempt to adjust the effect sizes (β-coefficients) of summary statistics, instead of simple LD pruning. Methods: We subject LD adjustment approaches to a mathematical analysis, recognising Tikhonov regularisation as a framework for comparison. Results: Observing the similarity of the processes involved with the more straightforward Tikhonov-regularised ordinary least squares estimate for multivariate regression coefficients, we note that current methods based on a Bayesian model for the effect sizes effectively provide an implicit choice of the regularisation parameter, which is convenient, but at the price of reduced transparency and, especially in smaller LD blocks, a risk of incomplete LD correction. Conclusions: There is no simple answer to the question of which method is best, but where interpretability of the LD adjustment is essential, as in research aiming at identifying the genomic aetiology of disorders, our study suggests that a more direct choice of mild regularisation in the correction of effect sizes may be preferable.
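
A Tikhonov-regularised adjustment of this kind can be sketched in a few lines of Python. Assuming standardised genotypes, marginal effect estimates b and an LD correlation matrix R within a block, the adjusted effects solve (R + λI) β = b, where λ is the regularisation parameter. The sketch below is a generic ridge-type illustration rather than any specific method reviewed in the paper; the function names and the small dense solver are hypothetical helpers written for the example.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def tikhonov_adjust(marginal_betas, ld_matrix, lam):
    """LD-adjusted effects: solve (R + lam * I) beta = marginal_betas."""
    n = len(marginal_betas)
    regularised = [
        [ld_matrix[i][j] + (lam if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return solve_linear(regularised, marginal_betas)
```

With `lam = 0` this is a plain inversion of the LD matrix, which is unstable when SNPs are in near-perfect LD; increasing `lam` shrinks the adjusted effects and stabilises the solution, at the cost of leaving part of the LD uncorrected.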

Better estimation of SNP heritability from summary statistics provides a new understanding of the genetic architecture of complex traits

10.1101/284976 ◽

2018 ◽

Cited By ~ 6

Author(s):

Doug Speed ◽

David J Balding

Keyword(s):

Complex Traits ◽

Genetic Architecture ◽

Large Scale ◽

Association Studies ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Confounding Bias ◽

Conserved Regions ◽

Genome Wide ◽

Variation Explained

LD Score Regression (LDSC) has been widely applied to the results of genome-wide association studies. However, its estimates of SNP heritability are derived from an unrealistic model in which each SNP is expected to contribute equal heritability. As a consequence, LDSC tends to over-estimate confounding bias, under-estimate the total phenotypic variation explained by SNPs, and provide misleading estimates of the heritability enrichment of SNP categories. Therefore, we present SumHer, software for estimating SNP heritability from summary statistics using more realistic heritability models. After demonstrating its superiority over LDSC, we apply SumHer to the results of 24 large-scale association studies (average sample size 121 000). First we show that these studies have tended to substantially over-correct for confounding, and as a result the number of genome-wide significant loci has been under-reported by about 20%. Next we estimate enrichment for 24 categories of SNPs defined by functional annotations. A previous study using LDSC reported that conserved regions were 13-fold enriched, and found a further twelve categories with above 2-fold enrichment. By contrast, our analysis using SumHer finds that conserved regions are only 1.6-fold (SD 0.06) enriched, and that no category has enrichment above 1.7-fold. SumHer provides an improved understanding of the genetic architecture of complex traits, which enables more efficient analysis of future genetic data.
