Gene- and pathway-based association tests for multiple traits with GWAS summary statistics

Mapping Intimacies ◽

10.1101/052068 ◽

2016 ◽

Author(s):

Il-Youp Kwak ◽

Wei Pan

Keyword(s):

Association Analysis ◽

Complex Traits ◽

Meta Analysis ◽

Real Data ◽

R Package ◽

Summary Statistics ◽

Multiple Traits ◽

Numerical Studies ◽

Wide Range ◽

Intermediate Traits

Abstract To identify novel genetic variants associated with complex traits and to shed new light on the underlying biology, in addition to the most popular single SNP-single trait association analysis, it would be useful to explore multiple correlated (intermediate) traits at the gene- or pathway-level by mining existing single GWAS or meta-analyzed GWAS data. For this purpose, we present an adaptive gene-based test and a pathway-based test for association analysis of multiple traits with GWAS summary statistics. The proposed tests are adaptive at both the SNP- and trait-levels; that is, they account for possibly varying association patterns (e.g. signal sparsity levels) across SNPs and traits, thus maintaining high power across a wide range of situations. Furthermore, the proposed methods are general: they can be applied to mixed types of traits, and to Z-statistics or p-values as summary statistics obtained from either a single GWAS or a meta-analysis of multiple GWAS. Our numerical studies with simulated and real data demonstrated the promising performance of the proposed methods. The methods are implemented in R package aSPU, freely and publicly available on CRAN at: https://cran.r-project.org/web/packages/aSPU/.
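The abstract mentions that either Z-statistics or p-values can serve as input and that the tests adapt to signal sparsity. The following is a minimal Python sketch of those two ideas, not the aSPU package's API: the function names are hypothetical, and the statistic shown is the sum-of-powered-scores form on which tests of this family are usually built. Small powers favour dense signals and large powers favour sparse ones. The adaptive step, which picks the best power and calibrates the resulting minimum p-value by simulation under the null, is deliberately left out.

```python
import math
from statistics import NormalDist


def z_to_p(z):
    """Two-sided p-value for a standard-normal Z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_to_z(p, sign=1):
    """Recover a Z-statistic from a two-sided p-value in (0, 1].

    `sign` (+1 or -1) supplies the effect direction, which a p-value
    alone does not carry. Extremely small p may underflow to an error.
    """
    magnitude = NormalDist().inv_cdf(1.0 - p / 2.0)
    return math.copysign(magnitude, sign)


def spu(z_scores, gamma):
    """Sum of powered scores over SNPs for an integer power `gamma`.

    gamma = 1 behaves like a burden-type sum, gamma = 2 like a
    sum-of-squares test, and gamma = math.inf reduces to max |Z|.
    """
    if math.isinf(gamma):
        return max(abs(z) for z in z_scores)
    return sum(z ** gamma for z in z_scores)


# Illustrative (made-up) Z-statistics for three SNPs in one gene.
z = [1.0, -2.0, 3.0]
stats = {g: spu(z, g) for g in (1, 2, math.inf)}
```

In this toy example the three statistics differ sharply because the signs of the Z-statistics disagree, which is the situation where a single fixed power can lose power.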


An iterative approach to detect pleiotropy and perform Mendelian Randomization analysis using GWAS summary statistics

Bioinformatics ◽

10.1093/bioinformatics/btaa985 ◽

2020 ◽

Author(s):

Xiaofeng Zhu ◽

Xiaoyin Li ◽

Rong Xu ◽

Tao Wang

Keyword(s):

Complex Traits ◽

Mendelian Randomization ◽

Causal Effect ◽

Association Studies ◽

Real Data ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Causal Relationships ◽

Multiple Traits

Abstract Motivation: The overall association evidence of a genetic variant with multiple traits can be evaluated by cross-phenotype association analysis using summary statistics from genome-wide association studies. Further dissecting the association pathways from a variant to multiple traits is important to understand the biological causal relationships among complex traits. Results: Here, we introduce a flexible and computationally efficient Iterative Mendelian Randomization and Pleiotropy (IMRP) approach to simultaneously search for horizontal pleiotropic variants and estimate causal effect. Extensive simulations and real data applications suggest that IMRP has similar or better performance than existing Mendelian Randomization methods for both causal effect estimation and pleiotropic variant detection. The developed pleiotropy test is further extended to detect colocalization for multiple variants at a locus. IMRP will greatly facilitate our understanding of causal relationships underlying complex traits, in particular, when a large number of genetic instrumental variables are used for evaluating multiple traits. Availability and implementation: The software IMRP is available at https://github.com/XiaofengZhuCase/IMRP. The simulation codes can be downloaded at http://hal.case.edu/~xxz10/zhu-web/ under the link: MR Simulations software. Supplementary information: Supplementary data are available at Bioinformatics online.
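To make the idea of alternating between causal effect estimation and pleiotropy detection concrete, here is a rough Python sketch. It is not the IMRP software and does not claim to reproduce its test statistic: the inverse-variance-weighted (IVW) estimator, the residual z-test, the 1.96 cut-off, and all function names are assumptions chosen for illustration, and correlation between the exposure and outcome estimates is ignored.

```python
import math


def ivw_estimate(bx, by, se_y, keep):
    """Inverse-variance-weighted causal effect using variants in `keep`."""
    num = sum(bx[i] * by[i] / se_y[i] ** 2 for i in keep)
    den = sum(bx[i] ** 2 / se_y[i] ** 2 for i in keep)
    return num / den


def iterative_mr(bx, se_x, by, se_y, z_crit=1.96, max_iter=50):
    """Alternate between estimating the effect and flagging outlier variants.

    Returns (beta, sorted list of flagged variant indices).
    """
    n = len(bx)
    flagged = set()
    beta = float("nan")
    for _ in range(max_iter):
        keep = [i for i in range(n) if i not in flagged]
        if not keep:
            raise ValueError("every variant was flagged as pleiotropic")
        beta = ivw_estimate(bx, by, se_y, keep)
        # Re-test every variant against the current estimate, so a variant
        # flagged early on can re-enter the instrument set later.
        new_flagged = set()
        for i in range(n):
            resid = by[i] - beta * bx[i]
            sd = math.sqrt(se_y[i] ** 2 + beta ** 2 * se_x[i] ** 2)
            if abs(resid / sd) > z_crit:
                new_flagged.add(i)
        if new_flagged == flagged:
            break  # the flagged set is stable, so beta will not change
        flagged = new_flagged
    return beta, sorted(flagged)


# Made-up data: four variants follow by = 0.5 * bx, the fifth has an
# extra direct effect of 0.3 on the outcome.
bx = [0.1, 0.2, 0.3, 0.4, 0.1]
by = [0.05, 0.10, 0.15, 0.20, 0.35]
se = [0.05] * 5
beta_hat, pleiotropic = iterative_mr(bx, se, by, se)
```

With these toy inputs the first pass is biased upward by the fifth variant; once it is flagged, the estimate settles at the slope shared by the remaining four.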


Gene-based association tests using GWAS summary statistics

Bioinformatics ◽

10.1093/bioinformatics/btz172 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3701-3708 ◽

Cited By ~ 6

Author(s):

Gulnara R Svishcheva ◽

Nadezhda M Belonogova ◽

Irina V Zorkoltseva ◽

Anatoly V Kirichenko ◽

Tatiana I Axenovich

Keyword(s):

Association Analysis ◽

Association Studies ◽

R Package ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

New Material ◽

Artery Disease ◽

The Many ◽

Functional Linear Regression

Abstract Motivation: A huge number of genome-wide association studies (GWAS) summary statistics freely available in databases provide new material for gene-based association analysis aimed at identifying rare genetic variants. Only a few of the many popular gene-based methods developed for individual genotype and phenotype data are adapted for the practical use of the GWAS summary statistics as input. Results: We analytically prove and numerically illustrate that all popular powerful methods developed for gene-based association analysis of individual phenotype and genotype data can be modified to utilize GWAS summary statistics. We have modified and implemented all of the popular methods, including burden and kernel machine-based tests, multiple and functional linear regression, principal components analysis and others, in the R package sumFREGAT. Using real summary statistics for coronary artery disease, we show that the new package is able to detect genes not found by the existing packages. Availability and implementation: The R package sumFREGAT is freely and publicly available at: https://CRAN.R-project.org/package=sumFREGAT. Supplementary information: Supplementary data are available at Bioinformatics online.


A Powerful Procedure for Pathway-based Meta-Analysis Using Summary Statistics Identifies 43 Pathways Associated with Type II Diabetes in European Populations

10.1101/041244 ◽

2016 ◽

Author(s):

Han Zhang ◽

William Wheeler ◽

Paula L Hyland ◽

Yifan Yang ◽

Jianxin Shi ◽

...

Keyword(s):

Pathway Analysis ◽

Complex Traits ◽

Meta Analysis ◽

Genetic Data ◽

European Ancestry ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Individual Level ◽

Testing Procedures ◽

European Populations

Abstract Meta-analysis of multiple genome-wide association studies (GWAS) has become an effective approach for detecting single nucleotide polymorphism (SNP) associations with complex traits. However, it is difficult to integrate the readily accessible SNP-level summary statistics from a meta-analysis into more powerful multi-marker testing procedures, which generally require individual-level genetic data. We developed a general procedure called Summary based Adaptive Rank Truncated Product (sARTP) for conducting gene and pathway meta-analysis that uses only SNP-level summary statistics in combination with genotype correlation estimated from a panel of individual-level genetic data. We demonstrated the validity and power advantage of sARTP through empirical and simulated data. We conducted a comprehensive pathway-based meta-analysis with sARTP on type 2 diabetes (T2D) by integrating SNP-level summary statistics from two large studies consisting of 19,809 T2D cases and 111,181 controls with European ancestry. Among 4,713 candidate pathways from which genes in neighborhoods of 170 GWAS established T2D loci were excluded, we detected 43 T2D globally significant pathways (with Bonferroni corrected p-values < 0.05), which included the insulin signaling pathway and T2D pathway defined by KEGG, as well as the pathways defined according to specific gene expression patterns on pancreatic adenocarcinoma, hepatocellular carcinoma, and bladder carcinoma. Using summary data from 8 eastern Asian T2D GWAS with 6,952 cases and 11,865 controls, we showed 7 out of the 43 pathways identified in European populations remained significant in eastern Asians at the false discovery rate of 0.1. We created an R package and a web-based tool for sARTP with the capability to analyze pathways with thousands of genes and tens of thousands of SNPs.

Author Summary As GWAS continue to grow in sample size, it is evident that these studies need to be utilized more effectively for detecting individual susceptibility variants, and more importantly to provide insight into global genetic architecture of complex traits. Towards this goal, identifying association with respect to a collection of variants in biological pathways can be particularly insightful for understanding how networks of genes might be affecting pathophysiology of diseases. Here we present a new pathway analysis procedure that can be conducted using summary-level association statistics, which have become the main vehicle for performing meta-analysis of individual genetic variants across studies in large consortia. Through simulation studies we showed the proposed method was more powerful than the existing state-of-the-art method. We carried out a comprehensive pathway analysis of 4,713 candidate pathways on their association with T2D using two large studies with European ancestry and identified 43 T2D-associated pathways. Further examinations of those 43 pathways in 8 Asian studies showed that some pathways were trans-ethnically associated with T2D. This analysis clearly highlights novel T2D-associated pathways beyond what has been known from single-variant association analysis reported from the largest GWAS to date.
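The abstract's significance criterion (Bonferroni corrected p-values < 0.05 across 4,713 candidate pathways) can be illustrated with a short Python sketch. The function name and the example p-values are hypothetical; only the pathway count and the 0.05 level come from the text, and this is not code from the sARTP package.

```python
N_PATHWAYS = 4713  # number of candidate pathways stated in the abstract
ALPHA = 0.05       # family-wise significance level stated in the abstract


def bonferroni_adjust(p_values, n_tests):
    """Multiply each raw p-value by the number of tests, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# Equivalent view: a raw p-value must fall below this per-test threshold.
per_test_threshold = ALPHA / N_PATHWAYS

# Made-up raw pathway p-values, for illustration only.
raw = [1e-6, 1e-4, 0.5]
adjusted = bonferroni_adjust(raw, N_PATHWAYS)
significant = [p_adj < ALPHA for p_adj in adjusted]
```

Under this correction a pathway needs a raw p-value of roughly 1.06e-5 or smaller to be declared globally significant.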


Comparison of methods for estimating genetic correlation between complex traits using GWAS summary statistics

10.1101/2020.10.12.336867 ◽

2020 ◽

Cited By ~ 1

Author(s):

Yiliang Zhang ◽

Youshu Cheng ◽

Wei Jiang ◽

Yixuan Ye ◽

Qiongshi Lu ◽

...

Keyword(s):

Genetic Correlation ◽

Complex Traits ◽

Association Studies ◽

Genetic Correlations ◽

Real Data ◽

Estimation Methods ◽

Easy Access ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Correlation Estimation

Abstract Genetic correlation is the correlation of additive genetic effects on two phenotypes. It is an informative metric to quantify the overall genetic similarity between complex traits, which provides insights into their polygenic genetic architecture. Several methods have been proposed to estimate genetic correlations based on data collected from genome-wide association studies (GWAS). Due to the easy access of GWAS summary statistics and computational efficiency, methods only requiring GWAS summary statistics as input have become more popular than methods utilizing individual-level genotype data. Here, we present a benchmark study for different summary-statistics-based genetic correlation estimation methods through simulation and real data applications. We focus on two major technical challenges in estimating genetic correlation: marker dependency caused by linkage disequilibrium (LD) and sample overlap between different studies. To assess the performance of different methods in the presence of these two challenges, we first conducted comprehensive simulations with diverse LD patterns and sample overlaps. Then we applied these methods to real GWAS summary statistics for a wide spectrum of complex traits. Based on these experiments, we conclude that methods relying on accurate LD estimation are less robust in real data applications compared to other methods due to the imprecision of LD obtained from reference panels. Our findings offer guidance on how to appropriately choose the method for genetic correlation estimation in post-GWAS analysis and interpretation.
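The definition in the first sentence can be written as a one-line formula: for standardized phenotypes, the genetic correlation is the genetic covariance divided by the geometric mean of the two heritabilities. The Python sketch below shows that arithmetic only. The function name and input values are hypothetical, and it says nothing about how any benchmarked method estimates the inputs from summary statistics.

```python
import math


def genetic_correlation(genetic_cov, h2_trait1, h2_trait2):
    """r_g = cov_g / sqrt(h2_1 * h2_2) for standardized phenotypes."""
    if h2_trait1 <= 0 or h2_trait2 <= 0:
        raise ValueError("heritabilities must be positive")
    return genetic_cov / math.sqrt(h2_trait1 * h2_trait2)


# Made-up inputs: genetic covariance 0.1, heritabilities 0.25 and 0.16.
r_g = genetic_correlation(0.1, 0.25, 0.16)
```

Because estimated heritabilities can be close to zero or even negative in practice, an estimator built on this ratio can be unstable, which is one reason the choice of method matters.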


Super RaSE: Super Random Subspace Ensemble Classification

10.20944/preprints202110.0042.v1 ◽

2021 ◽

Author(s):

Jianan Zhu ◽

Yang Feng

Keyword(s):

Real Data ◽

Classification Problem ◽

R Package ◽

Ensemble Classification ◽

Random Subspace ◽

Base Classifier ◽

Wide Range ◽

Flexible Framework ◽

Random Subspace Ensemble ◽

Sparse Classification

We propose a new ensemble classification algorithm, named Super Random Subspace Ensemble (Super RaSE), to tackle the sparse classification problem. The proposed algorithm is motivated by the Random Subspace Ensemble algorithm (RaSE). The RaSE method was shown to be a flexible framework that can be coupled with any existing base classification. However, the success of RaSE largely depends on the proper choice of the base classifier, which is unfortunately unknown to us. In this work, we show that Super RaSE avoids the need to choose a base classifier by randomly sampling a collection of classifiers together with the subspace. As a result, Super RaSE is more flexible and robust than RaSE. In addition to the vanilla Super RaSE, we also develop the iterative Super RaSE, which adaptively changes the base classifier distribution as well as the subspace distribution. We show the Super RaSE algorithm and its iterative version perform competitively for a wide range of simulated datasets and two real data examples. The new Super RaSE algorithm and its iterative version are implemented in a new version of the R package RaSEn.


MungeSumstats: A Bioconductor package for the standardisation and quality control of many GWAS summary statistics

Bioinformatics ◽

10.1093/bioinformatics/btab665 ◽

2021 ◽

Author(s):

Alan E Murphy ◽

Brian M Schilder ◽

Nathan G Skene

Keyword(s):

Quality Control ◽

Association Studies ◽

Meta Analysis ◽

Genetic Research ◽

Secondary Analysis ◽

R Package ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Summary Statistic

Abstract Motivation: Genome-wide association studies (GWAS) summary statistics have popularised and accelerated genetic research. However, a lack of standardisation of the file formats used has proven problematic when running secondary analysis tools or performing meta-analysis studies. Results: To address this issue, we have developed MungeSumstats, a Bioconductor R package for the standardisation and quality control of GWAS summary statistics. MungeSumstats can handle the most common summary statistic formats, including variant call format (VCF), producing a reformatted, standardised, tabular summary statistic file, VCF or R native data object. Availability: MungeSumstats is available on Bioconductor (v 3.13) and can also be found on Github at: https://neurogenomics.github.io/MungeSumstats Supplementary information: The analysis deriving the most common summary statistic formats is available at: https://al-murphy.github.io/SumstatFormats
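The file-format problem described above often starts with column headers that differ between studies. A minimal Python sketch of header standardisation follows. It is not MungeSumstats code: the alias table, the target column names, and the function name are all assumptions made up for illustration, and a real tool would also check alleles, coordinates, and value ranges.

```python
# Hypothetical alias table: standard name -> lower-case spellings seen in files.
ALIASES = {
    "SNP": {"snp", "rsid", "markername", "variant_id"},
    "CHR": {"chr", "chrom", "chromosome"},
    "BP": {"bp", "pos", "position", "base_pair_location"},
    "P": {"p", "pval", "p_value", "pvalue"},
}

# Invert the table once so each alias maps directly to its standard name.
LOOKUP = {alias: std for std, names in ALIASES.items() for alias in names}


def standardise_header(columns):
    """Map known header aliases to standard names; leave others unchanged."""
    return [LOOKUP.get(col.strip().lower(), col) for col in columns]


header = standardise_header(["rsID", "Chromosome", "POS", "Pval", "N"])
```

Columns that are not in the alias table, such as `N` here, pass through untouched so that no information is silently dropped.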


Improved imputation of summary statistics for admixed populations

10.1101/203927 ◽

2017 ◽

Cited By ~ 3

Author(s):

Sina Rüeger ◽

Aaron McDaid ◽

Zoltán Kutalik

Keyword(s):

Meta Analysis ◽

Country Of Origin ◽

Real Data ◽

Genotype Imputation ◽

Reference Panel ◽

Summary Statistics ◽

Panel Size ◽

Noticeable Improvement

Abstract Motivation: Summary statistics imputation can be used to infer association summary statistics of an already conducted, genotype-based meta-analysis to higher genomic resolution. This is typically needed when genotype imputation is not feasible for some cohorts. Oftentimes, cohorts of such a meta-analysis are variable in terms of (country of) origin or ancestry. This violates the assumption of current methods that an external LD matrix and the covariance of the Z-statistics are identical. Results: To address this issue, we present variance matching, an extension to the existing summary statistics imputation method, which manipulates the LD matrix needed for summary statistics imputation. Based on simulations using real data we find that accounting for ancestry admixture yields noticeable improvement only when the total reference panel size is > 1000. We show that for population specific variants this effect is more pronounced with increasing FST.


SSP: An R package to estimate sampling effort in studies of ecological communities

10.1101/2020.03.19.996991 ◽

2020 ◽

Author(s):

Edlin J. Guerra-Castro ◽

Juan Carlos Cajas ◽

Nuno Simões ◽

Juan J Cruz-Motta ◽

Maite Mascaró

Keyword(s):

Simulated Data ◽

Real Data ◽

R Package ◽

Sampling Effort ◽

Ecological Communities ◽

Ecological Data ◽

Data Set ◽

Pilot Studies ◽

Ecological Features ◽

Wide Range

Abstract SSP (simulation-based sampling protocol) is an R package that uses simulation of ecological data and dissimilarity-based multivariate standard error (MultSE) as an estimator of precision to evaluate the adequacy of different sampling efforts for studies that will test hypotheses using permutational multivariate analysis of variance. The procedure consists of simulating several extensive data matrices that mimic some of the relevant ecological features of the community of interest using a pilot data set. For each simulated data set, several sampling efforts are repeatedly executed and MultSE is calculated. The mean value and the 0.025 and 0.975 quantiles of MultSE for each sampling effort across all simulated data are then estimated and standardized relative to the lowest sampling effort. The optimal sampling effort is identified as that at which a further increase in sampling effort does not improve the precision beyond a threshold value (e.g. 2.5%). The performance of SSP was validated using real data, and in all examples the simulated data mimicked the real data well, allowing evaluation of the relationship between MultSE and n beyond the sample size of the pilot studies. SSP can be used to estimate sample size in a wide range of situations, ranging from simple (e.g. single site) to more complex (e.g. several sites for different habitats) experimental designs. The latter constitutes an important advantage, since it offers new possibilities for complex sampling designs, as has been advised for multi-scale studies in ecology.
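The precision measure at the centre of this procedure can be sketched in a few lines of Python. The sketch uses the usual dissimilarity-based definition, MultSE = sqrt(V / n), where V is a pseudo-variance built from pairwise squared dissimilarities. It is not code from the SSP package: the function name is hypothetical, and Euclidean distance is an assumption made to keep the example self-contained, whereas ecological analyses would typically use a dissimilarity such as Bray-Curtis.

```python
import math
from itertools import combinations


def mult_se(samples):
    """Dissimilarity-based multivariate standard error for one group.

    `samples` is a list of equal-length numeric tuples (one per sampling
    unit). Euclidean distance is assumed here for illustration.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two sampling units are required")
    # Sum of squared pairwise dissimilarities, divided by n.
    ss = sum(math.dist(a, b) ** 2 for a, b in combinations(samples, 2)) / n
    pseudo_variance = ss / (n - 1)
    return math.sqrt(pseudo_variance / n)


# With one variable and Euclidean distance this reduces to the ordinary
# standard error of the mean: values 0 and 2 have variance 2, so SE = 1.
se_1d = mult_se([(0.0,), (2.0,)])
se_2d = mult_se([(0.0, 0.0), (3.0, 4.0)])
```

Repeating this calculation over many simulated data sets and sampling efforts would give the distribution of MultSE that the abstract summarizes by its mean and quantiles.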


LDpred2: better, faster, stronger

10.1101/2020.04.28.066720 ◽

2020 ◽

Cited By ~ 3

Author(s):

Florian Privé ◽

Julyan Arbel ◽

Bjarni J. Vilhjálmsson

Keyword(s):

Human Genetics ◽

Predictive Accuracy ◽

Predictive Performance ◽

Real Data ◽

R Package ◽

Summary Statistics ◽

Genetics Research ◽

Genome Wide ◽

Polygenic Scores ◽

Central Tool

Abstract Polygenic scores have become a central tool in human genetics research. LDpred is a popular method for deriving polygenic scores based on summary statistics and a matrix of correlation between genetic variants. However, LDpred has limitations that may reduce its predictive performance. Here we present LDpred2, a new version of LDpred that addresses these issues. We also provide two new options in LDpred2: a “sparse” option that can learn effects that are exactly 0, and an “auto” option that directly learns the two LDpred parameters from data. We benchmark predictive performance of LDpred2 against the previous version on simulated and real data, demonstrating substantial improvements in robustness and predictive accuracy compared to LDpred1. We then show that LDpred2 also outperforms other recently developed polygenic score methods, with a mean AUC over the 8 real traits analyzed here of 65.1%, compared to 63.8% for lassosum, 62.9% for PRS-CS and 61.5% for SBayesR. Note that, in contrast to what was recommended in the first version of this paper, we now recommend running LDpred2 genome-wide instead of per chromosome. LDpred2 is implemented in R package bigsnpr.


Primo: integration of multiple GWAS and omics QTL summary statistics for elucidation of molecular mechanisms of trait-associated SNPs and detection of pleiotropy in complex traits

Genome Biology ◽

10.1186/s13059-020-02125-w ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Kevin J. Gleason ◽

Fan Yang ◽

Brandon L. Pierce ◽

Xin He ◽

Lin S. Chen

Keyword(s):

Linkage Disequilibrium ◽

Association Analysis ◽

Complex Traits ◽

Molecular Mechanisms ◽

Integrative Analysis ◽

Pleiotropic Effects ◽

Summary Statistics ◽

Susceptibility Loci ◽

Conditional Association ◽

Study Heterogeneity

Abstract To provide a comprehensive mechanistic interpretation of how known trait-associated SNPs affect complex traits, we propose a method, Primo, for integrative analysis of GWAS summary statistics with multiple sets of omics QTL summary statistics from different cellular conditions or studies. Primo examines association patterns of SNPs to complex and omics traits. In gene regions harboring known susceptibility loci, Primo performs conditional association analysis to account for linkage disequilibrium. Primo allows for unknown study heterogeneity and sample correlations. We show two applications using Primo to examine the molecular mechanisms of known susceptibility loci and to detect and interpret pleiotropic effects.
