scholarly journals Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts

2019 ◽  
Author(s):  
Wei Zhou ◽  
Zhangchen Zhao ◽  
Jonas B. Nielsen ◽  
Lars G. Fritsche ◽  
Jonathon LeFaive ◽  
...  

AbstractWith very large sample sizes, population-based cohorts and biobanks provide an exciting opportunity to identify genetic components of complex traits. To analyze rare variants, gene or region-based multiple variant aggregate tests are commonly used to increase association test power. However, due to the substantial computation cost, existing region-based rare variant tests cannot analyze hundreds of thousands of samples while accounting for confounders, such as population stratification and sample relatedness. Here we propose a scalable generalized mixed model region-based association test that can handle large sample sizes and accounts for unbalanced case-control ratios for binary traits. This method, SAIGE-GENE, utilizes state-of-the-art optimization strategies to reduce computational and memory cost, and hence is applicable to exome-wide and genome-wide region-based analysis for hundreds of thousands of samples. Through the analysis of the HUNT study of 69,716 Norwegian samples and the UK Biobank data of 408,910 White British samples, we show that SAIGE-GENE can efficiently analyze large sample data (N > 400,000) with type I error rates well controlled.

Author(s):  
Yang Hai ◽  
Yalu Wen

Abstract Motivation Accurate disease risk prediction is essential for precision medicine. Existing models either assume that diseases are caused by groups of predictors with small-to-moderate effects or a few isolated predictors with large effects. Their performance can be sensitive to the underlying disease mechanisms, which are usually unknown in advance. Results We developed a Bayesian linear mixed model (BLMM), where genetic effects were modelled using a hybrid of the sparsity regression and linear mixed model with multiple random effects. The parameters in BLMM were inferred through a computationally efficient variational Bayes algorithm. The proposed method can resemble the shape of the true effect size distributions, captures the predictive effects from both common and rare variants, and is robust against various disease models. Through extensive simulations and the application to a whole-genome sequencing dataset obtained from the Alzheimer’s Disease Neuroimaging Initiatives, we have demonstrated that BLMM has better prediction performance than existing methods and can detect variables and/or genetic regions that are predictive. Availability The R-package is available at https://github.com/yhai943/BLMM Supplementary information Supplementary data are available at Bioinformatics online.


2017 ◽  
Author(s):  
Wei Zhou ◽  
Jonas B. Nielsen ◽  
Lars G. Fritsche ◽  
Rounak Dey ◽  
Maiken E. Gabrielsen ◽  
...  

AbstractIn genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, linear mixed model and the recently proposed logistic mixed model, perform poorly – producing large type I error rates – in the analysis of phenotypes with unbalanced case-control ratios. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation (SPA) to calibrate the distribution of score test statistics. This method, SAIGE, provides accurate p-values even when case-control ratios are extremely unbalanced. It utilizes state-of-art optimization strategies to reduce computational time and memory cost of generalized mixed model. The computation cost linearly depends on sample size, and hence can be applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 white British European-ancestry samples for >1400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.


2021 ◽  
Author(s):  
Jian Yang ◽  
Longda Jiang ◽  
Zhili Zheng

Abstract Compared to linear mixed model-based genome-wide association (GWA) methods, generalized linear mixed model (GLMM)-based methods have better statistical properties when applied to binary traits but are computationally much slower. Here, leveraging efficient sparse matrix-based algorithms, we developed a GLMM-based GWA tool (called fastGWA-GLMM) that is orders of magnitude faster than the state-of-the-art tool (e.g., ~37 times faster when n=400,000) with more scalable memory usage. We show by simulation that the fastGWA-GLMM test-statistics of both common and rare variants are well-calibrated under the null, even for traits with an extreme case-control ratio (e.g., 0.1%). We applied fastGWA-GLMM to the UK Biobank data of 456,348 individuals, 11,842,647 variants and 2,989 binary traits (full summary statistics available at http://fastgwa.info/ukbimpbin) and identified 259 rare variants associated with 75 traits, demonstrating the use of imputed genotype data in a large cohort to discover rare variants for binary complex traits.


2019 ◽  
Author(s):  
Elise Lim ◽  
Han Chen ◽  
Josée Dupuis ◽  
Ching-Ti Liu

AbstractAdvanced technology in whole-genome sequencing has offered the opportunity to comprehensively investigate the genetic contribution, particularly rare variants, to complex traits. Many rare variants analysis methods have been developed to jointly model the marginal effect but methods to detect gene-environment (GE) interactions are underdeveloped. Identifying the modification effects of environmental factors on genetic risk poses a considerable challenge. To tackle this challenge, we develop a unified method to detect GE interactions of a set of rare variants using generalized linear mixed effect model. The proposed method can accommodate both binary and continuous traits in related or unrelated samples. Under this model, genetic main effects, sample relatedness and GE interactions are modeled as random effects. We adopt a kernel-based method to leverage the joint information across rare variants and implement variance component score tests to reduce the computational burden. Our simulation study shows that the proposed method maintains correct type I error rates and high power under various scenarios, such as differing the direction of main genotype and GE interaction effects and the proportion of causal variants in the model for both continuous and binary traits. We illustrate our method to test gene-based interaction with smoking on body mass index or overweight status in the Framingham Heart Study and replicate theCHRNB4gene association reported in previous large consortium meta-analysis of single nucleotide polymorphism (SNP)-smoking interaction. Our proposed set-based GE test is computationally efficient and is applicable to both binary and continuous phenotypes, while appropriately accounting for familial or cryptic relatedness.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jisu Shin ◽  
Sang Hong Lee

AbstractGenetic variation in response to the environment, that is, genotype-by-environment interaction (GxE), is fundamental in the biology of complex traits and diseases. However, existing methods are computationally demanding and infeasible to handle biobank-scale data. Here, we introduce GxEsum, a method for estimating the phenotypic variance explained by genome-wide GxE based on GWAS summary statistics. Through comprehensive simulations and analysis of UK Biobank with 288,837 individuals, we show that GxEsum can handle a large-scale biobank dataset with controlled type I error rates and unbiased GxE estimates, and its computational efficiency can be hundreds of times higher than existing GxE methods.


2019 ◽  
Author(s):  
Jan A. Freudenthal ◽  
Markus J. Ankenbrand ◽  
Dominik G. Grimm ◽  
Arthur Korte

AbstractMotivationGenome-wide association studies (GWAS) are one of the most commonly used methods to detect associations between complex traits and genomic polymorphisms. As both genotyping and phenotyping of large populations has become easier, typical modern GWAS have to cope with massive amounts of data. Thus, the computational demand for these analyses grew remarkably during the last decades. This is especially true, if one wants to implement permutation-based significance thresholds, instead of using the naïve Bonferroni threshold. Permutation-based methods have the advantage to provide an adjusted multiple hypothesis correction threshold that takes the underlying phenotypic distribution into account and will thus remove the need to find the correct transformation for non Gaussian phenotypes. To enable efficient analyses of large datasets and the possibility to compute permutation-based significance thresholds, we used the machine learning framework TensorFlow to develop a linear mixed model (GWAS-Flow) that can make use of the available CPU or GPU infrastructure to decrease the time of the analyses especially for large datasets.ResultsWe were able to show that our application GWAS-Flow outperforms custom GWAS scripts in terms of speed without loosing accuracy. Apart from p-values, GWAS-Flow also computes summary statistics, such as the effect size and its standard error for each individual marker. The CPU-based version is the default choice for small data, while the GPU-based version of GWAS-Flow is especially suited for the analyses of big data.AvailabilityGWAS-Flow is freely available on GitHub (https://github.com/Joyvalley/GWAS_Flow) and is released under the terms of the MIT-License.


2019 ◽  
Vol 21 (3) ◽  
pp. 851-862 ◽  
Author(s):  
Charalampos Papachristou ◽  
Swati Biswas

Abstract Dissecting the genetic mechanism underlying a complex disease hinges on discovering gene–environment interactions (GXE). However, detecting GXE is a challenging problem especially when the genetic variants under study are rare. Haplotype-based tests have several advantages over the so-called collapsing tests for detecting rare variants as highlighted in recent literature. Thus, it is of practical interest to compare haplotype-based tests for detecting GXE including the recent ones developed specifically for rare haplotypes. We compare the following methods: haplo.glm, hapassoc, HapReg, Bayesian hierarchical generalized linear model (BhGLM) and logistic Bayesian LASSO (LBL). We simulate data under different types of association scenarios and levels of gene–environment dependence. We find that when the type I error rates are controlled to be the same for all methods, LBL is the most powerful method for detecting GXE. We applied the methods to a lung cancer data set, in particular, in region 15q25.1 as it has been suggested in the literature that it interacts with smoking to affect the lung cancer susceptibility and that it is associated with smoking behavior. LBL and BhGLM were able to detect a rare haplotype–smoking interaction in this region. We also analyzed the sequence data from the Dallas Heart Study, a population-based multi-ethnic study. Specifically, we considered haplotype blocks in the gene ANGPTL4 for association with trait serum triglyceride and used ethnicity as a covariate. Only LBL found interactions of haplotypes with race (Hispanic). Thus, in general, LBL seems to be the best method for detecting GXE among the ones we studied here. Nonetheless, it requires the most computation time.


Biostatistics ◽  
2019 ◽  
Author(s):  
Jingchunzi Shi ◽  
Michael Boehnke ◽  
Seunggeun Lee

Summary Trans-ethnic meta-analysis is a powerful tool for detecting novel loci in genetic association studies. However, in the presence of heterogeneity among different populations, existing gene-/region-based rare variants meta-analysis methods may be unsatisfactory because they do not consider genetic similarity or dissimilarity among different populations. In response, we propose a score test under the modified random effects model for gene-/region-based rare variants associations. We adapt the kernel regression framework to construct the model and incorporate genetic similarities across populations into modeling the heterogeneity structure of the genetic effect coefficients. We use a resampling-based copula method to approximate asymptotic distribution of the test statistic, enabling efficient estimation of p-values. Simulation studies show that our proposed method controls type I error rates and increases power over existing approaches in the presence of heterogeneity. We illustrate our method by analyzing T2D-GENES consortium exome sequence data to explore rare variant associations with several traits.


1980 ◽  
Vol 5 (3) ◽  
pp. 269-287 ◽  
Author(s):  
Scott E. Maxwell

Five methods of performing pairwise multiple comparisons in repeated measures designs were investigated. Tukey's Wholly Significant Difference (WSD) test, recommended by most experimental design texts, requires that all differences between pairs of means have a common variance. However, this assumption is equivalent to the sphericity condition that is necessary and sufficient for the validity of the mixed-model approach to the omnibus test. Monte Carlo methods revealed that Tukey's WSD leads to an inflated alpha level when the sphericity assumption is not met. Consideration of both Type I and Type II error rates found in the simulated conditions for the five procedures suggests that a Bonferroni method utilizing a separate error term for each comparison should be employed.


2020 ◽  
Author(s):  
Brandon LeBeau

<p>The linear mixed model is a commonly used model for longitudinal or nested data due to its ability to account for the dependency of nested data. Researchers typically rely on the random effects to adequately account for the dependency due to correlated data, however serial correlation can also be used. If the random effect structure is misspecified (perhaps due to convergence problems), can the addition of serial correlation overcome this misspecification and allow for unbiased estimation and accurate inferences? This study explored this question with a simulation. Simulation results show that the fixed effects are unbiased, however inflation of the empirical type I error rate occurs when a random effect is missing from the model. Implications for applied researchers are discussed.</p>


Sign in / Sign up

Export Citation Format

Share Document