Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies

Mapping Intimacies ◽

10.1101/212357 ◽

2017 ◽

Cited By ~ 7

Author(s):

Wei Zhou ◽

Jonas B. Nielsen ◽

Lars G. Fritsche ◽

Rounak Dey ◽

Maiken E. Gabrielsen ◽

...

Keyword(s):

Large Scale ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Case Control ◽

Error Rates ◽

European Ancestry ◽

Computational Time ◽

Type I ◽

Genome Wide Association Studies

Abstract: In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly, producing large type I error rates, in the analysis of phenotypes with unbalanced case-control ratios. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation (SPA) to calibrate the distribution of score test statistics. This method, SAIGE, provides accurate p-values even when case-control ratios are extremely unbalanced. It utilizes state-of-the-art optimization strategies to reduce the computational time and memory cost of the generalized mixed model. The computation cost depends linearly on sample size, and hence the method is applicable to GWAS for thousands of phenotypes in large biobanks. Through the analysis of UK Biobank data of 408,961 white British European-ancestry samples for >1400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
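To illustrate why a saddlepoint approximation matters for unbalanced traits, here is a minimal Python sketch. It is not SAIGE's implementation: it assumes independent samples with known null case probabilities `mu` and omits the random effect for relatedness. The function names, the bisection bracket and the two-standard-deviation shortcut are all illustrative assumptions. It computes a one-sided tail probability for the score statistic S = sum of g_i (y_i - mu_i) with the Barndorff-Nielsen form of the saddlepoint formula and compares it with the normal approximation.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def spa_upper_tail(g, mu, s, lo=-30.0, hi=30.0, iters=100, cutoff=2.0):
    """Saddlepoint approximation to P(S > s) for S = sum_i g_i * (Y_i - mu_i),
    where the Y_i are independent Bernoulli(mu_i) under the null (an assumption)."""
    # Samples with genotype 0 contribute nothing to S, so skip them.
    pairs = [(gi, mi) for gi, mi in zip(g, mu) if gi != 0]
    mean = sum(gi * mi for gi, mi in pairs)

    def k0(t):  # cumulant generating function K(t) of S
        return sum(math.log(1.0 - mi + mi * math.exp(gi * t)) for gi, mi in pairs) - t * mean

    def k1(t):  # first derivative K'(t)
        total = 0.0
        for gi, mi in pairs:
            e = mi * math.exp(gi * t)
            total += gi * e / (1.0 - mi + e)
        return total - mean

    def k2(t):  # second derivative K''(t)
        total = 0.0
        for gi, mi in pairs:
            e = mi * math.exp(gi * t)
            total += gi * gi * (1.0 - mi) * e / (1.0 - mi + e) ** 2
        return total

    var0 = k2(0.0)
    if var0 <= 0.0:
        raise ValueError("score statistic has zero variance")
    z = s / math.sqrt(var0)
    if abs(z) < cutoff:
        # Near the mean the normal approximation is adequate (illustrative shortcut).
        return normal_upper_tail(z)
    if not (k1(lo) < s < k1(hi)):
        raise ValueError("s is outside the range covered by the bisection bracket")
    # K'(t) is increasing, so solve K'(t) = s by bisection.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if k1(mid) < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    w = math.copysign(math.sqrt(2.0 * (t * s - k0(t))), t)
    v = t * math.sqrt(k2(t))
    return normal_upper_tail(w + math.log(v / w) / w)


# Toy data: 2000 samples, 1% null case probability, 100 carriers, 6 of them cases.
n, carriers = 2000, 100
g = [1] * carriers + [0] * (n - carriers)
mu = [0.01] * n
y = [1] * 6 + [0] * (n - 6)
s = sum(gi * (yi - mi) for gi, yi, mi in zip(g, y, mu))
variance = sum(gi * gi * mi * (1.0 - mi) for gi, mi in zip(g, mu))
p_normal = normal_upper_tail(s / math.sqrt(variance))
p_spa = spa_upper_tail(g, mu, s)
print(p_normal, p_spa)
```

In this toy setting the normal approximation returns a much smaller tail probability than the saddlepoint approximation, which is the kind of miscalibration the abstract attributes to unbalanced case-control ratios.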

297 GWAS for complex models accounting for populations structure with GBLUP and ssGBLUP

Journal of Animal Science ◽

10.1093/jas/skaa278.057 ◽

2020 ◽

Vol 98 (Supplement_4) ◽

pp. 32-32

Author(s):

Juan P Steibel ◽

Ignacio Aguilar

Keyword(s):

Hypothesis Testing ◽

Large Scale ◽

Mixed Model ◽

Prediction Models ◽

Association Studies ◽

Least Square ◽

Type I ◽

Phenotypic Variance ◽

Genome Wide Association Studies ◽

Formal Hypothesis Testing

Abstract: Genomic Best Linear Unbiased Prediction (GBLUP) is the method of choice for incorporating genomic information into the genetic evaluation of livestock species. Furthermore, single step GBLUP (ssGBLUP) is adopted by many breeders’ associations and private entities managing large scale breeding programs. While prediction of breeding values remains the primary use of genomic markers in animal breeding, a secondary interest focuses on performing genome-wide association studies (GWAS). The goal of GWAS is to uncover genomic regions that harbor variants that explain a large proportion of the phenotypic variance, and thus become candidates for discovering and studying causative variants. Several methods have been proposed and successfully applied for embedding GWAS into genomic prediction models. Most methods commonly avoid formal hypothesis testing and resort to estimation of SNP effects, relying on visual inspection of graphical outputs to determine candidate regions. However, with the advent of high throughput phenomics and transcriptomics, a more formal testing approach with automatic discovery thresholds is more appealing. In this work we present the methodological details of a method for performing formal hypothesis testing for GWAS in GBLUP models. First, we present the method and its equivalencies and differences with other GWAS methods. Moreover, we demonstrate through simulation analyses that the proposed method controls type I error rate at the nominal level. Second, we demonstrate two possible computational implementations based on mixed model equations for ssGBLUP and based on the generalized least square equations (GLS). We show that ssGBLUP can deal with datasets with extremely large numbers of animals and markers and with multiple traits. GLS implementations are well suited for dealing with smaller numbers of animals with tens of thousands of phenotypes. Third, we show several useful extensions, such as: testing multiple markers at once, testing pleiotropic effects and testing association of social genetic effects.

Statistical Learning Methods Applicable to Genome-Wide Association Studies on Unbalanced Case-Control Disease Data

Genes ◽

10.3390/genes12050736 ◽

2021 ◽

Vol 12 (5) ◽

pp. 736

Author(s):

Xiaotian Dai ◽

Guifang Fu ◽

Shaofei Zhao ◽

Yifei Zeng

Keyword(s):

Type I Error ◽

Association Studies ◽

Case Control ◽

Error Rates ◽

Genome Wide Association ◽

Type I ◽

Genome Wide Association Studies ◽

Learning Approaches ◽

Genome Wide ◽

Control Disease

Although imbalance between case and control groups is prevalent in genome-wide association studies (GWAS), it is often overlooked. This imbalance is becoming more significant and urgent as the rapid growth of biobanks and electronic health records has enabled the collection of thousands of phenotypes from large cohorts, in particular for diseases with low prevalence. The unbalanced binary traits pose serious challenges to traditional statistical methods in terms of both genomic selection and disease prediction. For example, the well-established linear mixed models (LMM) yield inflated type I error rates in the presence of unbalanced case-control ratios. In this article, we review multiple statistical approaches that have been developed to overcome the inaccuracy caused by the unbalanced case-control ratio, and comment on the advantages and limitations of each approach. In addition, we explore the potential for applying several powerful and popular state-of-the-art machine-learning approaches that have not yet been applied to the GWAS field. This review paves the way for better analysis and understanding of the unbalanced case-control disease data in GWAS.

Efficient mixed model approach for large-scale genome-wide association studies of ordinal categorical phenotypes

10.1101/2020.10.09.333146 ◽

2020 ◽

Author(s):

Wenjian Bi ◽

Wei Zhou ◽

Rounak Dey ◽

Bhramar Mukherjee ◽

Joshua N Sampson ◽

...

Keyword(s):

Mixed Model ◽

Type I Error ◽

Association Studies ◽

Error Rates ◽

Genome Wide Association ◽

Alternative Methods ◽

Type I ◽

Genome Wide Association Studies ◽

Type I Error Rates ◽

Genome Wide

Abstract: In genome-wide association studies (GWAS), ordinal categorical phenotypes are widely used to measure human behaviors, satisfaction, and preferences. However, due to the lack of analysis tools, methods designed for binary and quantitative traits have often been used inappropriately to analyze categorical phenotypes, which produces inflated type I error rates or reduces power. To accurately model the dependence of an ordinal categorical phenotype on covariates, we propose an efficient mixed model association test, Proportional Odds Logistic Mixed Model (POLMM). POLMM is demonstrated to be computationally efficient when analyzing large datasets with hundreds of thousands of genetically related samples, to control type I error rates at a stringent significance level regardless of the phenotypic distribution, and to be more powerful than alternative methods. We applied POLMM to 258 ordinal categorical phenotypes on array-genotypes and imputed samples from 408,961 individuals in UK Biobank. In total, we identified 5,885 genome-wide significant variants, of which 424 variants (7.2%) are rare variants with MAF < 0.01.
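For readers unfamiliar with the proportional odds model named above, the following Python sketch shows how a single linear predictor and a set of ordered cutpoints give probabilities for each ordinal category. It covers only the fixed-effects part; the random effect for relatedness in the mixed model is omitted. The cutpoint values, the name `category_probs` and the sign convention P(Y <= j) = logistic(cutpoint_j - eta) are assumptions made for illustration, not details taken from POLMM.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(cutpoints, eta):
    """Category probabilities under a proportional odds model.

    cutpoints: increasing thresholds, one fewer than the number of categories.
    eta: linear predictor (covariate and genotype effects) for one sample.
    Assumed convention: P(Y <= j) = logistic(cutpoints[j] - eta).
    """
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for value in cumulative:
        probs.append(value - previous)  # P(Y = j) = P(Y <= j) - P(Y <= j - 1)
        previous = value
    return probs


# Hypothetical four-level phenotype with cutpoints at -1, 0 and 1.
baseline = category_probs([-1.0, 0.0, 1.0], eta=0.0)
shifted = category_probs([-1.0, 0.0, 1.0], eta=1.0)
print(baseline)
print(shifted)
```

Under this convention a larger linear predictor moves probability mass toward the higher categories while the cutpoints stay fixed, which is the sense in which the odds are proportional across categories.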

Testing for genetic associations in arbitrarily structured populations

10.1101/012682 ◽

2014 ◽

Author(s):

Minsun Song ◽

Wei Hao ◽

John D. Storey

Keyword(s):

Large Scale ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Principal Component ◽

Statistical Test ◽

Structured Populations ◽

Birth Cohort Study ◽

Genome Wide Association Studies ◽

Genetic Associations

We present a new statistical test of association between a trait and genetic markers, which we theoretically and practically prove to be robust to arbitrarily complex population structure. The statistical test involves a set of parameters that can be directly estimated from large-scale genotyping data, such as that measured in genome-wide association studies (GWAS). We also derive a new set of methodologies, called a genotype-conditional association test (GCAT), shown to provide accurate association tests in populations with complex structures, manifested in both the genetic and environmental contributions to the trait. We demonstrate the proposed method on a large simulation study and on the Northern Finland Birth Cohort study. In the Finland study, we identify several new significant loci that other methods do not detect. Our proposed framework provides a substantially different approach to the problem from existing methods, such as the linear mixed model and principal component approaches.

GWASpro: a high-performance genome-wide association analysis server

Bioinformatics ◽

10.1093/bioinformatics/bty989 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2512-2514 ◽

Cited By ~ 4

Author(s):

Bongsong Kim ◽

Xinbin Dai ◽

Wenchao Zhang ◽

Zhaohong Zhuang ◽

Darlene L Sanchez ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Linear Mixed Model ◽

Association Studies ◽

Learning Curves ◽

Experimental Designs ◽

Genome Wide Association ◽

Supplementary Information ◽

Genome Wide Association Studies ◽

Genome Wide

Abstract. Summary: We present GWASpro, a high-performance web server for the analyses of large-scale genome-wide association studies (GWAS). GWASpro was developed to provide data analyses for large-scale molecular genetic data, coupled with complex replicated experimental designs such as found in plant science investigations and to overcome the steep learning curves of existing GWAS software tools. GWASpro supports building complex design matrices, by which complex experimental designs that may include replications, treatments, locations and times, can be accounted for in the linear mixed model. GWASpro is optimized to handle GWAS data that may consist of up to 10 million markers and 10 000 samples from replicable lines or hybrids. GWASpro provides an interface that significantly reduces the learning curve for new GWAS investigators. Availability and implementation: GWASpro is freely available at https://bioinfo.noble.org/GWASPRO. Supplementary information: Supplementary data are available at Bioinformatics online.

Exome-wide association studies in general and long-lived populations identify genetic variants related to human age

10.1101/2020.07.19.188789 ◽

2020 ◽

Author(s):

Patrick Sin-Chan ◽

Nehal Gosalia ◽

Chuan Gao ◽

Cristopher V. Van Hout ◽

Bin Ye ◽

...

Keyword(s):

Exome Sequencing ◽

Large Scale ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Model Systems ◽

P Value ◽

Ashkenazi Jews ◽

Association Analyses ◽

Age Related

Summary: Aging is characterized by degeneration in cellular and organismal functions leading to increased disease susceptibility and death. Although our understanding of aging biology in model systems has increased dramatically, large-scale sequencing studies to understand human aging are only now beginning. We applied exome sequencing and association analyses (ExWAS) to identify age-related variants in 58,470 participants of the DiscovEHR cohort. Linear mixed model regression analyses of age at last encounter revealed variants in genes known to be linked with clonal hematopoiesis of indeterminate potential, which are associated with myelodysplastic syndromes, as top signals in our analysis, suggestive of age-related somatic mutation accumulation in hematopoietic cells despite patients lacking clinical diagnoses. In addition to APOE, we identified rare DISP2 rs183775254 (p = 7.40×10^−10) and ZYG11A rs74227999 (p = 2.50×10^−08) variants that were negatively associated with age in both sexes combined and in females, respectively, and which were replicated with directional consistency in two independent cohorts. Epigenetic mapping showed these variants are located within cell-type-specific enhancers, suggestive of important transcriptional regulatory functions. To discover variants associated with extreme age, we performed exome sequencing on persons of Ashkenazi Jewish descent ascertained for extensive lifespans. Case-control analyses compared 525 Ashkenazi Jewish cases (males ≥ 92 years, females ≥ 95 years) with 482 controls. Our results showed that variants in APOE (rs429358, rs6857) and TMTC2 (rs7976168) passed the Bonferroni-adjusted p-value threshold, as did several nominally associated population-specific variants. Collectively, our Age-ExWAS, the largest performed to date, confirmed and identified previously unreported candidate variants associated with human age.

GWAS-Flow: A GPU accelerated framework for efficient permutation based genome-wide association studies

10.1101/783100 ◽

2019 ◽

Cited By ~ 2

Author(s):

Jan A. Freudenthal ◽

Markus J. Ankenbrand ◽

Dominik G. Grimm ◽

Arthur Korte

Keyword(s):

Complex Traits ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Large Datasets ◽

Genome Wide Association ◽

Small Data ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Non Gaussian

Abstract. Motivation: Genome-wide association studies (GWAS) are one of the most commonly used methods to detect associations between complex traits and genomic polymorphisms. As both genotyping and phenotyping of large populations have become easier, typical modern GWAS have to cope with massive amounts of data. Thus, the computational demand for these analyses has grown remarkably during the last decades. This is especially true if one wants to implement permutation-based significance thresholds instead of using the naïve Bonferroni threshold. Permutation-based methods have the advantage of providing an adjusted multiple hypothesis correction threshold that takes the underlying phenotypic distribution into account and thus removes the need to find the correct transformation for non-Gaussian phenotypes. To enable efficient analyses of large datasets and the possibility to compute permutation-based significance thresholds, we used the machine learning framework TensorFlow to develop a linear mixed model (GWAS-Flow) that can make use of the available CPU or GPU infrastructure to decrease the time of the analyses, especially for large datasets. Results: We were able to show that our application GWAS-Flow outperforms custom GWAS scripts in terms of speed without losing accuracy. Apart from p-values, GWAS-Flow also computes summary statistics, such as the effect size and its standard error for each individual marker. The CPU-based version is the default choice for small data, while the GPU-based version of GWAS-Flow is especially suited for the analyses of big data. Availability: GWAS-Flow is freely available on GitHub (https://github.com/Joyvalley/GWAS_Flow) and is released under the terms of the MIT-License.
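The idea of a permutation-based significance threshold can be sketched in a few lines of Python. This is not GWAS-Flow code and does not use TensorFlow or a linear mixed model. As a stand-in test statistic it uses the squared Pearson correlation between each marker and the phenotype, and all names, sample sizes and the simulated data are illustrative assumptions. The phenotype is shuffled repeatedly, the largest statistic across all markers is recorded for each shuffle, and the threshold is the (1 - alpha) quantile of those maxima.

```python
import math
import random


def r_squared(x, y):
    """Squared Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def permutation_threshold(markers, phenotype, n_perm=200, alpha=0.05, seed=0):
    """Genome-wide threshold: the (1 - alpha) quantile of the per-permutation
    maximum statistic over all markers."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        maxima.append(max(r_squared(m, shuffled) for m in markers))
    maxima.sort()
    index = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return maxima[index]


# Simulated toy data (assumption): 20 markers, 60 samples, marker 0 is causal.
rng = random.Random(1)
n_samples, n_markers = 60, 20
markers = [
    [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n_samples)]
    for _ in range(n_markers)
]
phenotype = [g + rng.gauss(0.0, 0.1) for g in markers[0]]

threshold = permutation_threshold(markers, phenotype)
observed = [r_squared(m, phenotype) for m in markers]
hits = [j for j, r2 in enumerate(observed) if r2 > threshold]
print(threshold, hits)
```

Because the threshold is derived from the shuffled phenotype itself, no distributional assumption about the phenotype is needed. The cost is one full scan of all markers per permutation, which is the computational burden the abstract cites as motivation for GPU acceleration.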

Variable selection in heterogeneous datasets: A truncated-rank sparse linear mixed model with applications to genome-wide association studies

Methods ◽

10.1016/j.ymeth.2018.04.021 ◽

2018 ◽

Vol 145 ◽

pp. 2-9 ◽

Cited By ~ 1

Author(s):

Haohan Wang ◽

Bryon Aragam ◽

Eric P. Xing

Keyword(s):

Variable Selection ◽

Mixed Model ◽

Linear Mixed Model ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide ◽

Heterogeneous Datasets

Genome-Wide Association Studies Reveal Susceptibility Loci for Digital Dermatitis in Holstein Cattle

Animals ◽

10.3390/ani10112009 ◽

2020 ◽

Vol 10 (11) ◽

pp. 2009

Author(s):

Ellen Lai ◽

Alexa L. Danner ◽

Thomas R. Famula ◽

Anita M. Oberbauer

Keyword(s):

Predictive Value ◽

Mixed Model ◽

Linear Mixed Model ◽

Bos Taurus ◽

Association Studies ◽

Bayesian Regression ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Digital Dermatitis ◽

Genome Wide

Digital dermatitis (DD) causes lameness in dairy cattle. To detect the quantitative trait loci (QTL) associated with DD, genome-wide association studies (GWAS) were performed using high-density single nucleotide polymorphism (SNP) genotypes and binary case/control, quantitative (average number of FW per hoof trimming record) and recurrent (cases with ≥2 DD episodes vs. controls) phenotypes from cows across four dairies (controls n = 129 vs. FW n = 85). Linear mixed model (LMM) and random forest (RF) approaches identified the top SNPs, which were used as predictors in Bayesian regression models to assess the SNP predictive value. The LMM and RF analyses identified QTL regions containing candidate genes on Bos taurus autosome (BTA) 2 for the binary and recurrent phenotypes and BTA7 and 20 for the quantitative phenotype that related to epidermal integrity, immune function, and wound healing. Although larger sample sizes are necessary to reaffirm these small effect loci amidst a strong environmental effect, the sample cohort used in this study was sufficient for estimating SNP effects with a high predictive value.

Genetics of connective tissue diseases

10.1093/med/9780199642489.003.0042 ◽

2013 ◽

Author(s):

Myles Lewis ◽

Tim Vyse

Keyword(s):

Systemic Sclerosis ◽

Connective Tissue ◽

Large Scale ◽

Apoptotic Cell ◽

Association Studies ◽

Group Analysis ◽

Connective Tissue Diseases ◽

Susceptibility Genes ◽

Type I ◽

Genome Wide Association Studies

The advent of genome-wide association studies (GWAS) has been an exciting breakthrough in our understanding of the genetic aetiology of autoimmune diseases. Substantial overlap has been found in susceptibility genes across multiple diseases, from connective tissue diseases and rheumatoid arthritis (RA) to inflammatory bowel disease, coeliac disease, and psoriasis. Major technological advances now permit genotyping of millions of single nucleotide polymorphisms (SNPs). Group analysis of SNPs by haplotypes, aided by completion of the HapMap project, has improved our ability to pinpoint causal genetic variants. International collaboration to pool large-scale cohorts of patients has enabled GWAS in systemic lupus erythematosus (SLE), systemic sclerosis and Behçet's disease, with studies in progress for ANCA-associated vasculitis. These 'hypothesis-free' studies have revealed many novel disease-associated genes. In both SLE and systemic sclerosis, identified genes map to known pathways including antigen presentation (MHC, TNFSF4), autoreactivity of B and T lymphocytes (BLK, BANK1), type I interferon production (STAT4, IRF5) and the NF-κB pathway (TNIP1). In SLE alone, additional genes appear to be involved in dysregulated apoptotic cell clearance (ITGAM, TREX1, C1q, C4) and recognition of immune complexes (FCGR2A, FCGR3B). Future developments include whole-genome sequencing to identify rare variants, and efforts to understand functional consequences of susceptibility genes. Putative environmental triggers for connective tissue diseases include infectious agents, especially Epstein-Barr virus; cigarette smoking; occupational exposure to toxins including silica; and low vitamin D, due to its immunomodulatory effects. Despite numerous studies looking at toxin exposure and connective tissue diseases, conclusive evidence is lacking, due to either rarity of exposure or rarity of disease.
