Improved imputation of summary statistics for admixed populations

2017 ◽  
Author(s):  
Sina Rüeger ◽  
Aaron McDaid ◽  
Zoltán Kutalik

Abstract
Motivation: Summary statistics imputation can be used to infer association summary statistics of an already conducted, genotype-based meta-analysis at higher genomic resolution. This is typically needed when genotype imputation is not feasible for some cohorts. Oftentimes, the cohorts of such a meta-analysis vary in terms of (country of) origin or ancestry. This violates the assumption of current methods that an external LD matrix and the covariance of the Z-statistics are identical.
Results: To address this issue, we present variance matching, an extension to the existing summary statistics imputation method, which manipulates the LD matrix needed for summary statistics imputation. Based on simulations using real data, we find that accounting for ancestry admixture yields a noticeable improvement only when the total reference panel size is > 1000. We show that for population-specific variants this effect is more pronounced with increasing FST.
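The underlying summary statistics imputation step can be written as a conditional expectation under a multivariate normal model for Z-statistics: z_u = Σ_ut Σ_tt⁻¹ z_t, where Σ is the LD (correlation) matrix. Below is a minimal NumPy sketch of this baseline estimator; the function name and ridge regularisation are illustrative, and the variance-matching adjustment to the LD matrix proposed in the paper is not reproduced here:

```python
import numpy as np

def impute_zscores(z_typed, ld_tt, ld_ut, ridge=0.1):
    """Impute Z-statistics at untyped SNPs from typed SNPs.

    z_typed : (t,) observed Z-scores at typed SNPs
    ld_tt   : (t, t) LD (correlation) matrix among typed SNPs
    ld_ut   : (u, t) LD between untyped and typed SNPs
    ridge   : shrinkage added to the diagonal for numerical stability
    """
    t = ld_tt.shape[0]
    # Regularised inverse of the typed-SNP LD matrix
    inv_tt = np.linalg.inv(ld_tt + ridge * np.eye(t))
    # Conditional expectation of the untyped Z-scores
    z_untyped = ld_ut @ inv_tt @ z_typed
    # Per-SNP imputation quality (variance explained)
    r2 = np.einsum("ut,tv,uv->u", ld_ut, inv_tt, ld_ut)
    return z_untyped, r2
```

An untyped SNP perfectly correlated with one typed SNP simply inherits its Z-score with r² = 1.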

PLoS Genetics ◽  
2020 ◽  
Vol 16 (11) ◽  
pp. e1009049
Author(s):  
Simone Rubinacci ◽  
Olivier Delaneau ◽  
Jonathan Marchini

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method builds on the observation made in the IMPUTE2 method that accuracy is optimized by using a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows-Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ~65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to use a smaller number of reference haplotypes, thus reducing computational cost.
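The haplotype-selection step described above relies on the PBWT's positional prefix arrays: at every site, haplotypes are kept sorted by their reversed prefixes, so that haplotypes sharing long matches ending at that site sit next to each other. A minimal Python sketch of Durbin's prefix-array update (illustrative only, not IMPUTE5's implementation):

```python
def pbwt_prefix_arrays(haps):
    """Build PBWT positional prefix arrays for a binary haplotype matrix.

    haps : list of haplotypes, each a list of 0/1 alleles of equal length.
    Returns, for each site k, the ordering of haplotype indices sorted by
    their reversed prefix haps[i][:k]; adjacent entries share the longest
    matches ending at site k, which is what a PBWT-based method exploits
    when picking conditioning haplotypes.
    """
    n_sites = len(haps[0])
    order = list(range(len(haps)))   # a_0: initial ordering
    arrays = [order]
    for k in range(n_sites):
        # Stable partition by the allele at site k = the PBWT update
        zeros = [i for i in order if haps[i][k] == 0]
        ones = [i for i in order if haps[i][k] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays
```

Because the partition is stable, a single linear pass per site maintains the sorted order, which is what makes PBWT-based matching scale to millions of haplotypes.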


2018 ◽  
Author(s):  
Chris Chatzinakos ◽  
Donghyung Lee ◽  
Na Cai ◽  
Vladimir I. Vladimirov ◽  
Bradley T. Webb ◽  
...  

Abstract
Genotype imputation across populations of mixed ancestry is critical for optimal discovery in large-scale genome-wide association studies (GWAS). Methods for direct imputation of GWAS summary statistics were previously shown to be practically as accurate as summary statistics produced after raw genotype imputation, while incurring orders of magnitude lower computational burden. Given that direct imputation requires a precise estimate of linkage disequilibrium (LD), and that most methods use a small reference panel, e.g. the ~2,500 subjects of the 1000 Genomes Project, there is a great need for much larger and more diverse reference panels. To accurately estimate the LD needed for an exhaustive analysis of any cosmopolitan cohort, we developed DISTMIX2. DISTMIX2: i) uses a much larger and more diverse reference panel and ii) estimates weights of ethnic mixture based solely on Z-scores (when allele frequencies are not available). We applied DISTMIX2 to GWAS summary statistics from the Psychiatric Genomics Consortium (PGC). DISTMIX2 uncovered signals in numerous new regions, with most of these findings coming from rarer variants. Rarer variants provide a much sharper location for association signals than common variants, as LD for rare variants extends over a shorter distance than for common ones. For example, while the original PGC post-traumatic stress disorder (PTSD) study found only 3 marginal signals for common variants, we now uncover a very strong signal for a rare variant in PKN2, a gene associated with neuronal and hippocampal development. Thus, DISTMIX2 provides a robust and fast (re)imputation approach for most psychiatric GWAS studies.
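One way to picture the cosmopolitan LD idea is as a mixture of population-specific LD matrices weighted by estimated ancestry proportions. The sketch below is a deliberate simplification (function name and weighting scheme are illustrative; the actual DISTMIX2 estimator also accounts for allele-frequency differences between populations and estimates the weights from Z-scores):

```python
import numpy as np

def cosmopolitan_ld(pop_ld, weights):
    """Mixture LD estimate for an admixed cohort.

    pop_ld  : list of (m, m) LD matrices, one per reference population
    weights : ancestry mixture weights (non-negative, summing to 1)

    Returns the weighted combination of the population-specific LD
    matrices -- a simplified stand-in for a cosmopolitan LD estimate.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalise defensively
    return sum(w * ld for w, ld in zip(weights, pop_ld))
```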


2016 ◽  
Author(s):  
Il-Youp Kwak ◽  
Wei Pan

Abstract
To identify novel genetic variants associated with complex traits and to shed new insights on the underlying biology, in addition to the most popular single SNP-single trait association analysis, it would be useful to explore multiple correlated (intermediate) traits at the gene- or pathway-level by mining existing single GWAS or meta-analyzed GWAS data. For this purpose, we present an adaptive gene-based test and a pathway-based test for association analysis of multiple traits with GWAS summary statistics. The proposed tests are adaptive at both the SNP- and trait-levels; that is, they account for possibly varying association patterns (e.g. signal sparsity levels) across SNPs and traits, thus maintaining high power across a wide range of situations. Furthermore, the proposed methods are general: they can be applied to mixed types of traits, and to Z-statistics or p-values as summary statistics obtained from either a single GWAS or a meta-analysis of multiple GWAS. Our numerical studies with simulated and real data demonstrated the promising performance of the proposed methods. The methods are implemented in the R package aSPU, freely and publicly available on CRAN at https://cran.r-project.org/web/packages/aSPU/.
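The adaptive sum-of-powered-score (SPU) idea can be illustrated on a single vector of Z-scores: compute SPU(γ) = Σ_j z_j^γ for several powers γ, calibrate each against a multivariate normal null with the LD matrix as covariance, and take the minimum p-value, recalibrated against the same null draws. This is a Monte Carlo sketch only; the actual aSPU package uses more powers (including γ = ∞) and different calibration details:

```python
import numpy as np

def aspu_test(z, ld, gammas=(1, 2, 3, 4), n_null=10000, seed=0):
    """aSPU-style adaptive test on a vector of GWAS Z-scores.

    Null Z-scores are drawn from N(0, LD); SPU(gamma) statistics are
    computed per power; the adaptive p-value is the minimum per-gamma
    p-value, calibrated against the same null draws.
    """
    rng = np.random.default_rng(seed)
    null = rng.multivariate_normal(np.zeros(len(z)), ld, size=n_null)
    p_min_obs = 1.0
    null_pmin = np.ones(n_null)
    for g in gammas:
        obs = abs(np.sum(z ** g))
        null_stats = np.abs(np.sum(null ** g, axis=1))
        # Monte Carlo p-value for this power
        p_obs = (1 + np.sum(null_stats >= obs)) / (n_null + 1)
        p_min_obs = min(p_min_obs, p_obs)
        # Per-draw null p-values, for calibrating the minimum
        ranks = null_stats.argsort().argsort()
        null_pmin = np.minimum(null_pmin, (n_null - ranks) / n_null)
    return (1 + np.sum(null_pmin <= p_min_obs)) / (n_null + 1)
```

Odd powers favour signals with a consistent direction; even powers favour sparse signals of mixed sign, which is why taking the adaptive minimum keeps power across regimes.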


2018 ◽  
Author(s):  
Brian L. Browning ◽  
Ying Zhou ◽  
Sharon R. Browning

Abstract
Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1000 phased target samples, Beagle 5.0's computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1000 phased target samples at a cost of less than one US cent per sample.
Beagle 5.0 is freely available from https://faculty.washington.edu/browning/beagle/beagle.html.


2021 ◽  
Vol 12 ◽  
Author(s):  
Katharina Stahl ◽  
Damian Gola ◽  
Inke R. König

Despite the widespread use of genotype imputation tools and the availability of different approaches, recent releases of the programs in current use have not been compared comprehensively. We therefore assessed the performance of 35 combinations of phasing and imputation programs, including versions of SHAPEIT, Eagle, Beagle, minimac, PBWT, and IMPUTE, for imputation of completely missing SNPs with an HRC reference panel, with regard to quality and speed. We used a data set comprising 1,149 fully sequenced individuals from the German population, subsetting the SNPs to approximate the Illumina Infinium-Omni5 array. A total of 553,234 SNPs across two selected chromosomes were used for comparison between imputed and sequenced genotypes. We found that all tested programs, with the exception of PBWT, impute genotypes with very high accuracy (mean error rate < 0.005). PBWT hardly ever imputes the less frequent allele correctly (mean concordance for genotypes including the minor allele < 0.0002). For all programs, imputation accuracy drops for rare alleles with a frequency < 0.05. Concordance also drops with decreasing genotype probability; since overall concordance nevertheless remains high, genotypes with low probability must be rare. The mean concordance of SNPs with a genotype probability < 95% drops below 0.9, at which point disregarding imputed genotypes might prove favorable. For fast and accurate imputation, the combination of Eagle2.4.1 using a reference panel for phasing and Beagle5.1 for imputation performs best. Replacing Beagle5.1 with minimac3, minimac4, Beagle4.1, or IMPUTE4 results in a small gain in accuracy at a high cost in speed.
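The concordance comparison described above (imputed vs. sequenced genotypes, stratified by genotype probability) can be sketched in a few lines of NumPy; the function name, threshold default, and 0/1/2 genotype coding are illustrative:

```python
import numpy as np

def imputation_concordance(imputed, truth, gprob, threshold=0.95):
    """Concordance of imputed vs. sequenced genotypes (coded 0/1/2).

    imputed, truth : per-SNP genotype calls
    gprob          : per-SNP genotype probability reported by the imputer
    Returns overall concordance and concordance restricted to SNPs whose
    genotype probability falls below the threshold.
    """
    imputed, truth = np.asarray(imputed), np.asarray(truth)
    gprob = np.asarray(gprob)
    overall = np.mean(imputed == truth)
    low = gprob < threshold
    low_conc = np.mean(imputed[low] == truth[low]) if low.any() else np.nan
    return overall, low_conc
```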


2019 ◽  
Author(s):  
Simone Rubinacci ◽  
Olivier Delaneau ◽  
Jonathan Marchini

Abstract
Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods.
Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method builds on the observation made in the IMPUTE2 method that accuracy is optimized by using a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows-Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model.
Using the HRC reference panel, which has ~65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to use a smaller number of reference haplotypes, thus reducing computational cost.
Author summary
Genome-wide association studies (GWAS) typically use microarray technology to measure genotypes at several hundred thousand positions in the genome. However, reference panels of genetic variation consist of haplotype data at more than 100 fold more positions in the genome. Genotype imputation makes genotype predictions at all the reference panel sites using the GWAS data. Reference panels are continuing to grow in size, and this improves the accuracy of the predictions, but methods need to be able to scale to the increased size. We have developed a new version of the popular IMPUTE software that can handle reference panels with millions of haplotypes, with better performance than other published approaches. A notable property of the new method is that it scales sub-linearly with reference panel size: keeping the number of imputed markers constant, a 100 fold increase in reference panel size requires less than twice the computation time.


2018 ◽  
Author(s):  
CR Tench ◽  
Radu Tanasescu ◽  
CS Constantinescu ◽  
DP Auer ◽  
WJ Cottam

Abstract
Meta-analysis of published neuroimaging results is commonly performed using coordinate based meta-analysis (CBMA). Most commonly, CBMA algorithms detect spatial clustering of reported coordinates across multiple studies by assuming that results relating to the common hypothesis fall in similar anatomical locations. The null hypothesis is that studies report uncorrelated results, which is simulated by random coordinates. Multiple clusters are assumed independent, yet it is likely that the multiple results reported per study are not, and in fact represent a network effect. Here the multiple reported effect sizes (reported peak Z scores) are assumed multivariate normal, and maximum likelihood is used to estimate the parameters of the covariance matrix. The hypothesis is that the effect sizes are correlated. The parameters are covariances of effect size, considered as edges of a network, while clusters are considered as nodes. In this way, coordinate based meta-analysis of networks (CBMAN) estimates a network of reported meta-effects, rather than multiple independent effects (clusters).
CBMAN uses only the same data as CBMA, yet produces extra information in terms of the correlation between clusters. Here it is validated on numerically simulated data and demonstrated on real data used previously to demonstrate CBMA. The CBMA and CBMAN clusters are similar, despite the very different hypotheses.
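The node/edge construction can be illustrated with a complete-data simplification: treat clusters as columns of a studies-by-clusters matrix of peak Z scores, estimate their correlation, and draw an edge where clusters co-vary strongly across studies. The real CBMAN likelihood handles censored and missing entries by maximum likelihood; this sketch (threshold and names illustrative) shows only the basic idea:

```python
import numpy as np

def cluster_network(effect_sizes, corr_threshold=0.5):
    """Sketch of the CBMAN idea: clusters as nodes, correlations as edges.

    effect_sizes : (studies, clusters) matrix of reported peak Z scores
    Returns the cluster-by-cluster correlation matrix and the list of
    edges (i, j) whose absolute correlation meets the threshold.
    """
    corr = np.corrcoef(effect_sizes, rowvar=False)
    n = effect_sizes.shape[1]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if abs(corr[i, j]) >= corr_threshold]
    return corr, edges
```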


2018 ◽  
Vol 36 (Supplement 1) ◽  
pp. e94
Author(s):  
M. Kleber ◽  
L.P. Lyytikainen ◽  
G.E. Delgado ◽  
C. Drechsler ◽  
C. Wanner ◽  
...  

2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Andreas Heinecke ◽  
Marta Tallarita ◽  
Maria De Iorio

Abstract
Background: Network meta-analysis (NMA) provides a powerful tool for the simultaneous evaluation of multiple treatments by combining evidence from different studies, allowing for direct and indirect comparisons between treatments. In recent years, NMA has become increasingly popular in the medical literature, and the underlying statistical methodologies are evolving in both the frequentist and Bayesian frameworks. Traditional NMA models are often based on the comparison of two treatment arms per study. These individual studies may measure outcomes at multiple time points that are not necessarily homogeneous across studies.
Methods: In this article we present a Bayesian model based on B-splines for the simultaneous analysis of outcomes across time points that allows for indirect comparison of treatments across different longitudinal studies.
Results: We illustrate the proposed approach in simulations as well as on real data examples available in the literature, and compare it with a model based on P-splines and one based on fractional polynomials, showing that our approach is flexible and overcomes the limitations of the latter.
Conclusions: The proposed approach is computationally efficient and able to accommodate a large class of temporal treatment effect patterns, allowing for direct and indirect comparisons of widely varying shapes of longitudinal profiles.
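The B-spline building block of such a model can be made concrete with the Cox-de Boor recursion, which evaluates the basis functions that the temporal treatment-effect curves are built from; this is a generic sketch, not the paper's implementation:

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at the points x.

    Returns an array of shape (len(x), len(knots) - degree - 1); a
    longitudinal treatment-effect curve is a weighted sum of the columns.
    """
    x = np.asarray(x, float)
    t = np.asarray(knots, float)
    n_basis = len(t) - degree - 1
    # Degree-0 basis: indicator of each knot span; the last nonempty
    # span is right-closed so that x == t[-1] is covered.
    last = max(i for i in range(len(t) - 1) if t[i] < t[i + 1])
    B = np.zeros((len(x), len(t) - 1))
    for i in range(len(t) - 1):
        upper = (x <= t[i + 1]) if i == last else (x < t[i + 1])
        B[:, i] = (t[i] <= x) & upper
    # Cox-de Boor recursion up to the requested degree
    for k in range(1, degree + 1):
        Bk = np.zeros((len(x), len(t) - k - 1))
        for i in range(len(t) - k - 1):
            left = (x - t[i]) / (t[i + k] - t[i]) if t[i + k] > t[i] else 0.0
            right = ((t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1])
                     if t[i + k + 1] > t[i + 1] else 0.0)
            Bk[:, i] = left * B[:, i] + right * B[:, i + 1]
        B = Bk
    return B[:, :n_basis]
```

With clamped knots [0, 0, 0, 1, 1, 1] and degree 2, this reproduces the quadratic Bernstein basis on [0, 1].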

