Improved imputation of summary statistics for admixed populations

2017 ◽  
Author(s):  
Sina Rüeger ◽  
Aaron McDaid ◽  
Zoltán Kutalik

Abstract
Motivation: Summary statistics imputation can be used to infer association summary statistics of an already conducted, genotype-based meta-analysis at higher genomic resolution. This is typically needed when genotype imputation is not feasible for some cohorts. Oftentimes, the cohorts of such a meta-analysis vary in terms of (country of) origin or ancestry. This violates the assumption of current methods that an external LD matrix and the covariance of the Z-statistics are identical.
Results: To address this issue, we present variance matching, an extension to the existing summary statistics imputation method, which manipulates the LD matrix needed for summary statistics imputation. Based on simulations using real data, we find that accounting for ancestry admixture yields a noticeable improvement only when the total reference panel size is > 1000. We show that for population-specific variants this effect is more pronounced with increasing FST.
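The underlying summary statistics imputation step can be written as a conditional expectation under a multivariate normal model for Z-statistics: z_u = Σ_ut Σ_tt⁻¹ z_t, where Σ is the LD (correlation) matrix. Below is a minimal NumPy sketch of this baseline estimator; the function name and ridge regularisation are illustrative, and the variance-matching adjustment to the LD matrix proposed in the paper is not reproduced here:

```python
import numpy as np

def impute_zscores(z_typed, ld_tt, ld_ut, ridge=0.1):
    """Impute Z-statistics at untyped SNPs from typed SNPs.

    z_typed : (t,) observed Z-scores at typed SNPs
    ld_tt   : (t, t) LD (correlation) matrix among typed SNPs
    ld_ut   : (u, t) LD between untyped and typed SNPs
    ridge   : shrinkage added to the diagonal for numerical stability
    """
    t = ld_tt.shape[0]
    # Regularised inverse of the typed-SNP LD matrix
    inv_tt = np.linalg.inv(ld_tt + ridge * np.eye(t))
    # Conditional expectation of the untyped Z-scores
    z_untyped = ld_ut @ inv_tt @ z_typed
    # Per-SNP imputation quality (variance explained)
    r2 = np.einsum("ut,tv,uv->u", ld_ut, inv_tt, ld_ut)
    return z_untyped, r2
```

An untyped SNP perfectly correlated with one typed SNP simply inherits its Z-score with r² = 1.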

PLoS Genetics ◽  
2020 ◽  
Vol 16 (11) ◽  
pp. e1009049
Author(s):  
Simone Rubinacci ◽  
Olivier Delaneau ◽  
Jonathan Marchini

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method builds on the observation made in the IMPUTE2 method that accuracy is optimized by using a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows-Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ~65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to use a smaller number of reference haplotypes, thus reducing computational cost.
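The haplotype-selection step described above relies on the PBWT's positional prefix arrays: at every site, haplotypes are kept sorted by their reversed prefixes, so that haplotypes sharing long matches ending at that site sit next to each other. A minimal Python sketch of Durbin's prefix-array update (illustrative only, not IMPUTE5's implementation):

```python
def pbwt_prefix_arrays(haps):
    """Build PBWT positional prefix arrays for a binary haplotype matrix.

    haps : list of haplotypes, each a list of 0/1 alleles of equal length.
    Returns, for each site k, the ordering of haplotype indices sorted by
    their reversed prefix haps[i][:k]; adjacent entries share the longest
    matches ending at site k, which is what a PBWT-based method exploits
    when picking conditioning haplotypes.
    """
    n_sites = len(haps[0])
    order = list(range(len(haps)))   # a_0: initial ordering
    arrays = [order]
    for k in range(n_sites):
        # Stable partition by the allele at site k = the PBWT update
        zeros = [i for i in order if haps[i][k] == 0]
        ones = [i for i in order if haps[i][k] == 1]
        order = zeros + ones
        arrays.append(order)
    return arrays
```

Because the partition is stable, a single linear pass per site maintains the sorted order, which is what makes PBWT-based matching scale to millions of haplotypes.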


2018 ◽  
Author(s):  
Chris Chatzinakos ◽  
Donghyung Lee ◽  
Na Cai ◽  
Vladimir I. Vladimirov ◽  
Bradley T. Webb ◽  
...  

Abstract
Genotype imputation across populations of mixed ancestry is critical for optimal discovery in large-scale genome-wide association studies (GWAS). Methods for direct imputation of GWAS summary statistics were previously shown to be practically as accurate as summary statistics produced after raw genotype imputation, while incurring orders of magnitude lower computational burden. Given that direct imputation requires a precise estimate of linkage disequilibrium (LD), and that most methods use a small reference panel, e.g. the ~2,500 subjects of the 1000 Genomes Project, there is a great need for much larger and more diverse reference panels. To accurately estimate the LD needed for an exhaustive analysis of any cosmopolitan cohort, we developed DISTMIX2. DISTMIX2: i) uses a much larger and more diverse reference panel and ii) estimates weights of ethnic mixture based solely on Z-scores (when allele frequencies are not available). We applied DISTMIX2 to GWAS summary statistics from the Psychiatric Genomics Consortium (PGC). DISTMIX2 uncovered signals in numerous new regions, with most of these findings coming from rarer variants. Rarer variants provide a much sharper location for association signals than common variants, as LD for rare variants extends over a shorter distance than for common ones. For example, while the original PGC post-traumatic stress disorder (PTSD) study found only 3 marginal signals for common variants, we now uncover a very strong signal for a rare variant in PKN2, a gene associated with neuronal and hippocampal development. Thus, DISTMIX2 provides a robust and fast (re)imputation approach for most psychiatric GWAS studies.
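One way to picture the cosmopolitan LD idea is as a mixture of population-specific LD matrices weighted by estimated ancestry proportions. The sketch below is a deliberate simplification (function name and weighting scheme are illustrative; the actual DISTMIX2 estimator also accounts for allele-frequency differences between populations and estimates the weights from Z-scores):

```python
import numpy as np

def cosmopolitan_ld(pop_ld, weights):
    """Mixture LD estimate for an admixed cohort.

    pop_ld  : list of (m, m) LD matrices, one per reference population
    weights : ancestry mixture weights (non-negative, summing to 1)

    Returns the weighted combination of the population-specific LD
    matrices -- a simplified stand-in for a cosmopolitan LD estimate.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalise defensively
    return sum(w * ld for w, ld in zip(weights, pop_ld))
```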


2016 ◽  
Author(s):  
Il-Youp Kwak ◽  
Wei Pan

Abstract
To identify novel genetic variants associated with complex traits and to shed new insights on the underlying biology, in addition to the most popular single SNP-single trait association analysis, it would be useful to explore multiple correlated (intermediate) traits at the gene- or pathway-level by mining existing single GWAS or meta-analyzed GWAS data. For this purpose, we present an adaptive gene-based test and a pathway-based test for association analysis of multiple traits with GWAS summary statistics. The proposed tests are adaptive at both the SNP- and trait-levels; that is, they account for possibly varying association patterns (e.g. signal sparsity levels) across SNPs and traits, thus maintaining high power across a wide range of situations. Furthermore, the proposed methods are general: they can be applied to mixed types of traits, and to Z-statistics or p-values as summary statistics obtained from either a single GWAS or a meta-analysis of multiple GWAS. Our numerical studies with simulated and real data demonstrated the promising performance of the proposed methods. The methods are implemented in the R package aSPU, freely and publicly available on CRAN at https://cran.r-project.org/web/packages/aSPU/.
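The adaptive sum-of-powered-score (SPU) idea can be illustrated on a single vector of Z-scores: compute SPU(γ) = Σ_j z_j^γ for several powers γ, calibrate each against a multivariate normal null with the LD matrix as covariance, and take the minimum p-value, recalibrated against the same null draws. This is a Monte Carlo sketch only; the actual aSPU package uses more powers (including γ = ∞) and different calibration details:

```python
import numpy as np

def aspu_test(z, ld, gammas=(1, 2, 3, 4), n_null=10000, seed=0):
    """aSPU-style adaptive test on a vector of GWAS Z-scores.

    Null Z-scores are drawn from N(0, LD); SPU(gamma) statistics are
    computed per power; the adaptive p-value is the minimum per-gamma
    p-value, calibrated against the same null draws.
    """
    rng = np.random.default_rng(seed)
    null = rng.multivariate_normal(np.zeros(len(z)), ld, size=n_null)
    p_min_obs = 1.0
    null_pmin = np.ones(n_null)
    for g in gammas:
        obs = abs(np.sum(z ** g))
        null_stats = np.abs(np.sum(null ** g, axis=1))
        # Monte Carlo p-value for this power
        p_obs = (1 + np.sum(null_stats >= obs)) / (n_null + 1)
        p_min_obs = min(p_min_obs, p_obs)
        # Per-draw null p-values, for calibrating the minimum
        ranks = null_stats.argsort().argsort()
        null_pmin = np.minimum(null_pmin, (n_null - ranks) / n_null)
    return (1 + np.sum(null_pmin <= p_min_obs)) / (n_null + 1)
```

Odd powers favour signals with a consistent direction; even powers favour sparse signals of mixed sign, which is why taking the adaptive minimum keeps power across regimes.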


2018 ◽  
Author(s):  
Brian L. Browning ◽  
Ying Zhou ◽  
Sharon R. Browning

Abstract
Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1000 phased target samples, Beagle 5.0's computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1000 phased target samples at a cost of less than one US cent per sample.
Beagle 5.0 is freely available from https://faculty.washington.edu/browning/beagle/beagle.html.


2021 ◽  
Vol 12 ◽  
Author(s):  
Katharina Stahl ◽  
Damian Gola ◽  
Inke R. König

Despite the widespread use of genotype imputation tools and the availability of different approaches, recent releases of the programs in current use have not been compared comprehensively. We therefore assessed the performance of 35 combinations of phasing and imputation programs, including versions of SHAPEIT, Eagle, Beagle, minimac, PBWT, and IMPUTE, for imputation of completely missing SNPs with an HRC reference panel, with regard to quality and speed. We used a data set comprising 1,149 fully sequenced individuals from the German population, subsetting the SNPs to approximate the Illumina Infinium-Omni5 array. A total of 553,234 SNPs across two selected chromosomes were used for comparison between imputed and sequenced genotypes. We found that all tested programs, with the exception of PBWT, impute genotypes with very high accuracy (mean error rate < 0.005). PBWT hardly ever imputes the less frequent allele correctly (mean concordance for genotypes including the minor allele < 0.0002). For all programs, imputation accuracy drops for rare alleles with a frequency < 0.05. Concordance also drops with decreasing genotype probability; since overall concordance nevertheless remains high, genotypes with low probability must be rare. The mean concordance of SNPs with a genotype probability < 95% drops below 0.9, at which point disregarding imputed genotypes might prove favorable. For fast and accurate imputation, the combination of Eagle2.4.1 using a reference panel for phasing and Beagle5.1 for imputation performs best. Replacing Beagle5.1 with minimac3, minimac4, Beagle4.1, or IMPUTE4 results in a small gain in accuracy at a high cost in speed.
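The concordance comparison described above (imputed vs. sequenced genotypes, stratified by genotype probability) can be sketched in a few lines of NumPy; the function name, threshold default, and 0/1/2 genotype coding are illustrative:

```python
import numpy as np

def imputation_concordance(imputed, truth, gprob, threshold=0.95):
    """Concordance of imputed vs. sequenced genotypes (coded 0/1/2).

    imputed, truth : per-SNP genotype calls
    gprob          : per-SNP genotype probability reported by the imputer
    Returns overall concordance and concordance restricted to SNPs whose
    genotype probability falls below the threshold.
    """
    imputed, truth = np.asarray(imputed), np.asarray(truth)
    gprob = np.asarray(gprob)
    overall = np.mean(imputed == truth)
    low = gprob < threshold
    low_conc = np.mean(imputed[low] == truth[low]) if low.any() else np.nan
    return overall, low_conc
```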


2019 ◽  
Author(s):  
Simone Rubinacci ◽  
Olivier Delaneau ◽  
Jonathan Marchini

Abstract
Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods.
Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method builds on the observation made in the IMPUTE2 method that accuracy is optimized by using a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows-Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model.
Using the HRC reference panel, which has ~65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to use a smaller number of reference haplotypes, thus reducing computational cost.
Author summary
Genome-wide association studies (GWAS) typically use microarray technology to measure genotypes at several hundred thousand positions in the genome. However, reference panels of genetic variation consist of haplotype data at more than 100 fold more positions in the genome. Genotype imputation makes genotype predictions at all the reference panel sites using the GWAS data. Reference panels are continuing to grow in size, and this improves the accuracy of the predictions, but methods need to be able to scale to the increased size. We have developed a new version of the popular IMPUTE software that can handle reference panels with millions of haplotypes, with better performance than other published approaches. A notable property of the new method is that it scales sub-linearly with reference panel size: keeping the number of imputed markers constant, a 100 fold increase in reference panel size requires less than twice the computation time.


2018 ◽  
Author(s):  
CR Tench ◽  
Radu Tanasescu ◽  
CS Constantinescu ◽  
DP Auer ◽  
WJ Cottam

Abstract
Meta-analysis of published neuroimaging results is commonly performed using coordinate based meta-analysis (CBMA). Most commonly, CBMA algorithms detect spatial clustering of reported coordinates across multiple studies by assuming that results relating to the common hypothesis fall in similar anatomical locations. The null hypothesis is that studies report uncorrelated results, which is simulated by random coordinates. Multiple clusters are assumed independent, yet it is likely that the multiple results reported per study are not, and in fact represent a network effect. Here the multiple reported effect sizes (reported peak Z scores) are assumed multivariate normal, and maximum likelihood is used to estimate the parameters of the covariance matrix. The hypothesis is that the effect sizes are correlated. The parameters are covariances of effect size, considered as edges of a network, while clusters are considered as nodes. In this way, coordinate based meta-analysis of networks (CBMAN) estimates a network of reported meta-effects, rather than multiple independent effects (clusters).
CBMAN uses only the same data as CBMA, yet produces extra information in terms of the correlation between clusters. Here it is validated on numerically simulated data and demonstrated on real data used previously to demonstrate CBMA. The CBMA and CBMAN clusters are similar, despite the very different hypotheses.
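The node/edge construction can be illustrated with a complete-data simplification: treat clusters as columns of a studies-by-clusters matrix of peak Z scores, estimate their correlation, and draw an edge where clusters co-vary strongly across studies. The real CBMAN likelihood handles censored and missing entries by maximum likelihood; this sketch (threshold and names illustrative) shows only the basic idea:

```python
import numpy as np

def cluster_network(effect_sizes, corr_threshold=0.5):
    """Sketch of the CBMAN idea: clusters as nodes, correlations as edges.

    effect_sizes : (studies, clusters) matrix of reported peak Z scores
    Returns the cluster-by-cluster correlation matrix and the list of
    edges (i, j) whose absolute correlation meets the threshold.
    """
    corr = np.corrcoef(effect_sizes, rowvar=False)
    n = effect_sizes.shape[1]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if abs(corr[i, j]) >= corr_threshold]
    return corr, edges
```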


2018 ◽  
Vol 36 (Supplement 1) ◽  
pp. e94
Author(s):  
M. Kleber ◽  
L.P. Lyytikainen ◽  
G.E. Delgado ◽  
C. Drechsler ◽  
C. Wanner ◽  
...  

2020 ◽  
Vol 20 (1) ◽  
Author(s):  
Andreas Heinecke ◽  
Marta Tallarita ◽  
Maria De Iorio

Abstract
Background: Network meta-analysis (NMA) provides a powerful tool for the simultaneous evaluation of multiple treatments by combining evidence from different studies, allowing for direct and indirect comparisons between treatments. In recent years, NMA has become increasingly popular in the medical literature, and the underlying statistical methodologies are evolving in both the frequentist and Bayesian frameworks. Traditional NMA models are often based on the comparison of two treatment arms per study. These individual studies may measure outcomes at multiple time points that are not necessarily homogeneous across studies.
Methods: In this article we present a Bayesian model based on B-splines for the simultaneous analysis of outcomes across time points that allows for indirect comparison of treatments across different longitudinal studies.
Results: We illustrate the proposed approach in simulations as well as on real data examples available in the literature, and compare it with a model based on P-splines and one based on fractional polynomials, showing that our approach is flexible and overcomes the limitations of the latter.
Conclusions: The proposed approach is computationally efficient and able to accommodate a large class of temporal treatment effect patterns, allowing for direct and indirect comparisons of widely varying shapes of longitudinal profiles.
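The B-spline building block of such a model can be made concrete with the Cox-de Boor recursion, which evaluates the basis functions that the temporal treatment-effect curves are built from; this is a generic sketch, not the paper's implementation:

```python
import numpy as np

def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at the points x.

    Returns an array of shape (len(x), len(knots) - degree - 1); a
    longitudinal treatment-effect curve is a weighted sum of the columns.
    """
    x = np.asarray(x, float)
    t = np.asarray(knots, float)
    n_basis = len(t) - degree - 1
    # Degree-0 basis: indicator of each knot span; the last nonempty
    # span is right-closed so that x == t[-1] is covered.
    last = max(i for i in range(len(t) - 1) if t[i] < t[i + 1])
    B = np.zeros((len(x), len(t) - 1))
    for i in range(len(t) - 1):
        upper = (x <= t[i + 1]) if i == last else (x < t[i + 1])
        B[:, i] = (t[i] <= x) & upper
    # Cox-de Boor recursion up to the requested degree
    for k in range(1, degree + 1):
        Bk = np.zeros((len(x), len(t) - k - 1))
        for i in range(len(t) - k - 1):
            left = (x - t[i]) / (t[i + k] - t[i]) if t[i + k] > t[i] else 0.0
            right = ((t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1])
                     if t[i + k + 1] > t[i + 1] else 0.0)
            Bk[:, i] = left * B[:, i] + right * B[:, i + 1]
        B = Bk
    return B[:, :n_basis]
```

With clamped knots [0, 0, 0, 1, 1, 1] and degree 2, this reproduces the quadratic Bernstein basis on [0, 1].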

