Kinpute: Using identity by descent to improve genotype imputation

Mapping Intimacies ◽

10.1101/399147 ◽

2018 ◽

Author(s):

Mark Abney ◽

Aisha El Sherbiny

Keyword(s):

Sequence Data ◽

Imputation Accuracy ◽

Genotype Imputation ◽

Specific Reference ◽

Identity By Descent ◽

Imputation Methods ◽

Identical By Descent ◽

Novel Method ◽

Optimal Set ◽

Genotype Probabilities

Abstract Motivation Genotype imputation, though generally accurate, often results in many genotypes being poorly imputed, particularly in studies where the individuals are not well represented by standard reference panels. When individuals in the study share regions of the genome identical by descent (IBD), it is possible to use this information in combination with a study-specific reference panel (SSRP) to improve the imputation results. Kinpute uses IBD information (due to either recent, familial relatedness or distant, unknown ancestors) in conjunction with the output from linkage disequilibrium (LD) based imputation methods to compute more accurate genotype probabilities. Kinpute uses a novel method for IBD imputation, which works even in the absence of a pedigree, and results in substantially improved imputation quality. Results Given initial estimates of average IBD between subjects in the study sample, Kinpute uses a novel algorithm to select an optimal set of individuals to sequence and use as an SSRP. Kinpute is designed to use as input both this SSRP and the genotype probabilities output from other LD-based imputation software, and uses a new method to combine the LD imputed genotype probabilities with IBD configurations to substantially improve imputation. We tested Kinpute on a human population isolate where 98 individuals have been sequenced. In half of this sample, whose sequence data was masked, we used Impute2 to perform LD-based imputation and Kinpute was used to obtain higher accuracy genotype probabilities. Measures of imputation accuracy improved significantly, particularly for those genotypes that Impute2 imputed with low certainty. Availability Kinpute is an open-source and freely available C++ software package that can be downloaded from https://github.com/markabney/Kinpute/releases.

Kinpute: using identity by descent to improve genotype imputation

Bioinformatics ◽

10.1093/bioinformatics/btz221 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4321-4326

Author(s):

Mark Abney ◽

Aisha ElSherbiny

Keyword(s):

Sequence Data ◽

Imputation Accuracy ◽

Genotype Imputation ◽

Supplementary Information ◽

Specific Reference ◽

Imputation Methods ◽

Identical By Descent ◽

Novel Method ◽

Optimal Set ◽

Genotype Probabilities

Abstract Motivation Genotype imputation, though generally accurate, often results in many genotypes being poorly imputed, particularly in studies where the individuals are not well represented by standard reference panels. When individuals in the study share regions of the genome identical by descent (IBD), it is possible to use this information in combination with a study-specific reference panel (SSRP) to improve the imputation results. Kinpute uses IBD information—due to recent, familial relatedness or distant, unknown ancestors—in conjunction with the output from linkage disequilibrium (LD) based imputation methods to compute more accurate genotype probabilities. Kinpute uses a novel method for IBD imputation, which works even in the absence of a pedigree, and results in substantially improved imputation quality. Results Given initial estimates of average IBD between subjects in the study sample, Kinpute uses a novel algorithm to select an optimal set of individuals to sequence and use as an SSRP. Kinpute is designed to use as input both this SSRP and the genotype probabilities output from other LD-based imputation software, and uses a new method to combine the LD imputed genotype probabilities with IBD configurations to substantially improve imputation. We tested Kinpute on a human population isolate where 98 individuals have been sequenced. In half of this sample, whose sequence data was masked, we used Impute2 to perform LD-based imputation and Kinpute was used to obtain higher accuracy genotype probabilities. Measures of imputation accuracy improved significantly, particularly for those genotypes that Impute2 imputed with low certainty. Availability and implementation Kinpute is an open-source and freely available C++ software package that can be downloaded from. Supplementary information Supplementary data are available at Bioinformatics online.

Evaluation of Vicinity-based Hidden Markov Models for Genotype Imputation

10.1101/2021.09.28.462261 ◽

2021 ◽

Author(s):

Su Wang ◽

Miran Kim ◽

Xiaoqian Jiang ◽

Arif Ozgun Harmanci

Keyword(s):

Markov Models ◽

Rare Variants ◽

Hidden Markov ◽

Large Population ◽

Imputation Accuracy ◽

Cost Effective ◽

Genotype Imputation ◽

Specific Reference ◽

Imputation Methods ◽

The Cost

The decreasing cost of DNA sequencing has led to a great increase in our knowledge about genetic variation. While population-scale projects bring important insight into genotype-phenotype relationships, the cost of performing whole-genome sequencing on large samples is still prohibitive. In-silico genotype imputation coupled with genotyping-by-arrays is a cost-effective and accurate alternative for genotyping of common and uncommon variants. Imputation methods compare the genotypes of the typed variants with large population-specific reference panels and estimate the genotypes of untyped variants by making use of linkage disequilibrium patterns. The most accurate imputation methods are based on the Li-Stephens hidden Markov model (HMM), which treats the sequence of each chromosome as a mosaic of the haplotypes from the reference panel. Here we assess the accuracy of local-HMMs, where each untyped variant is imputed using the typed variants in a small window around itself (as small as 1 centimorgan). Locality-based imputation has recently been used by machine learning-based genotype imputation approaches. We assess how the parameters of the local-HMMs impact the imputation accuracy in a comprehensive set of benchmarks and show that local-HMMs can accurately impute common and uncommon variants and can be relaxed to impute rare variants as well. The source code for the local HMM implementations is publicly available at https://github.com/harmancilab/LoHaMMer.
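The mosaic idea behind the Li-Stephens model can be illustrated with a small forward-backward pass. The sketch below is not the LoHaMMer implementation: it assumes a haploid target, a single constant switch probability between adjacent sites, and a window measured in number of sites rather than centimorgans. The names `li_stephens_impute`, `impute_local`, `switch_prob`, and `error_prob` are hypothetical.

```python
from typing import List, Optional


def li_stephens_impute(
    ref_haps: List[List[int]],
    target: List[Optional[int]],
    switch_prob: float = 0.01,   # assumed constant per-site switch probability
    error_prob: float = 0.001,   # assumed allele mismatch (error) probability
) -> List[float]:
    """Return, per site, the posterior probability that the copied
    reference haplotype carries allele 1 (haploid target, 0/1 alleles,
    None at untyped sites)."""
    K, M = len(ref_haps), len(target)

    def emission(k: int, m: int) -> float:
        if target[m] is None:          # untyped site: no information
            return 1.0
        return 1.0 - error_prob if ref_haps[k][m] == target[m] else error_prob

    def normalise(v: List[float]) -> List[float]:
        s = sum(v)
        return [x / s for x in v]

    # Forward pass; each row is rescaled to sum to 1, so the mass that
    # jumps into any state is simply switch_prob / K.
    fwd = [[0.0] * K for _ in range(M)]
    fwd[0] = normalise([emission(k, 0) / K for k in range(K)])
    for m in range(1, M):
        fwd[m] = normalise([
            emission(k, m) * ((1 - switch_prob) * fwd[m - 1][k] + switch_prob / K)
            for k in range(K)
        ])

    # Backward pass with the same transition structure.
    bwd = [[1.0] * K for _ in range(M)]
    for m in range(M - 2, -1, -1):
        w = [emission(k, m + 1) * bwd[m + 1][k] for k in range(K)]
        jump = switch_prob * sum(w) / K
        bwd[m] = normalise([(1 - switch_prob) * w[k] + jump for k in range(K)])

    # Posterior over copied haplotypes, collapsed to P(allele == 1).
    probs = []
    for m in range(M):
        post = normalise([fwd[m][k] * bwd[m][k] for k in range(K)])
        probs.append(sum(post[k] for k in range(K) if ref_haps[k][m] == 1))
    return probs


def impute_local(ref_haps, target, site, half_window, **kwargs) -> float:
    """Impute one site using only the sites within half_window of it."""
    lo = max(0, site - half_window)
    hi = min(len(target), site + half_window + 1)
    window_refs = [h[lo:hi] for h in ref_haps]
    return li_stephens_impute(window_refs, target[lo:hi], **kwargs)[site - lo]
```

For example, with reference haplotypes `[[0, 0, 0], [1, 1, 1]]` and a target typed as `[1, None, 1]`, the untyped middle site is imputed as allele 1 with probability close to 1, because the flanking typed sites both match the second reference haplotype.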

Optimizing Genomic Selection in Dezhou Donkey Using Low Coverage Whole Genome Sequencing

10.21203/rs.3.rs-607740/v1 ◽

2021 ◽

Author(s):

Changheng Zhao ◽

Jun Teng ◽

Xinhao Zhang ◽

Dan Wang ◽

Xinyi Zhang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genomic Selection ◽

Genome Sequencing ◽

Sequence Data ◽

Low Cost ◽

Imputation Accuracy ◽

Genotype Imputation ◽

Whole Genome Sequence ◽

Whole Genome ◽

Low Coverage

Abstract Background Low coverage whole genome sequencing is a low-cost genotyping technology. Combined with genotype imputation approaches, it is likely to become a critical component of cost-efficient genomic selection programs in agricultural livestock. Here, we used the low-coverage sequence data of 617 Dezhou donkeys to investigate the performance of genotype imputation for low coverage whole genome sequence data and genomic selection based on the imputed genotype data. The specific aims were: (i) to measure the accuracy of genotype imputation under different sequencing depths, sample sizes, MAFs, and imputation pipelines; and (ii) to assess the accuracy of genomic selection under different marker densities derived from the imputed sequence data, different strategies for constructing the genomic relationship matrices, and single- vs multi-trait models. Results We found that a high imputation accuracy (> 0.95) can be achieved for sequence data with sequencing depth as low as 1x and the number of sequenced individuals equal to 400. For genomic selection, the best performance was obtained by using a marker density of 410K and a G matrix constructed using marker dosage information. Multi-trait GBLUP performed better than single-trait GBLUP. Conclusions Our study demonstrates that low coverage whole genome sequencing would be a cost-effective method for genomic selection in Dezhou Donkey.
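One common way to build a G matrix from marker dosages is the first VanRaden formulation, G = ZZ' / (2 Σ p_j (1 - p_j)), where Z holds dosages centred by twice the allele frequency. The abstract does not state which construction the authors used, so the sketch below is only an assumed illustration; the function name `genomic_relationship_matrix` is hypothetical, and allele frequencies are estimated from the dosages themselves.

```python
from typing import List


def genomic_relationship_matrix(dosages: List[List[float]]) -> List[List[float]]:
    """Build a VanRaden-style G matrix from an individuals x markers table
    of allele dosages in [0, 2] (imputed dosages may be fractional)."""
    n, m = len(dosages), len(dosages[0])

    # Allele frequency per marker, estimated from the sample (assumption).
    freqs = [sum(row[j] for row in dosages) / (2.0 * n) for j in range(m)]

    # Scaling term 2 * sum_j p_j (1 - p_j).
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    if scale == 0.0:
        raise ValueError("all markers are monomorphic; G is undefined")

    # Centre each dosage by its expected value 2 * p_j.
    centred = [[row[j] - 2.0 * freqs[j] for j in range(m)] for row in dosages]

    # G = Z Z^T / scale
    return [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / scale
         for k in range(n)]
        for i in range(n)
    ]
```

Using dosages rather than hard-called genotypes lets the uncertainty of low-coverage imputation carry through to the relationship estimates, which is one plausible reason a dosage-based G matrix could perform well.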

Population-specific recombination maps from segments of identity by descent

10.1101/868091 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ying Zhou ◽

Brian L. Browning ◽

Sharon R. Browning

Keyword(s):

Sequence Data ◽

Pearson Correlation ◽

Computational Cost ◽

Genotype Imputation ◽

Large Set ◽

Recombination Rates ◽

Identity By Descent ◽

European Americans ◽

Heart Study ◽

Similar Accuracy

Abstract Recombination rates vary significantly across the genome, and estimates of recombination rates are needed for downstream analyses such as haplotype phasing and genotype imputation. Existing methods for recombination rate estimation are limited by insufficient amounts of informative genetic data or by high computational cost. We present a method for using segments of identity by descent to infer recombination rates. Our method can be applied to sequenced population cohorts to obtain high-resolution, population-specific recombination maps. We use our method to generate new recombination maps for European Americans and for African Americans from TOPMed sequence data from the Framingham Heart Study (1626 unrelated individuals) and the Jackson Heart Study (2046 unrelated individuals). We compare our maps to existing maps using the Pearson correlation between estimated recombination rates. In Europeans we use the deCODE map, which is based on a very large set of Icelandic family data (126,407 meioses), as a gold standard against which to compare other maps. Our European American map has higher accuracy at fine-scale resolution (1-10kb) than linkage disequilibrium maps from the HapMap and 1000 Genomes projects. Our African American map has much higher accuracy than an admixture-based map that is derived from a similar number of individuals, and similar accuracy at fine scales (1-10kb) to an admixture-based map that is derived from 15 times as many individuals.

ModStore: Genotype Imputation as a Service Powered by SG10K Reference Panel

Current Bioinformatics ◽

10.2174/1574893615999200831112522 ◽

2020 ◽

Vol 15 ◽

Author(s):

Weiwen Zhang ◽

Long Wang ◽

Theint Theint Aye

Keyword(s):

Association Study ◽

High Performance ◽

Genome Wide Association Study ◽

Imputation Accuracy ◽

Genotype Imputation ◽

Reference Panel ◽

Genome Wide Association ◽

Specific Reference ◽

Data Set ◽

Genome Wide

Background: Asia is the largest continent in the world with a large group of populations. However, we still lack an imputation server with an Asian-specific reference panel to estimate genotypes for genome wide association studies in Asia. Currently, two well-known imputation servers are available, i.e., the Michigan imputation server in the US and the Sanger server in the UK. However, the quality of imputation for Southeast Asia's populations is not satisfactory when using their genotype imputation services and reference panels. Objective: In this paper, we develop the ModStore imputation server with a specially designed reference panel to offer genotype imputation as a service, aiming to increase the power of genome wide association studies in Singapore in the context of National Precision Medicine. Method: We present the implementation and customization of the ModStore imputation server on high performance computing infrastructure. Meanwhile, we construct a reference panel based on whole-genome sequencing of Singaporeans, referred to as the SG10K reference panel, for improving the imputation accuracy of Southeast Asia's populations. Results: Experiment results show that by using the SG10K reference panel, over 79% improvement of mean Rsq can be achieved for the imputation of a data set of three Singapore ethnic populations, i.e., Malay, Chinese, and Indian, under MAF<0.005 compared to the 1000 Genomes reference panel. Conclusion: With the ModStore imputation server, genotype imputation can be performed more accurately for data derived from array-based pharmacogenomics and pre-existing Southeast Asia's population-scale genetic.

Assessment of Imputation Quality: Comparison of Phasing and Imputation Algorithms in Real Data

Frontiers in Genetics ◽

10.3389/fgene.2021.724037 ◽

2021 ◽

Vol 12 ◽

Author(s):

Katharina Stahl ◽

Damian Gola ◽

Inke R. König

Keyword(s):

Imputation Accuracy ◽

Real Data ◽

Genotype Imputation ◽

Reference Panel ◽

German Population ◽

Data Set ◽

Genotype Probability ◽

Small Gain ◽

High Concordance ◽

Genotype Probabilities

Despite the widespread use of genotype imputation tools and the availability of different approaches, the latest developments of currently used programs have not been compared comprehensively. We therefore assessed the performance of 35 combinations of phasing and imputation programs, including versions of SHAPEIT, Eagle, Beagle, minimac, PBWT, and IMPUTE, for genetic imputation of completely missing SNPs with an HRC reference panel regarding quality and speed. We used a data set comprising 1,149 fully sequenced individuals from the German population, subsetting the SNPs to approximate the Illumina Infinium-Omni5 array. A total of 553,234 SNPs across two selected chromosomes were utilized for comparison between imputed and sequenced genotypes. We found that all tested programs with the exception of PBWT impute genotypes with very high accuracy (mean error rate < 0.005). PBWT hardly ever imputes the less frequent allele correctly (mean concordance for genotypes including the minor allele <0.0002). For all programs, imputation accuracy drops for rare alleles with a frequency <0.05. Even though overall concordance is high, concordance drops with genotype probability, indicating that low genotype probabilities are rare. The mean concordance of SNPs with a genotype probability <95% drops below 0.9, at which point disregarding imputed genotypes might prove favorable. For fast and accurate imputation, a combination of Eagle2.4.1 using a reference panel for phasing and Beagle5.1 for imputation performs best. Replacing Beagle5.1 with minimac3, minimac4, Beagle4.1, or IMPUTE4 results in a small gain in accuracy at a high cost of speed.
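The relationship between genotype probability and concordance described above can be checked with a few lines of code. This is a minimal sketch, not the authors' evaluation pipeline: it assumes genotypes are coded 0/1/2, that each imputed call comes with three genotype probabilities, and that the best-guess call is the most probable genotype. The function names and the default 0.95 threshold parameter are illustrative.

```python
from typing import List, Optional, Sequence, Tuple


def best_guess(probs: Sequence[float]) -> int:
    """Most probable genotype (0, 1 or 2) for one imputed call."""
    return max(range(3), key=lambda g: probs[g])


def concordance_by_confidence(
    true_genotypes: Sequence[int],
    genotype_probs: Sequence[Sequence[float]],
    threshold: float = 0.95,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (concordance of confident calls, concordance of uncertain calls).

    A call is 'confident' when its largest genotype probability is at
    least `threshold`. A bucket with no calls yields None.
    """
    confident: List[bool] = []
    uncertain: List[bool] = []
    for truth, probs in zip(true_genotypes, genotype_probs):
        bucket = confident if max(probs) >= threshold else uncertain
        bucket.append(best_guess(probs) == truth)

    def rate(bucket: List[bool]) -> Optional[float]:
        return sum(bucket) / len(bucket) if bucket else None

    return rate(confident), rate(uncertain)
```

Comparing the two returned rates shows directly whether discarding low-probability calls would raise overall concordance in a given data set.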

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Nature ◽

10.1038/s41586-021-03205-y ◽

2021 ◽

Vol 590 (7845) ◽

pp. 290-299 ◽

Cited By ~ 22

Author(s):

Daniel Taliun ◽

Daniel N. Harris ◽

Michael D. Kessler ◽

Jedidiah Carlson ◽

...

Keyword(s):

Rare Variants ◽

Sequence Data ◽

Association Studies ◽

Genotype Imputation ◽

Genome Wide Association Studies ◽

Phenotypic Data ◽

Treatment And Prevention ◽

Genome Wide ◽

Diverse Backgrounds ◽

Unmapped Reads

Abstract The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes)1. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.

Assessing single nucleotide polymorphism selection methods for the development of a low-density panel optimized for imputation in South African Drakensberger beef cattle

Journal of Animal Science ◽

10.1093/jas/skab118 ◽

2021 ◽

Author(s):

Simon F Lashmar ◽

Donagh P Berry ◽

Rian Pierneef ◽

Farai C Muchadeyi ◽

Carina Visser

Keyword(s):

South African ◽

Clustering Algorithm ◽

Imputation Accuracy ◽

Developed Countries ◽

Genotype Imputation ◽

Selection Strategy ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Single Nucleotide Polymorphism Selection ◽

The Mean

Abstract A major obstacle in applying genomic selection (GS) to uniquely adapted local breeds in less-developed countries has been the cost of genotyping at high densities of single nucleotide polymorphisms (SNP). Cost reduction can be achieved by imputing genotypes from lower to higher densities. Locally adapted breeds tend to be admixed and exhibit a high degree of genomic heterogeneity thus necessitating the optimization of SNP selection for downstream imputation. The aim of this study was to quantify the achievable imputation accuracy for a sample of 1,135 South African (SA) Drakensberger using several custom-derived lower-density panels varying in both SNP density and how the SNP were selected. From a pool of 120,608 genotyped SNP, subsets of SNP were chosen 1) at random, 2) with even genomic dispersion, 3) by maximizing the mean minor allele frequency (MAF), 4) using a combined score of MAF and linkage disequilibrium (LD), 5) using a partitioning-around-medoids (PAM) algorithm, and finally 6) using a hierarchical LD-based clustering algorithm. Imputation accuracy to higher density improved as SNP density increased; animal-wise imputation accuracy defined as the within-animal correlation between the imputed and actual alleles ranged from 0.625 to 0.990 when 2,500 randomly selected SNP were chosen versus a range of 0.918 to 0.999 when 50,000 randomly selected SNP were used. At a panel density of 10,000 SNP, the mean (standard deviation) animal-wise allele concordance rate was 0.976 (0.018) versus 0.982 (0.014) when the worst (i.e., random) as opposed to the best (i.e., combination of MAF and LD) SNP selection strategy was employed. A difference of 0.071 units was observed between the mean correlation-based accuracy of imputed SNP categorized as low (0.01<MAF≤0.1) versus high MAF (0.4<MAF≤0.5). Greater mean imputation accuracy was achieved for SNP located on autosomal extremes when these regions were populated with more SNP. 
The presented results suggested that genotype imputation can be a practical cost-saving strategy for indigenous breeds such as the South African Drakensberger. Based on the results, a genotyping panel consisting of approximately 10,000 SNP selected based on a combination of MAF and LD would suffice in achieving a less than 3% imputation error rate for a breed characterized by genomic admixture on the condition that these SNP are selected based on breed-specific selection criteria.
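The two animal-wise metrics named in the abstract, the within-animal correlation between imputed and actual alleles and the allele concordance rate, can be sketched as follows. The coding is an assumption: genotypes are taken as allele counts 0/1/2, and allele concordance is computed as the fraction of the two alleles per SNP that agree. The function names are hypothetical and this is not the authors' code.

```python
from math import sqrt
from typing import Sequence


def animal_correlation(actual: Sequence[float], imputed: Sequence[float]) -> float:
    """Pearson correlation between one animal's actual and imputed
    allele counts across SNP (undefined if either vector is constant)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_i = sum(imputed) / n
    cov = sum((a - mean_a) * (i - mean_i) for a, i in zip(actual, imputed))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    return cov / sqrt(var_a * var_i)


def allele_concordance(actual: Sequence[int], imputed: Sequence[int]) -> float:
    """Fraction of alleles imputed correctly for one animal, assuming
    0/1/2 coding so that |actual - imputed| counts discordant alleles."""
    discordant = sum(abs(a - i) for a, i in zip(actual, imputed))
    return 1.0 - discordant / (2.0 * len(actual))
```

For an animal with actual genotypes `[0, 1, 2, 1]` and imputed genotypes `[0, 1, 2, 2]`, one of eight alleles is wrong, giving a concordance of 0.875, while the correlation is lower because it also penalises the shift in the imputed mean.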

EagleImp: Fast and Accurate Genome-wide Phasing and Imputation in a Single Tool

10.1101/2022.01.11.475810 ◽

2022 ◽

Author(s):

Lars Wienbrandt ◽

David Ellinghaus

Keyword(s):

Memory Management ◽

Imputation Accuracy ◽

Simulated Data ◽

Genotype Imputation ◽

Whole Genome Sequencing Data ◽

Common Variants ◽

Sequencing Data ◽

1000 Genomes ◽

Genome Wide ◽

Reference Genomes

Background: Reference-based phasing and genotype imputation algorithms have been developed with sublinear theoretical runtime behaviour, but runtimes are still high in practice when large genome-wide reference datasets are used. Methods: We developed EagleImp, a software with algorithmic and technical improvements and new features for accurate and accelerated phasing and imputation in a single tool. Results: We compared accuracy and runtime of EagleImp with Eagle2, PBWT and prominent imputation servers using whole-genome sequencing data from the 1000 Genomes Project, the Haplotype Reference Consortium and simulated data with more than 1 million reference genomes. EagleImp is 2 to 10 times faster (depending on the single or multiprocessor configuration selected) than Eagle2/PBWT, with the same or better phasing and imputation quality in all tested scenarios. For common variants investigated in typical GWAS studies, EagleImp provides same or higher imputation accuracy than the Sanger Imputation Service, Michigan Imputation Server and the newly developed TOPMed Imputation Server, despite larger (not publicly available) reference panels. It has many new features, including automated chromosome splitting and memory management at runtime to avoid job aborts, fast reading and writing of large files, and various user-configurable algorithm and output options. Conclusions: Due to the technical optimisations, EagleImp can perform fast and accurate reference-based phasing and imputation for future very large reference panels with more than 1 million genomes. EagleImp is freely available for download from https://github.com/ikmb/eagleimp.

Estimating Semi-Parametric Missing Values with Iterative Imputation

International Journal of Data Warehousing and Mining ◽

10.4018/jdwm.2010070101 ◽

2010 ◽

Vol 6 (3) ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Shichao Zhang

Keyword(s):

Prior Knowledge ◽

Efficient Method ◽

Missing Values ◽

Imputation Accuracy ◽

Imputation Method ◽

Imputation Methods ◽

Real Dataset ◽

Regression Imputation ◽

Target Values ◽

Nonparametric Imputation

In this paper, the author designs an efficient method for iteratively imputing missing target values with semi-parametric kernel regression imputation, known as the semi-parametric iterative imputation algorithm (SIIA). While there is little prior knowledge on the datasets, the proposed iterative imputation method, which imputes each missing value several times until the algorithm converges in each model, utilizes a substantially useful amount of information. Additionally, this information includes occurrences involving missing values and captures the real dataset distribution more easily than the parametric or nonparametric imputation techniques. Experimental results show that the author’s imputation methods outperform the existing methods in terms of imputation accuracy, in particular in situations with a high missing ratio.
