Efficient Imputation of Missing Markers in Low-Coverage Genotyping-by-Sequencing Data from Multiparental Crosses

B. Emma Huang; Chitra Raghavan; Ramil Mauleon; Karl W. Broman; Hei Leung

doi:10.1534/genetics.113.158014

Parentage assignment with genotyping-by-sequencing data

doi:10.1101/270561; 2018; Cited By ~ 3

Author(s): Andrew Whalen; Gregor Gorjanc; John M Hickey

Keyword(s): False Positive Rate; Simulated Data; Genotyping By Sequencing; Unrelated Individual; Parentage Assignment; Sequencing Data; High Coverage; Array Data; Assignment Method; Low Coverage

Abstract: In this paper we evaluate using genotype-by-sequencing (GBS) data to perform parentage assignment in lieu of traditional array data. The use of GBS data raises two issues. First, for low-coverage GBS data, it may not be possible to call the genotype at many loci, a critical first step for detecting opposing homozygous markers. Second, the amount of sequencing coverage may vary across individuals, making it challenging to directly compare the likelihood scores between putative parents. To address these issues we extend the probabilistic framework of Huisman (2017) and evaluate putative parents by comparing their (potentially noisy) genotypes to a series of proposal distributions. These distributions describe the expected genotype probabilities for the relatives of an individual. We assign a putative parent as a parent if it is classified as a parent (as opposed to, e.g., an unrelated individual) and if the assignment score passes a threshold. We evaluated this method on simulated data and found that (1) high-coverage GBS data performs similarly to array data and requires only a small number of markers to correctly assign parents and (2) low-coverage GBS data (as low as 0.1x) can also be used, provided that it is obtained across a large number of markers. When analysing the low-coverage GBS data, we also found a high number of false positives if the true parent is not contained within the list of candidate parents, but this false positive rate can be greatly reduced by hand-tuning the assignment threshold. We provide this parentage assignment method as a standalone program called AlphaAssign.
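
To illustrate the first issue raised in this abstract, the sketch below computes genotype likelihoods at a single biallelic locus from read counts under a simple binomial read model. It is not AlphaAssign's implementation: the function names, the flat genotype prior, and the 1% sequencing error rate are all assumptions chosen for illustration.

```python
from math import comb

def genotype_likelihoods(n_ref, n_alt, error=0.01):
    """P(reads | genotype) for genotypes carrying 0, 1 or 2 alternative alleles.

    Assumed model: reads are independent, and a read shows the alternative
    allele with probability `error`, 0.5 or 1 - `error` respectively.
    """
    n = n_ref + n_alt
    p_alt_by_genotype = (error, 0.5, 1.0 - error)
    return [
        comb(n, n_alt) * p_alt**n_alt * (1.0 - p_alt)**n_ref
        for p_alt in p_alt_by_genotype
    ]

def genotype_posterior(n_ref, n_alt, error=0.01):
    """Normalised genotype probabilities under a flat prior (an assumption)."""
    likelihoods = genotype_likelihoods(n_ref, n_alt, error)
    total = sum(likelihoods)
    return [value / total for value in likelihoods]

# One reference read versus ten reference reads at the same locus.
for n_ref in (1, 10):
    print(n_ref, [round(p, 4) for p in genotype_posterior(n_ref, 0)])
```

Under these assumptions a single reference read leaves about a third of the posterior mass on the heterozygote, so a hard genotype call (and any opposing-homozygote test built on it) is unreliable, whereas ten reference reads support the homozygote strongly.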

Accounting for Errors in Low Coverage High-Throughput Sequencing Data when Constructing Genetic Maps using Biparental Outcrossed Populations

doi:10.1101/249722; 2018

Author(s): Timothy P. Bilton; Matthew R. Schofield; Michael A. Black; David Chagné; Phillip L. Wilcox; ...

Keyword(s): High Throughput; Genetic Linkage; High Throughput Sequencing; Diploid Species; Genotyping By Sequencing; Genetic Maps; Linkage Maps; Sequencing Data; Genetic Linkage Maps; Low Coverage

Abstract: Next-generation sequencing is an efficient method that allows for substantially more markers than previous technologies, providing opportunities for building high-density genetic linkage maps, which facilitate the development of non-model species’ genomic assemblies and the investigation of their genes. However, constructing genetic maps using data generated via high-throughput sequencing technology (e.g., genotyping-by-sequencing) is complicated by the presence of sequencing errors and genotyping errors resulting from missing parental alleles due to low sequencing depth. If unaccounted for, these errors lead to inflated genetic maps. In addition, map construction in many species is performed using full-sib family populations derived from the outcrossing of two individuals, where unknown parental phase and varying segregation types further complicate construction. We present a new methodology for modeling low-coverage sequencing data in the construction of genetic linkage maps using full-sib populations of diploid species, implemented in a package called GUSMap. Our model is based on an extension of the Lander-Green hidden Markov model that accounts for errors present in sequencing data. Results show that GUSMap was able to give accurate estimates of the recombination fractions and overall map distance, while most existing mapping packages produced inflated genetic maps in the presence of errors. Our results demonstrate the feasibility of using low-coverage sequencing data to produce genetic maps without requiring extensive filtering of potentially erroneous genotypes, provided that the associated errors are correctly accounted for in the model.
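
The statement that unmodelled errors inflate genetic maps can be illustrated with a deliberately simplified two-marker calculation. The sketch assumes a backcross-like setting in which each marker genotype is independently miscalled with probability `error`, and converts recombination fractions to distances with Haldane's map function. It is not the GUSMap model, and all function names are hypothetical.

```python
from math import log

def haldane_cm(r):
    """Haldane map distance in centimorgans for a recombination fraction r < 0.5."""
    return -50.0 * log(1.0 - 2.0 * r)

def apparent_recombination(r, error):
    """Recombination fraction observed when each of the two marker genotypes
    is independently miscalled with probability `error` (assumed error model)."""
    q = 2.0 * error * (1.0 - error)  # exactly one of the two genotypes is wrong
    # A true recombinant stays visible unless one genotype flips;
    # a true non-recombinant looks recombinant when one genotype flips.
    return r * (1.0 - q) + (1.0 - r) * q

true_r = 0.01
for error in (0.0, 0.01, 0.05):
    r_obs = apparent_recombination(true_r, error)
    print(f"error={error:.2f}  r_obs={r_obs:.4f}  distance={haldane_cm(r_obs):.2f} cM")
```

Under these assumptions a 5% miscall rate makes two markers with a true recombination fraction of 0.01 appear about an order of magnitude further apart, and the excess accumulates over every interval of a dense map.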

The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny

Scientific Reports; doi:10.1038/srep19427; 2016; Vol 6 (1); Cited By ~ 48

Author(s): Davide Scaglione; Sebastian Reyes-Chin-Wo; Alberto Acquadro; Lutz Froenicke; Ezio Portis; ...

Keyword(s): Genome Sequence; De Novo; Genotyping By Sequencing; Read Depth; Globe Artichoke; Sequencing Data; Crop Species; Low Pass; Sequencing Strategy; Low Coverage

Abstract: Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising 13,588 scaffolds covering 725 Mb of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes of unknown phase from low sequencing read depth. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds and wild species within and beyond the Compositae and will facilitate the identification of economically important genes from related species.

Accucopy: accurate and fast inference of allele-specific copy number alterations from low-coverage low-purity tumor sequencing data

BMC Bioinformatics; doi:10.1186/s12859-020-03924-5; 2021; Vol 22 (1)

Author(s): Xinping Fan; Guanghao Luo; Yu S. Huang

Keyword(s): Copy Number; Bayesian Learning; Kernel Smoothing; Gaussian Mixture; Copy Number Alterations; Sequencing Data; Copy Numbers; Allele Specific; Tumor Sequencing; Low Coverage

Abstract: Background: Copy number alterations (CNAs), due to their large impact on the genome, have been an important contributing factor to oncogenesis and metastasis. Detecting genomic alterations from the shallow-sequencing data of a low-purity tumor sample remains a challenging task. Results: We introduce Accucopy, a method to infer total copy numbers (TCNs) and allele-specific copy numbers (ASCNs) from challenging low-purity and low-coverage tumor samples. Accucopy adopts many robust statistical techniques such as kernel smoothing of coverage differentiation information to discern signals from noise and combines ideas from time-series analysis and the signal-processing field to derive a range of estimates for the period in a histogram of coverage differentiation information. Statistical learning models such as the tiered Gaussian mixture model, the expectation–maximization algorithm, and sparse Bayesian learning were customized and built into the model. Accucopy is implemented in C++/Rust, packaged in a docker image, and supports non-human samples; more at http://www.yfish.org/software/. Conclusions: We describe Accucopy, a method that can predict both TCNs and ASCNs from low-coverage low-purity tumor sequencing data. Through comparative analyses in both simulated and real-sequencing samples, we demonstrate that Accucopy is more accurate than Sclust, ABSOLUTE, and Sequenza.

Parentage Analysis in Giant Grouper (Epinephelus lanceolatus) Using Microsatellite and SNP Markers from Genotyping-by-Sequencing Data

Genes; doi:10.3390/genes12071042; 2021; Vol 12 (7); pp. 1042

Author(s): Zhuoying Weng; Yang Yang; Xi Wang; Lina Wu; Sijie Hua; ...

Keyword(s): Fishery Management; Genotyping By Sequencing; Parentage Analysis; Snp Markers; Individual Identification; Pedigree Information; Nucleotide Polymorphisms; Sequencing Data; Polymorphic Snps; Mixed Family

Pedigree information is necessary for the maintenance of diversity in wild and captive populations. An accurate pedigree is determined by molecular marker-based parentage analysis, which may be influenced by the polymorphism and number of markers, integrity of samples, relatedness of parents, or different analysis programs. Here, we describe the first development of 208 single nucleotide polymorphisms (SNPs) and 11 microsatellites for giant grouper (Epinephelus lanceolatus) using genotyping-by-sequencing (GBS), and compare the power of SNPs and microsatellites for parentage and relatedness analysis, based on a mixed family composed of 4 candidate females, 4 candidate males and 289 offspring. CERVUS, PAPA and COLONY were used for mutual verification. We found that SNPs had a better potential for relatedness estimation, exclusion of non-parentage and individual identification than microsatellites, and that >98% accuracy of parentage assignment could be achieved with 100 polymorphic SNPs (MAF cut-off < 0.4) or 10 polymorphic microsatellites (mean Ho = 0.821, mean PIC = 0.651). This study provides a reference for the development of molecular markers for parentage analysis using next-generation sequencing, and contributes to molecular breeding, fishery management and population conservation.
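
The SNP versus microsatellite comparison rests on per-marker informativeness. As a rough illustration, the sketch below computes expected heterozygosity and the polymorphism information content (PIC) from allele frequencies using the standard definition of PIC. The function names and the allele frequencies are made up for illustration and are not taken from the study.

```python
from itertools import combinations

def expected_heterozygosity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def pic(freqs):
    """Polymorphism information content for one locus.

    PIC = 1 - sum(p_i^2) - sum over pairs i<j of 2 * p_i^2 * p_j^2
    """
    homozygosity = sum(p * p for p in freqs)
    pair_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - pair_term

snp = [0.5, 0.5]            # most informative biallelic SNP
microsatellite = [0.2] * 5  # hypothetical locus with five equally common alleles

print("SNP:", expected_heterozygosity(snp), pic(snp))
print("Microsatellite:", expected_heterozygosity(microsatellite), pic(microsatellite))
```

A biallelic SNP cannot exceed a PIC of 0.375, while a multi-allelic microsatellite can be considerably more informative, which is consistent with the abstract's finding that roughly ten times as many SNPs as microsatellites were needed for comparable assignment accuracy.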

Publisher Correction: Efficient phasing and imputation of low-coverage sequencing data using large reference panels

Nature Genetics; doi:10.1038/s41588-021-00788-0; 2021

Author(s): Simone Rubinacci; Diogo M. Ribeiro; Robin J. Hofmeister; Olivier Delaneau

Keyword(s): Sequencing Data; Low Coverage

Detecting selection in low-coverage high-throughput sequencing data using principal component analysis

BMC Bioinformatics; doi:10.1186/s12859-021-04375-2; 2021; Vol 22 (1)

Author(s): Jonas Meisner; Anders Albrechtsen; Kristian Hanghøj

Keyword(s): Principal Component Analysis; High Throughput; East Asian; Principal Component; Component Analysis; Human Populations; Population Genetic Study; Sequencing Data; High Quality; Low Coverage

Abstract: Background: Identification of selection signatures between populations is often an important part of a population genetic study. Leveraging high-throughput DNA sequencing, larger sample sizes of populations with similar ancestries have become increasingly common. This has led to the need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis for inference of the selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data. Materials and methods: We have extended two principal component analysis based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure. Results: Here, we present two selection statistics which we have implemented in the framework. These methods account for genotype uncertainty, opening the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high quality called genotypes. Conclusion: We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with those obtained from high quality genotype data. Moreover, we demonstrate that they outperform selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad-hoc filtering.

AFLAP: Assembly-Free Linkage Analysis Pipeline using k-mers from whole genome sequencing data

doi:10.1101/2020.09.14.296525; 2020

Author(s): Kyle Fletcher; Lin Zhang; Juliana Gil; Rongkui Han; Keri Cavanaugh; ...

Keyword(s): Linkage Analysis; Whole Genome Sequencing; Genome Sequencing; Genetic Map; Genotyping By Sequencing; Genetic Maps; Whole Genome; Sequencing Data; Analysis Pipeline; Genome Assemblies

Abstract: Background: Genetic maps are an important resource for validation of genome assemblies, trait discovery, and breeding. Next-generation sequencing has enabled production of high-density genetic maps constructed with tens of thousands of markers. Most current approaches require a genome assembly to identify markers. Our Assembly Free Linkage Analysis Pipeline (AFLAP) removes this requirement by using uniquely segregating k-mers as markers to rapidly construct a genotype table and perform subsequent linkage analysis. This avoids potential biases including preferential read alignment and variant calling. Results: The performance of AFLAP was determined in simulations and contrasted to a conventional workflow. We tested AFLAP using 100 F2 individuals of Arabidopsis thaliana, sequenced to low coverage. Genetic maps generated using k-mers contained over 130,000 markers that were concordant with the genomic assembly. The utility of AFLAP was then demonstrated by generating an accurate genetic map using genotyping-by-sequencing data of 235 recombinant inbred lines of Lactuca spp. AFLAP was then applied to 83 F1 individuals of the oomycete Bremia lactucae, sequenced to >5x coverage. The genetic map contained over 90,000 markers ordered in 19 large linkage groups. This genetic map was used to fragment, order, orient, and scaffold the genome, resulting in a much-improved reference assembly. Conclusions: AFLAP can be used to generate high-density linkage maps and improve genome assemblies of any organism when a mapping population is available using whole genome sequencing or genotyping-by-sequencing data. Genetic maps produced for B. lactucae were accurately aligned to the genome and guided significant improvements of the reference assembly.

Batch effects in population genomic studies with low‐coverage whole genome sequencing data: causes, detection, and mitigation

Molecular Ecology Resources; doi:10.1111/1755-0998.13559; 2021

Author(s): Runyang Nicolas Lou; Nina Overgaard Therkildsen

Keyword(s): Whole Genome Sequencing; Genome Sequencing; Whole Genome Sequencing Data; Whole Genome; Batch Effects; Sequencing Data; Population Genomic; Genomic Studies; Low Coverage

Efficient phasing and imputation of low-coverage sequencing data using large reference panels

doi:10.1101/2020.04.14.040329; 2020; Cited By ~ 2

Author(s): S. Rubinacci; D.M. Ribeiro; R. Hofmeister; O. Delaneau

Keyword(s): Paradigm Shift; Rare Variants; Association Studies; Cost Effective; Human Populations; Sequencing Data; Snp Arrays; Genomic Studies; Low Coverage; The Impact

Abstract: Low-coverage whole genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined as current imputation methods are computationally expensive and unable to leverage large reference panels. Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. It achieves imputation of a full genome for less than $1, outperforming existing methods by orders of magnitude, with an increased accuracy of more than 20% at rare variants. We also show that 1x coverage enables effective association studies and is better suited than dense SNP arrays to assess the impact of rare variations. Overall, this study demonstrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.
