Parentage assignment with genotyping-by-sequencing data

2018 ◽  
Author(s):  
Andrew Whalen ◽  
Gregor Gorjanc ◽  
John M Hickey

Abstract: In this paper we evaluate the use of genotyping-by-sequencing (GBS) data to perform parentage assignment in lieu of traditional array data. The use of GBS data raises two issues: first, for low-coverage GBS data, it may not be possible to call the genotype at many loci, a critical first step for detecting opposing homozygous markers; second, the amount of sequencing coverage may vary across individuals, making it challenging to directly compare the likelihood scores between putative parents. To address these issues we extend the probabilistic framework of Huisman (2017) and evaluate putative parents by comparing their (potentially noisy) genotypes to a series of proposal distributions. These distributions describe the expected genotype probabilities for the relatives of an individual. We assign a putative parent as a parent if it is classified as a parent (as opposed to, e.g., an unrelated individual) and if the assignment score passes a threshold. We evaluated this method on simulated data and found that (1) high-coverage GBS data performs similarly to array data and requires only a small number of markers to correctly assign parents, and (2) low-coverage GBS data (as low as 0.1x) can also be used, provided that it is obtained across a large number of markers. When analysing the low-coverage GBS data, we also found a high number of false positives if the true parent is not contained within the list of candidate parents, but this false positive rate can be greatly reduced by hand-tuning the assignment threshold. We provide this parentage assignment method as a standalone program called AlphaAssign.
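As a rough illustration of working with genotype probabilities rather than hard calls, the sketch below computes posterior genotype probabilities from read counts under a binomial read model and scores opposing homozygotes by integrating over that uncertainty. The error rate, the Hardy-Weinberg prior, and both function names are illustrative assumptions, not the AlphaAssign implementation.

```python
from math import comb

def genotype_posterior(ref_reads, alt_reads, e=0.01, p_alt=0.5):
    # P(true genotype | read counts): binomial read model with sequencing-
    # error rate e, combined with a Hardy-Weinberg prior at frequency p_alt.
    n = ref_reads + alt_reads
    allele_prob = {0: e, 1: 0.5, 2: 1 - e}      # P(alt read | genotype)
    prior = {0: (1 - p_alt) ** 2, 1: 2 * p_alt * (1 - p_alt), 2: p_alt ** 2}
    post = {g: comb(n, alt_reads) * pa ** alt_reads * (1 - pa) ** ref_reads * prior[g]
            for g, pa in allele_prob.items()}
    total = sum(post.values())
    return {g: v / total for g, v in post.items()}

def opposing_homozygote_prob(post_a, post_b):
    # Probability that two individuals are opposing homozygotes at a locus,
    # integrating over genotype uncertainty instead of using hard calls.
    return post_a[0] * post_b[2] + post_a[2] * post_b[0]
```

At moderate depth this behaves like hard calling (five reference reads give a near-certain homozygote), while at depth 1-2 the posteriors stay diffuse and opposing-homozygote evidence accumulates only softly across many loci.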

2018 ◽  
Author(s):  
Susanne Tilk ◽  
Alan Bergland ◽  
Aaron Goodman ◽  
Paul Schmidt ◽  
Dmitri Petrov ◽  
...  

Abstract: Evolve-and-resequence (E+R) experiments leverage next-generation sequencing technology to track the allele frequency dynamics of populations as they evolve. While previous work has shown that adaptive alleles can be detected by comparing frequency trajectories from many replicate populations, this power comes at the expense of high-coverage (>100x) sequencing of many pooled samples, which can be cost-prohibitive. Here, we show that accurate estimates of allele frequencies can be achieved with very shallow sequencing depths (<5x) via inference of known founder haplotypes in small genomic windows. This technique can be used to efficiently estimate frequencies for any number of bi-allelic SNPs in populations of any model organism founded with sequenced homozygous strains. Using both experimentally pooled and simulated samples of Drosophila melanogaster, we show that haplotype inference can improve allele frequency accuracy by orders of magnitude for up to 50 generations of recombination, and is robust to moderate levels of missing data, as well as different selection regimes. Finally, we show that a simple linear model generated from these simulations can predict the accuracy of haplotype-derived allele frequencies in other model organisms and experimental designs. To make these results broadly accessible for use in E+R experiments, we introduce HAF-pipe, an open-source software tool for calculating haplotype-derived allele frequencies from raw sequencing data. Ultimately, by reducing sequencing costs without sacrificing accuracy, our method facilitates E+R designs with higher replication and resolution, and thereby, increased power to detect adaptive alleles.
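The window-wise idea, estimating the founder-haplotype frequencies that best explain the observed pooled SNP frequencies and then reading SNP frequencies back off the haplotypes, can be sketched as a small non-negative least-squares fit using multiplicative updates. This is a toy stand-in for illustration, not HAF-pipe's inference.

```python
def haplotype_frequencies(founders, obs_freqs, iters=300):
    # founders: 0/1 matrix H, one row per SNP, one column per founder haplotype.
    # obs_freqs: noisy pooled alt-allele frequency at each SNP in the window.
    # Fit non-negative haplotype frequencies f minimising ||obs - H f|| with
    # multiplicative (Lee-Seung style) updates, then return f together with
    # the haplotype-derived SNP frequencies H f.
    n_snps, n_hap = len(founders), len(founders[0])
    f = [1.0 / n_hap] * n_hap
    for _ in range(iters):
        pred = [sum(founders[j][k] * f[k] for k in range(n_hap))
                for j in range(n_snps)]
        new_f = []
        for k in range(n_hap):
            num = sum(founders[j][k] * obs_freqs[j] for j in range(n_snps))
            den = sum(founders[j][k] * pred[j] for j in range(n_snps))
            new_f.append(f[k] * num / den if den > 0 else f[k])
        f = new_f
    derived = [sum(founders[j][k] * f[k] for k in range(n_hap))
               for j in range(n_snps)]
    return f, derived
```

Because a window has far fewer founder haplotypes than SNPs, even a handful of shallow-depth observations can pin down f, which is why the derived SNP frequencies are so much more accurate than raw per-site counts.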


Genetics ◽  
2014 ◽  
Vol 197 (1) ◽  
pp. 401-404 ◽  
Author(s):  
B. Emma Huang ◽  
Chitra Raghavan ◽  
Ramil Mauleon ◽  
Karl W. Broman ◽  
Hei Leung

2019 ◽  
Author(s):  
K. G. Dodds ◽  
J. C. McEwan ◽  
R. Brauning ◽  
T. C. van Stijn ◽  
S. J. Rowe ◽  
...  

Summary: Genotypes are often used to assign parentage in agricultural and ecological settings. Sequencing can be used to obtain genotypes but does not provide unambiguous genotype calls, especially when sequencing depth is low in order to reduce costs. In that case, standard parentage analysis methods no longer apply. A strategy for using low-depth sequencing data for parentage assignment is developed here. It entails the use of relatedness estimates along with a metric termed excess mismatch rate which, for parent-offspring pairs or trios, is the difference between the observed mismatch rate and the rate expected under a model of inheritance and allele reads without error. When more than one putative parent has similar statistics, bootstrapping can provide a measure of the relatedness similarity. Putative parent-offspring trios can be further checked for consistency by comparing the offspring’s estimated inbreeding to half the parent relatedness. Suitable thresholds are required for each metric. These methods were applied to a deer breeding operation consisting of two herds of different breeds. Relatedness estimates were more in line with expectation when the herds were analysed separately than when combined, although this did not alter which parents were the best matches with each offspring. Parentage results were largely consistent with those based on a microsatellite parentage panel with three discordant parent assignments out of 1561. Two models are investigated to allow the parentage metrics to be calculated with non-random selection of alleles. The tools and strategies given here allow parentage to be assigned from low-depth sequencing data.
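A simplified version of the excess mismatch rate idea, using hard genotype calls rather than the read-level quantities the method actually works with at low depth, might look like the following; thresholds and the unrelated-pair baseline are illustrative.

```python
def mismatch_stats(geno_a, geno_b, freqs):
    # geno_*: hard genotype calls as alt-allele dosage (0/1/2; None = no call).
    # freqs: alt-allele frequency at each locus.
    # Observed mismatch rate: fraction of co-called loci where the pair are
    # opposing homozygotes (0 vs 2). For an UNRELATED pair under Hardy-
    # Weinberg the per-locus expectation is 2 p^2 (1-p)^2, while for a true
    # parent-offspring pair with error-free calls it is zero, so the observed
    # rate itself plays the role of the excess mismatch rate here.
    compared = observed = 0
    expected_unrelated = 0.0
    for ga, gb, p in zip(geno_a, geno_b, freqs):
        if ga is None or gb is None:
            continue
        compared += 1
        if {ga, gb} == {0, 2}:
            observed += 1
        expected_unrelated += 2 * (p * (1 - p)) ** 2
    if compared == 0:
        return float('nan'), float('nan')
    return observed / compared, expected_unrelated / compared
```

A putative parent whose observed rate sits near zero is consistent with true parentage, while one near the unrelated-pair expectation can be excluded.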


2022 ◽  
Vol 12 ◽  
Author(s):  
Lawrence E. Bramham ◽  
Tongtong Wang ◽  
Erin E. Higgins ◽  
Isobel A. P. Parkin ◽  
Guy C. Barker ◽  
...  

Turnip mosaic virus (TuMV) induces disease in susceptible hosts, notably impacting cultivation of important crop species of the Brassica genus. Few effective plant viral disease management strategies exist, with the majority of current approaches aiming to mitigate the virus indirectly through control of aphid vector species. Multiple sources of genetic resistance to TuMV have been identified previously, although the majority are strain-specific and have not been exploited commercially. Here, two Brassica juncea lines (TWBJ14 and TWBJ20) with resistance against important TuMV isolates (UK 1, vVIR24, CDN 1, and GBR 6) representing the most prevalent pathotypes of TuMV (1, 3, 4, and 4, respectively) and known to overcome other sources of resistance, have been identified and characterized. Genetic inheritance of both resistances was determined to be based on a recessive two-gene model. Using both single nucleotide polymorphism (SNP) array and genotyping-by-sequencing (GBS) methods, quantitative trait loci (QTL) analyses were performed using first backcross (BC1) genetic mapping populations segregating for TuMV resistance. Pairs of statistically significant TuMV resistance-associated QTLs with additive interactive effects were identified on chromosomes A03 and A06 for both TWBJ14 and TWBJ20 material. Complementation testing between these B. juncea lines indicated that one resistance-linked locus was shared. Following established resistance gene nomenclature for recessive TuMV resistance genes, these new resistance-associated loci have been termed retr04 (chromosome A06, TWBJ14, and TWBJ20), retr05 (A03, TWBJ14), and retr06 (A03, TWBJ20). Genotyping-by-sequencing data investigated in parallel with the robust SNP array data proved highly suboptimal, with informative data not established for key BC1 parental samples. This necessitated careful consideration and the development of new methods for processing the compromised data. Using reductive screening of potential markers according to allelic variation and the recombination observed across the genotyped BC1 samples, the compromised GBS data were rendered functional, with near-equivalent QTL outputs to the SNP array data. The reductive screening strategy employed here offers an alternative to methods relying upon imputation or artificial correction of genotypic data and may prove effective for similar biparental QTL mapping studies.
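A minimal sketch of reductive marker screening for a BC1 population is given below. It covers only the allelic-variation criterion (a BC1 should segregate roughly 1:1), with hypothetical thresholds; the study's actual screening also used recombination patterns across samples.

```python
def screen_bc1_markers(calls, max_missing=0.2, max_ratio_dev=0.3):
    # calls: marker name -> list of BC1 genotypes ('A' = homozygous for the
    # recurrent parent, 'H' = heterozygous, None = missing). A BC1 population
    # should segregate ~1:1 A:H, so markers far from that ratio, or with too
    # much missing data, are dropped rather than imputed or corrected.
    kept = {}
    for marker, g in calls.items():
        n = len(g)
        missing = g.count(None)
        if missing / n > max_missing:
            continue
        called = n - missing
        if abs(g.count('A') / called - 0.5) > max_ratio_dev:
            continue
        kept[marker] = g
    return kept
```

Dropping suspect markers outright, instead of repairing them, trades map density for reliability, which is the trade-off the reductive strategy makes explicit.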


2018 ◽  
Author(s):  
Timothy P. Bilton ◽  
Matthew R. Schofield ◽  
Michael A. Black ◽  
David Chagné ◽  
Phillip L. Wilcox ◽  
...  

Abstract: Next-generation sequencing is an efficient method that allows for substantially more markers than previous technologies, providing opportunities for building high-density genetic linkage maps, which facilitate the development of non-model species’ genomic assemblies and the investigation of their genes. However, constructing genetic maps using data generated via high-throughput sequencing technology (e.g., genotyping-by-sequencing) is complicated by the presence of sequencing errors and genotyping errors resulting from missing parental alleles due to low sequencing depth. If unaccounted for, these errors lead to inflated genetic maps. In addition, map construction in many species is performed using full-sib family populations derived from the outcrossing of two individuals, where unknown parental phase and varying segregation types further complicate construction. We present a new methodology for modeling low-coverage sequencing data in the construction of genetic linkage maps using full-sib populations of diploid species, implemented in a package called GUSMap. Our model is based on an extension of the Lander-Green hidden Markov model that accounts for errors present in sequencing data. Results show that GUSMap was able to give accurate estimates of the recombination fractions and overall map distance, while most existing mapping packages produced inflated genetic maps in the presence of errors. Our results demonstrate the feasibility of using low-coverage sequencing data to produce genetic maps without requiring extensive filtering of potentially erroneous genotypes, provided that the associated errors are correctly accounted for in the model.
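The core modeling idea, a hidden Markov model whose emissions are read counts rather than called genotypes, can be illustrated with a toy two-state forward pass. This is a hypothetical single-parent setup for illustration only; GUSMap's actual model handles both parents, segregation types, and estimated error parameters.

```python
from math import comb

def read_emission(alt, depth, genotype, e=0.01):
    # P(alt reads | true genotype): binomial with sequencing-error rate e.
    p = {0: e, 1: 0.5, 2: 1 - e}[genotype]
    return comb(depth, alt) * p ** alt * (1 - p) ** (depth - alt)

def hmm_likelihood(obs, r):
    # Toy forward pass: the hidden state is which parental haplotype the
    # offspring inherited at each locus (state 0 -> true genotype 0,
    # state 1 -> heterozygous); adjacent loci switch state with
    # recombination fraction r. obs: (alt_reads, depth) per ordered locus.
    # Maximising this likelihood over r estimates the map distance.
    genotype_of = {0: 0, 1: 1}
    f = {s: 0.5 * read_emission(obs[0][0], obs[0][1], g)
         for s, g in genotype_of.items()}
    for alt, depth in obs[1:]:
        f = {s: (f[s] * (1 - r) + f[1 - s] * r) * read_emission(alt, depth, g)
             for s, g in genotype_of.items()}
    return sum(f.values())
```

Because the emission model already explains stray reads as sequencing error, an apparent double recombinant caused by a single erroneous read no longer inflates the estimate of r.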


2015 ◽  
Author(s):  
Rudy Arthur ◽  
Jared O'Connell ◽  
Ole Schulz-Trieglaff ◽  
Anthony J Cox

Whole-genome low-coverage sequencing has been combined with linkage-disequilibrium (LD) based genotype refinement to accurately and cost-effectively infer genotypes in large cohorts of individuals. Most genotype refinement methods are based on hidden Markov models, which are accurate but computationally expensive. We introduce an algorithm that models LD using a simple multivariate Gaussian distribution. The key feature of our algorithm is its speed: it is hundreds of times faster than other methods on the same data set, and its scaling behaviour is linear in the number of samples. We demonstrate the performance of the method on both low-coverage and high-coverage samples.
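The Gaussian trick can be illustrated in the bivariate case: condition a noisy site on a nearby site in LD via the Gaussian conditional mean, then combine that LD-based prior with the read likelihoods. This is a sketch under simplified assumptions; the actual method conditions on many sites jointly and iterates.

```python
from math import exp

def conditional_dosage(mu, cov, i, j, x_j):
    # E[x_i | x_j] under a bivariate Gaussian model of genotype dosages at
    # two sites in LD: mu is the mean dosage vector, cov the 2x2 covariance.
    return mu[i] + cov[i][j] / cov[j][j] * (x_j - mu[j])

def refine_call(read_lik, cond_mean, cond_var):
    # Combine per-genotype read likelihoods P(reads | g) with the LD-based
    # Gaussian prior N(g; cond_mean, cond_var); return the posterior-mode
    # genotype (the shared normalising constant cancels in the argmax).
    post = {g: read_lik[g] * exp(-(g - cond_mean) ** 2 / (2 * cond_var))
            for g in (0, 1, 2)}
    return max(post, key=post.get)
```

The conditional mean is a single fused multiply per site, which is the intuition behind the speed claim: no forward-backward pass over haplotype states is needed.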


2014 ◽  
Vol 12 (06) ◽  
pp. 1442005
Author(s):  
Junfang Chen ◽  
Pavlo Lutsik ◽  
Ruslan Akulenko ◽  
Jörn Walter ◽  
Volkhard Helms

Whole-genome bisulfite sequencing (WGBS) is an approach of growing importance. It is the only approach that provides a comprehensive picture of the genome-wide DNA methylation profile. However, obtaining sufficient genome and read coverage typically requires high sequencing costs. Bioinformatics tools can reduce this cost burden by improving the quality of sequencing data. We have developed a statistical method, Adjusted Local Kernel Smoother (AKSmooth), that can accurately and efficiently reconstruct single-CpG methylation estimates across the entire methylome using low-coverage bisulfite sequencing (Bi-Seq) data. We demonstrate AKSmooth's performance on the low-coverage (~4×) DNA methylation profiles of three human colon cancer samples and matched controls. Under the best set of parameters, AKSmooth-curated data showed high concordance with the gold-standard high-coverage sample (Pearson correlation 0.90), outperforming a popular analogous method. In addition, AKSmooth showed computational efficiency, with benchmark runtimes over 4.5 times better than the reference tool. To summarize, AKSmooth is a simple and efficient tool that can provide an accurate estimate of the human colon methylome from low-coverage WGBS data. The proposed method is implemented in R and is available at https://github.com/Junfang/AKSmooth.
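A coverage-weighted kernel smoother in this spirit, with an illustrative Gaussian kernel and bandwidth rather than the AKSmooth algorithm itself, could look like:

```python
from math import exp

def smooth_methylation(positions, meth, cov, bandwidth=100.0):
    # Coverage-weighted Nadaraya-Watson smoother: each CpG's methylation
    # estimate borrows strength from nearby sites, with deeper-covered sites
    # weighted more heavily and nearer sites weighted by a Gaussian kernel.
    # positions: CpG coordinates (bp); meth: raw methylation levels in [0,1];
    # cov: read depth at each CpG; bandwidth: kernel scale in bp.
    smoothed = []
    for p in positions:
        num = den = 0.0
        for q, m, c in zip(positions, meth, cov):
            w = c * exp(-((p - q) / bandwidth) ** 2)
            num += w * m
            den += w
        smoothed.append(num / den)
    return smoothed
```

At ~4× coverage a raw single-CpG estimate can only take values 0, 1/4, ..., 1, so pooling weighted evidence from neighbouring CpGs is what recovers a continuous, accurate methylation level.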


2019 ◽  
Vol 9 (10) ◽  
pp. 3239-3247 ◽  
Author(s):  
Ken G. Dodds ◽  
John C. McEwan ◽  
Rudiger Brauning ◽  
Tracey C. van Stijn ◽  
Suzanne J. Rowe ◽  
...  

Genotypes are often used to assign parentage in agricultural and ecological settings. Sequencing can be used to obtain genotypes but does not provide unambiguous genotype calls, especially when sequencing depth is low in order to reduce costs. In that case, standard parentage analysis methods no longer apply. A strategy for using low-depth sequencing data for parentage assignment is developed here. It entails the use of relatedness estimates along with a metric termed excess mismatch rate which, for parent-offspring pairs or trios, is the difference between the observed mismatch rate and the rate expected under a model of inheritance and allele reads without error. When more than one putative parent has similar statistics, bootstrapping can provide a measure of the relatedness similarity. Putative parent-offspring trios can be further checked for consistency by comparing the offspring’s estimated inbreeding to half the parent relatedness. Suitable thresholds are required for each metric. These methods were applied to a deer breeding operation consisting of two herds of different breeds. Relatedness estimates were more in line with expectation when the herds were analyzed separately than when combined, although this did not alter which parents were the best matches with each offspring. Parentage results were largely consistent with those based on a microsatellite parentage panel with three discordant parent assignments out of 1561. Two models are investigated to allow the parentage metrics to be calculated with non-random selection of alleles. The tools and strategies given here allow parentage to be assigned from low-depth sequencing data.


Author(s):  
Alicia R. Martin ◽  
Elizabeth G. Atkinson ◽  
Sinéad B. Chapman ◽  
Anne Stevenson ◽  
Rocky E. Stroud ◽  
...  

Abstract: Background: Genetic studies of biomedical phenotypes in underrepresented populations identify disproportionate numbers of novel associations. However, current genomics infrastructure, including most genotyping arrays and sequenced reference panels, best serves populations of European descent. A critical step for facilitating genetic studies in underrepresented populations is to ensure that genetic technologies accurately capture variation in all populations. Here, we quantify the accuracy of low-coverage sequencing in diverse African populations. Results: We sequenced the whole genomes of 91 individuals to high coverage (≥20X) from the Neuropsychiatric Genetics of African Population-Psychosis (NeuroGAP-Psychosis) study, in which participants were recruited from Ethiopia, Kenya, South Africa, and Uganda. We empirically tested two data generation strategies, GWAS arrays versus low-coverage sequencing, by calculating the concordance of imputed variants from these technologies with those from deep whole-genome sequencing data. We show that low-coverage sequencing at a depth of ≥4X captures variants of all frequencies more accurately than all commonly used GWAS arrays investigated, and at a comparable cost. Lower depths of sequencing (0.5-1X) performed comparably to commonly used low-density GWAS arrays. Low-coverage sequencing is also sensitive to novel variation, with 4X sequencing detecting 45% of singletons and 95% of common variants identified in high-coverage African whole genomes. Conclusion: These results indicate that low-coverage sequencing approaches surmount the problems induced by the ascertainment bias of common genotyping arrays, including those that capture variation most common in Europeans and Africans. Low-coverage sequencing effectively identifies novel variation (particularly in underrepresented populations) and presents opportunities to enhance variant discovery at a similar cost to traditional approaches.
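Concordance comparisons like these are typically computed as non-reference concordance, so that the abundant homozygous-reference matches do not inflate the score. A minimal sketch of that metric follows; the study's exact definition may differ in details such as per-frequency-bin stratification.

```python
def nonref_concordance(truth, test):
    # Agreement between truth and test genotype calls (alt-allele dosage),
    # restricted to sites where either call carries a non-reference allele,
    # so hom-ref/hom-ref matches are excluded from both numerator and
    # denominator. Returns NaN if no qualifying sites exist.
    match = total = 0
    for t, x in zip(truth, test):
        if t == 0 and x == 0:
            continue
        total += 1
        match += t == x
    return match / total if total else float('nan')
```

Stratifying this score by allele frequency (singletons vs. common variants, as in the abstract's 45%/95% figures) shows where each technology loses accuracy.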


2014 ◽  
Vol 12 (06) ◽  
pp. 1442002 ◽  
Author(s):  
Idoia Ochoa ◽  
Mikel Hernaez ◽  
Tsachy Weissman

With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. In accord, unprecedented volumes of genomic data will require storage for processing. There will be a dire need not only to compress aligned data, but also to generate compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, the focus thus far has been on the low-coverage regime, and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the Pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoney MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3):e59190, 2013] does not apply for high-coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to operations in the compressed domain by downstream applications.
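One common form of such field-wise splitting is placing each record field in its own stream, with positions delta-encoded since sorted alignments increase slowly, so that every stream stays statistically homogeneous and compresses well independently. The toy sketch below uses zlib and an illustrative three-field record layout, not the paper's compressor.

```python
import zlib

def split_and_compress(records):
    # records: (position, sequence, qualities) per aligned read, sorted by
    # position. Each field goes to its own independently compressed stream;
    # a downstream tool can then decompress, say, only the position stream.
    positions = [r[0] for r in records]
    deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    streams = {
        "pos": ",".join(map(str, deltas)).encode(),
        "seq": "".join(r[1] for r in records).encode(),
        "qual": "".join(r[2] for r in records).encode(),
    }
    return {name: zlib.compress(data, 9) for name, data in streams.items()}
```

Each stream round-trips independently, which is what makes compressed-domain queries (e.g. seeking by position without touching qualities) possible.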

