Estimating genotype error rates from high-coverage next-generation sequence data

Jeffrey D. Wall; Ling Fung Tang; Brandon Zerbe; Mark N. Kvale; Pui-Yan Kwok; Catherine Schaefer; Neil Risch

doi:10.1101/gr.168393.113

SiNPle: Fast and Sensitive Variant Calling for Deep Sequencing Data

Genes ◽

10.3390/genes10080561 ◽

2019 ◽

Vol 10 (8) ◽

pp. 561 ◽

Cited By ~ 5

Author(s):

Luca Ferretti ◽

Chandana Tennakoon ◽

Adrian Silesian ◽

Graham Freimanis andPaolo Ribeca

Keyword(s):

Deep Sequencing ◽

High Throughput Sequencing ◽

Sequence Data ◽

Analytical Formula ◽

Real Life ◽

Variant Calling ◽

Error Rates ◽

High Coverage ◽

Sequencing Errors ◽

Very High

Current high-throughput sequencing technologies can generate sequence data and provide information on the genetic composition of samples at very high coverage. Deep sequencing approaches enable the detection of rare variants in heterogeneous samples, such as viral quasi-species, but also have the undesired effect of amplifying sequencing errors and artefacts. Distinguishing real variants from such noise is not straightforward. Variant callers that can handle pooled samples can be in trouble at extremely high read depths, while at lower depths sensitivity is often sacrificed to specificity. In this paper, we propose SiNPle (Simplified Inference of Novel Polymorphisms from Large coveragE), a fast and effective software for variant calling. SiNPle is based on a simplified Bayesian approach to compute the posterior probability that a variant is not generated by sequencing errors or PCR artefacts. The Bayesian model takes into consideration individual base qualities as well as their distribution, the baseline error rates during both the sequencing and the PCR stage, the prior distribution of variant frequencies and their strandedness. Our approach leads to an approximate but extremely fast computation of posterior probabilities even for very high coverage data, since the expression for the posterior distribution is a simple analytical formula in terms of summary statistics for the variants appearing at each site in the genome. These statistics can be used to filter out putative SNPs and indels according to the required level of sensitivity. We tested SiNPle on several simulated and real-life viral datasets to show that it is faster and more sensitive than existing methods. The source code for SiNPle is freely available to download and compile, or as a Conda/Bioconda package.

Download Full-text

Single-molecule sequencing of theDrosophila serratagenome

10.1101/090969 ◽

2016 ◽

Author(s):

Scott L. Allen ◽

Emily K. Delaney ◽

Artyom Kopp ◽

Stephen F. Chenoweth

Keyword(s):

Single Molecule ◽

De Novo ◽

Sequence Data ◽

Error Rates ◽

Model Species ◽

High Coverage ◽

Single Molecule Sequencing ◽

Long Read ◽

Drosophila Serrata ◽

High Degree

ABSTRACTLong read sequencing technology promises to greatly enhancede novoassembly of genomes for non-model species. While error rates have been a large stumbling block, sequencing at high coverage allows reads to be self-corrected. Here we sequence andde novoassemble the genome ofDrosophila serrata, a non-model species from themontiumsubgroup that has been well studied for clines and sexual selection. Using 11 PacBio SMRT cells, we generated 12 Gbp of raw sequence data comprising approximately 65x whole genome coverage. Read lengths averaged 8,940 bp (NRead50 12,200) with the longest read at 53 Kbp. We self-corrected reads using the PBDagCon algorithm and assembled the genome using the MHAP algorithm within the PBcR assembler. Total genome length was 198 Mbp with an N50 just under 1 Mbp. Contigs displayed a high degree of arm-level conservation withD. melanogaster. We also provide an initial annotation for this genome usingin silicogene predictions that were supported by RNA-seq data.

Download Full-text

ADVANCES IN THE USE OF NEXT-GENERATION SEQUENCE DATA IN PLANT SYSTEMATICS AND EVOLUTION

Acta Horticulturae ◽

10.17660/actahortic.2010.859.23 ◽

2010 ◽

pp. 193-206 ◽

Cited By ~ 3

Author(s):

D.E. Soltis ◽

G. Burleigh ◽

W.B. Barbazuk ◽

M.J. Moore ◽

P.S. Soltis

Keyword(s):

Sequence Data ◽

Plant Systematics ◽

Next Generation

Download Full-text

Genome diversity in Ukraine

GigaScience ◽

10.1093/gigascience/giaa159 ◽

2021 ◽

Vol 10 (1) ◽

Author(s):

Taras K Oleksyk ◽

Walter W Wolfsberger ◽

Alexandra M Weber ◽

Khrystyna Shchubelka ◽

Olga T Oleksyk ◽

...

Keyword(s):

Sequence Data ◽

Copy Number Variations ◽

Genomic Variation ◽

High Coverage ◽

Genome Data ◽

New Information ◽

Genome Wide ◽

Public Data ◽

Genome Wide Data ◽

Multiple Samples

Abstract Background The main goal of this collaborative effort is to provide genome-wide data for the previously underrepresented population in Eastern Europe, and to provide cross-validation of the data from genome sequences and genotypes of the same individuals acquired by different technologies. We collected 97 genome-grade DNA samples from consented individuals representing major regions of Ukraine that were consented for public data release. BGISEQ-500 sequence data and genotypes by an Illumina GWAS chip were cross-validated on multiple samples and additionally referenced to 1 sample that has been resequenced by Illumina NovaSeq6000 S4 at high coverage. Results The genome data have been searched for genomic variation represented in this population, and a number of variants have been reported: large structural variants, indels, copy number variations, single-nucletide polymorphisms, and microsatellites. To our knowledge, this study provides the largest to-date survey of genetic variation in Ukraine, creating a public reference resource aiming to provide data for medical research in a large understudied population. Conclusions Our results indicate that the genetic diversity of the Ukrainian population is uniquely shaped by evolutionary and demographic forces and cannot be ignored in future genetic and biomedical studies. These data will contribute a wealth of new information bringing forth a wealth of novel, endemic and medically related alleles.

Download Full-text

A long reads-based de-novo assembly of the genome of the Arlee homozygous line reveals chromosomal rearrangements in rainbow trout

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab052 ◽

2021 ◽

Author(s):

Guangtu Gao ◽

Susana Magadan ◽

Geoffrey C Waldbieser ◽

Ramey C Youngblood ◽

Paul A Wheeler ◽

...

Keyword(s):

Rainbow Trout ◽

Chromosome Number ◽

Genome Assembly ◽

De Novo Assembly ◽

De Novo ◽

Sequence Data ◽

Structural Variations ◽

High Coverage ◽

Haploid Chromosome Number ◽

Long Reads

Abstract Currently, there is still a need to improve the contiguity of the rainbow trout reference genome and to use multiple genetic backgrounds that will represent the genetic diversity of this species. The Arlee doubled haploid line was originated from a domesticated hatchery strain that was originally collected from the northern California coast. The Canu pipeline was used to generate the Arlee line genome de-novo assembly from high coverage PacBio long-reads sequence data. The assembly was further improved with Bionano optical maps and Hi-C proximity ligation sequence data to generate 32 major scaffolds corresponding to the karyotype of the Arlee line (2 N = 64). It is composed of 938 scaffolds with N50 of 39.16 Mb and a total length of 2.33 Gb, of which ∼95% was in 32 chromosome sequences with only 438 gaps between contigs and scaffolds. In rainbow trout the haploid chromosome number can vary from 29 to 32. In the Arlee karyotype the haploid chromosome number is 32 because chromosomes Omy04, 14 and 25 are divided into six acrocentric chromosomes. Additional structural variations that were identified in the Arlee genome included the major inversions on chromosomes Omy05 and Omy20 and additional 15 smaller inversions that will require further validation. This is also the first rainbow trout genome assembly that includes a scaffold with the sex-determination gene (sdY) in the chromosome Y sequence. The utility of this genome assembly is demonstrated through the improved annotation of the duplicated genome loci that harbor the IGH genes on chromosomes Omy12 and Omy13.

Download Full-text

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Briefings in Bioinformatics ◽

10.1093/bib/bby017 ◽

2018 ◽

Vol 20 (4) ◽

pp. 1542-1559 ◽

Cited By ~ 44

Author(s):

Damla Senol Cali ◽

Jeremie S Kim ◽

Saugata Ghose ◽

Can Alkan ◽

Onur Mutlu

Keyword(s):

Sequence Analysis ◽

Genome Assembly ◽

Sequence Data ◽

Error Rates ◽

Nanopore Sequencing ◽

Memory Usage ◽

Sequencing Technology ◽

Assembly Pipeline ◽

And Performance ◽

Polishing Tool

Abstract Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.

Download Full-text

A curated dataset of modern and ancient high-coverage shotgun human genomes

Scientific Data ◽

10.1038/s41597-021-00980-1 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Pierpaolo Maisano Delser ◽

Eppie R. Jones ◽

Anahit Hovhannisyan ◽

Lara Cassidy ◽

Ron Pinhasi ◽

...

Keyword(s):

Sequence Data ◽

Whole Genome ◽

Reference Dataset ◽

High Coverage ◽

Sample Distribution ◽

Human Samples ◽

Human Genomes ◽

Genome Wide ◽

Genome Wide Data ◽

Computationally Intensive

AbstractOver the last few years, genome-wide data for a large number of ancient human samples have been collected. Whilst datasets of captured SNPs have been collated, high coverage shotgun genomes (which are relatively few but allow certain types of analyses not possible with ascertained captured SNPs) have to be reprocessed by individual groups from raw reads. This task is computationally intensive. Here, we release a dataset including 35 whole-genome sequenced samples, previously published and distributed worldwide, together with the genetic pipeline used to process them. The dataset contains 72,041,355 sites called across 19 ancient and 16 modern individuals and includes sequence data from four previously published ancient samples which we sequenced to higher coverage (10–18x). Such a resource will allow researchers to analyse their new samples with the same genetic pipeline and directly compare them to the reference dataset without re-processing published samples. Moreover, this dataset can be easily expanded to increase the sample distribution both across time and space.

Download Full-text

High throughput crop genome genotyping by a combination of pool next generation sequencing and haplotype-based data processing

10.21203/rs.3.rs-415602/v1 ◽

2021 ◽

Author(s):

Michael Schneider ◽

Asis Shrestha ◽

Agim Ballvora ◽

Jens Leon

Keyword(s):

Next Generation Sequencing ◽

Allele Frequency ◽

Frequency Estimation ◽

Whole Genome ◽

Next Generation ◽

Conservation Genomics ◽

High Coverage ◽

Allele Frequency Estimation ◽

Low Coverage ◽

Generation Sequencing

Abstract BackgroundThe identification of environmentally specific alleles and the observation of evolutional processes is a goal of conservation genomics. By generational changes of allele frequencies in populations, questions regarding effective population size, gene flow, drift, and selection can be addressed. The observation of such effects often is a trade-off of costs and resolution, when a decent sample of genotypes should be genotyped for many loci. Pool genotyping approaches can derive a high resolution and precision in allele frequency estimation, when high coverage sequencing is utilized. Still, pool high coverage pool sequencing of big genomes comes along with high costs.ResultsHere we present a reliable method to estimate a barley population’s allele frequency at low coverage sequencing. Three hundred genotypes were sampled from a barley backcross population to estimate the entire population’s allele frequency. The allele frequency estimation accuracy and yield were compared for three next generation sequencing methods. To reveal accurate allele frequency estimates on a low coverage sequencing level, a haplotyping approach was performed. Low coverage allele frequency of positional connected single polymorphisms were aggregated to a single haplotype allele frequency, resulting in two to 271 times higher depth and increased precision. We compared different haplotyping tactics, showing that gene and chip marker-based haplotypes perform on par or better than simple contig haplotype windows. The comparison of multiple pool samples and the referencing against an individual sequencing approach revealed whole genome pool resequencing having the highest correlation to individual genotyping (up to 0.97), while transcriptomics and genotyping by sequencing indicated higher error rates and lower correlations.ConclusionUsing the proposed method allows to identify the allele frequency of populations with high accuracy at low cost. This is particularly interesting for conservation genomics in species with big genomes, like barley or wheat. Whole genome low coverage resequencing at 10x coverage can deliver a highly accurate estimation of the allele frequency, when a loci-based haplotyping approach is applied. Using annotated haplotypes allows to capitalize from biological background and statistical robustness.

Download Full-text

Analysis of Transposable Elements in the Genome of Asparagus officinalis from High Coverage Sequence Data

PLoS ONE ◽

10.1371/journal.pone.0097189 ◽

2014 ◽

Vol 9 (5) ◽

pp. e97189 ◽

Cited By ~ 11

Author(s):

Shu-Fen Li ◽

Wu-Jun Gao ◽

Xin-Peng Zhao ◽

Tian-Yu Dong ◽

Chuan-Liang Deng ◽

...

Keyword(s):

Transposable Elements ◽

Sequence Data ◽

Asparagus Officinalis ◽

High Coverage

Download Full-text

BAQALC: Blockchain Applied Lossless Efficient Transmission of DNA Sequencing Data for Next Generation Medical Informatics

Applied Sciences ◽

10.3390/app8091471 ◽

2018 ◽

Vol 8 (9) ◽

pp. 1471 ◽

Cited By ~ 7

Author(s):

Seo-Joon Lee ◽

Gyoun-Yon Cho ◽

Fumiaki Ikeno ◽

Tae-Ro Lee

Keyword(s):

Dna Sequencing ◽

Medical Informatics ◽

Compression Ratio ◽

Sequence Data ◽

Lossless Compression ◽

World Health ◽

Smart Devices ◽

Next Generation ◽

And Storage ◽

Efficient Transmission

Due to the development of high-throughput DNA sequencing technology, genome-sequencing costs have been significantly reduced, which has led to a number of revolutionary advances in the genetics industry. However, the problem is that compared to the decrease in time and cost needed for DNA sequencing, the management of such large volumes of data is still an issue. Therefore, this research proposes Blockchain Applied FASTQ and FASTA Lossless Compression (BAQALC), a lossless compression algorithm that allows for the efficient transmission and storage of the immense amounts of DNA sequence data that are being generated by Next Generation Sequencing (NGS). Also, security and reliability issues exist in public sequence databases. For methods, compression ratio comparisons were determined for genetic biomarkers corresponding to the five diseases with the highest mortality rates according to the World Health Organization. The results showed an average compression ratio of approximately 12 for all the genetic datasets used. BAQALC performed especially well for lung cancer genetic markers, with a compression ratio of 17.02. BAQALC performed not only comparatively higher than widely used compression algorithms, but also higher than algorithms described in previously published research. The proposed solution is envisioned to contribute to providing an efficient and secure transmission and storage platform for next-generation medical informatics based on smart devices for both researchers and healthcare users.

Download Full-text