Reducing INDEL calling errors in whole-genome and exome sequencing data

Mapping Intimacies ◽

10.1101/006148 ◽

2014 ◽

Cited By ~ 2

Author(s):

Han Fang ◽

Yiyang Wu ◽

Giuseppe Narzisi ◽

Jason A. O'Rawe ◽

Laura T. Jimenez Barrón ◽

...

Keyword(s):

Exome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Published Data ◽

Whole Genome ◽

Sequencing Data ◽

High Quality ◽

Indel Detection ◽

Validation Experiment ◽

Large Indels

Background: INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, INDEL variant calling remains error-prone, with errors driven by library preparation, sequencing biases, and algorithm artifacts. Methods: We characterized whole-genome sequencing (WGS), whole-exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate the sources of INDEL errors. We also developed a classification scheme based on coverage and composition to rank high- and low-quality INDEL calls. We performed a large-scale validation experiment on 600 loci and found that high-quality INDELs have a substantially lower error rate than low-quality INDELs (7% vs. 51%). Results: Simulation and experimental data show that assembly-based callers are significantly more sensitive and robust for detecting large INDELs (>5 bp) than alignment-based callers, consistent with published data. The concordance of INDEL detection between WGS and WES is low (52%), and WGS data uniquely identifies 10.8-fold more high-quality INDELs. The validation rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs (85% vs. 54%), and WES misses many large INDELs. In addition, the concordance for INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard WGS data uniquely identifies 6.3-fold more low-quality INDELs. Furthermore, accurate detection of heterozygous INDELs with Scalpel requires 1.2-fold higher coverage than detection of homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality INDEL calls, and they are highly enriched in the WES data. Conclusions: Overall, we show that the accuracy of INDEL detection with WGS is much greater than with WES, even in the targeted region. We calculated that 60X WGS depth of coverage from the HiSeq platform is needed to recover 95% of INDELs detected by Scalpel. While this is higher than current sequencing practice, the deeper coverage may save total project costs because of the greater accuracy and sensitivity. Finally, we investigate sources of INDEL errors (e.g., capture deficiency, PCR amplification, homopolymers) with various data that will serve as a guideline to effectively reduce INDEL errors in genome sequencing.
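
The concordance figures quoted in this abstract compare two INDEL call sets. The abstract does not state the exact formula, so the following is only a minimal sketch that assumes concordance means shared calls divided by all distinct calls (intersection over union). The `Indel` type, the `concordance` and `is_large` helpers, and the example coordinates are hypothetical and are not taken from the paper or from Scalpel.

```python
from typing import NamedTuple, Set


class Indel(NamedTuple):
    """Hypothetical minimal INDEL record; real pipelines would parse VCF."""
    chrom: str
    pos: int   # 1-based position of the anchor base
    ref: str
    alt: str


def concordance(calls_a: Set[Indel], calls_b: Set[Indel]) -> float:
    """Assumed definition: shared calls divided by all distinct calls."""
    union = calls_a | calls_b
    if not union:
        return 0.0
    return len(calls_a & calls_b) / len(union)


def is_large(indel: Indel, min_len: int = 6) -> bool:
    """True when the length change exceeds 5 bp (the abstract's '>5 bp')."""
    return abs(len(indel.ref) - len(indel.alt)) >= min_len


# Made-up example call sets, for illustration only.
wgs = {
    Indel("chr1", 100, "A", "AT"),
    Indel("chr1", 200, "ACGTACG", "A"),
    Indel("chr2", 50, "G", "GA"),
}
wes = {
    Indel("chr1", 100, "A", "AT"),
    Indel("chr3", 10, "TC", "T"),
}

shared_fraction = concordance(wgs, wes)       # 1 shared call out of 4 distinct
wgs_only_large = {i for i in wgs - wes if is_large(i)}
```

Exact matching on position and alleles is a simplification: the same INDEL can be represented at different positions within a repeat, so a real comparison would left-normalize calls first.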

Plasmids or no plasmids? A comparison between the Agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Estimating sequencing error rates using families

BioData Mining ◽

10.1186/s13040-021-00259-6 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelley Paskov ◽

Jae-Yoon Jung ◽

Brianna Chrisman ◽

Nate T. Stockham ◽

Peter Washington ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Error Rates ◽

Sequencing Error ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Platform ◽

Whole Exome

Background: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates because they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method's versatility by estimating sequencing error rates for whole-genome sequencing, whole-exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) sequencing error rates between samples in the same dataset can vary by over an order of magnitude; 2) variant-calling performance decreases substantially in low-complexity regions of the genome; 3) variant-calling performance in whole-exome sequencing data decreases with distance from the nearest target region; 4) variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood; and 5) whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. Conclusion: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
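
To make the notion of a Mendelian error concrete, here is a minimal sketch that flags trio genotypes a child could not have inherited from its parents. It assumes biallelic autosomal sites with genotypes coded as alternate-allele counts (0, 1, 2); the names `TRANSMITTABLE`, `is_mendelian_error`, and `mendelian_error_rate` are hypothetical. It only counts observable errors and does not reproduce the paper's estimation of per-sample precision and recall.

```python
from itertools import product
from typing import Iterable, Tuple

# Alternate-allele counts a parent can pass on, keyed by parental genotype.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_error(mother: int, father: int, child: int) -> bool:
    """True if the child genotype cannot arise from one allele per parent."""
    possible = {
        m + f
        for m, f in product(TRANSMITTABLE[mother], TRANSMITTABLE[father])
    }
    return child not in possible


def mendelian_error_rate(trios: Iterable[Tuple[int, int, int]]) -> float:
    """Fraction of (mother, father, child) sites that are Mendelian errors."""
    trios = list(trios)
    if not trios:
        return 0.0
    return sum(is_mendelian_error(*t) for t in trios) / len(trios)


# Made-up genotypes at four sites, for illustration only.
sites = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (2, 2, 1)]
rate = mendelian_error_rate(sites)
```

Many genotyping errors remain Mendelian-consistent (for example, a heterozygous child miscalled as homozygous when both parents are heterozygous), which is why the abstract describes only a fraction of sequencing errors as observable this way.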

Fast and inexpensive whole genome sequencing library preparation from intact yeast cells

10.1101/2020.09.03.280990 ◽

2020 ◽

Author(s):

Sibylle C Vonesch ◽

Shengdi Li ◽

Chelsea Szu Tu ◽

Bianca P Hennig ◽

Nikolay Dobrev ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genomic DNA ◽

Large Scale ◽

Massively Parallel Sequencing ◽

Yeast Cells ◽

Whole Genome ◽

High Quality ◽

Rapid Preparation ◽

Yeast Cultures

Abstract: Through the increase in the capacity of sequencing machines, massively parallel sequencing of thousands of samples in a single run is now possible. With the improved throughput and resulting drop in the price of sequencing, the cost and time for preparation of sequencing libraries have become the major bottleneck in large-scale experiments. Methods using a hyperactive variant of the Tn5 transposase efficiently generate libraries starting from cDNA or genomic DNA in a few hours and are highly scalable. For genome sequencing, however, the time and effort spent on genomic DNA isolation limit the practicability of sequencing large numbers of samples. Here, we describe a highly scalable method for preparing high-quality whole-genome sequencing libraries directly from yeast cultures in less than three hours at 34 cents per sample. We skip the rate-limiting step of genomic DNA extraction by directly tagmenting yeast spheroplasts and add a nucleosome release step prior to enrichment PCR to improve the evenness of genomic coverage. Resulting libraries do not show any GC bias and are comparable in quality to libraries processed from genomic DNA with a commercially available Tn5-based kit. We use our protocol to investigate CRISPR/Cas9 on- and off-target edits and reliably detect edited variants and shared polymorphisms between strains. Our protocol enables rapid preparation of unbiased, high-quality, sequencing-ready indexed libraries for hundreds of yeast strains in a single day at a low price. By adjusting individual steps of our workflow, we expect that our protocol can be adapted to other organisms.

Les gènes des enfants de Tchernobyl [The genes of the children of Chernobyl]

médecine/sciences ◽

10.1051/medsci/2021107 ◽

2021 ◽

Vol 37 (8-9) ◽

pp. 802-805

Author(s):

Bertrand Jordan

Keyword(s):

Radiation Dose ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome ◽

Transgenerational Effects ◽

High Quality ◽

Chernobyl Disaster ◽

Large Scale Study ◽

New Mutations

Transgenerational effects have long been expected in children from parents exposed to radiation from atomic bombs in Japan in 1945 or from the Chernobyl disaster in 1986. These effects have in fact proven hard to detect. A new large-scale study based on high-quality whole genome sequencing of father/mother/child trios in which the parental radiation dose is known now demonstrates that the rate of new mutations (50/70 per generation) is not detectably increased when comparing irradiated and non-irradiated parents. This solid data shows conclusively that transgenerational effects of irradiation from the Chernobyl disaster are absent or undetectable.

Evaluation of Single-Molecule Sequencing Technologies for Structural Variant Detection in Two Swedish Human Genomes

Genes ◽

10.3390/genes11121444 ◽

2020 ◽

Vol 11 (12) ◽

pp. 1444

Author(s):

Nazeefa Fatima ◽

Anna Petri ◽

Ulf Gyllensten ◽

Lars Feuk ◽

Adam Ameur

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Single Molecule ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Structural Variations ◽

Single Molecule Sequencing ◽

Human Samples

Long-read single-molecule sequencing is increasingly used in human genomics research, as it allows accurate detection of large-scale DNA rearrangements such as structural variations (SVs) at high resolution. However, few studies have evaluated the performance of different single-molecule sequencing platforms for SV detection in human samples. Here we performed Oxford Nanopore Technologies (ONT) whole-genome sequencing of two Swedish human samples (average 32× coverage) and compared the results to previously generated Pacific Biosciences (PacBio) data for the same individuals (average 66× coverage). Our analysis inferred an average of 17k and 23k SVs from the ONT and PacBio data, respectively, with a majority of them overlapping with an available multi-platform SV dataset. When comparing the SV calls in the two Swedish individuals, we find a higher concordance between ONT and PacBio SVs detected in the same individual than between SVs detected by the same technology in different individuals. Downsampling of PacBio reads, performed to obtain similar coverage levels for all datasets, resulted in 17k SVs per individual and improved overlap with the ONT SVs. Our results suggest that ONT and PacBio have similar performance for SV detection in human whole-genome sequencing data, and that both technologies are feasible for population-scale studies.

The MOBSTER R package for tumour subclonal deconvolution from bulk DNA whole-genome sequencing data

BMC Bioinformatics ◽

10.1186/s12859-020-03863-1 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Giulio Caravagna ◽

Guido Sanguinetti ◽

Trevor A. Graham ◽

Andrea Sottoriva

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

R Package ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Evolutionary Forces ◽

Evolutionary Trajectories ◽

Cancer Tissues

Background: The large-scale availability of whole-genome sequencing profiles from bulk DNA sequencing of cancer tissues is fueling the application of evolutionary theory to cancer. From a bulk biopsy, subclonal deconvolution methods are used to determine the composition of cancer subpopulations in the biopsy sample, a fundamental step in determining clonal expansions and their evolutionary trajectories. Results: In recent work, we developed a new model-based approach to carry out subclonal deconvolution from the site frequency spectrum of somatic mutations. This new method integrates, for the first time, an explicit model for neutral evolutionary forces that participate in clonal expansions; in that work we also showed that our method improves substantially over competing data-driven methods. In this Software paper we present mobster, an open-source R package built around our new deconvolution approach, which provides several functions to plot data and fit models, assess their confidence, and compute further evolutionary analyses that relate to subclonal deconvolution. Conclusions: We present the mobster package for tumour subclonal deconvolution from bulk sequencing, the first approach to integrate Machine Learning and Population Genetics that can explicitly model co-existing neutral and positive selection in cancer. We showcase the analysis of two datasets, one simulated and one from a breast cancer patient, and give an overview of all package functionalities.

Indel detection from Whole Genome Sequencing data and association with lipid metabolism in pigs

PLoS ONE ◽

10.1371/journal.pone.0218862 ◽

2019 ◽

Vol 14 (6) ◽

pp. e0218862 ◽

Cited By ~ 1

Author(s):

Daniel Crespo-Piazuelo ◽

Lourdes Criado-Mesas ◽

Manuel Revilla ◽

Anna Castelló ◽

Ana I. Fernández ◽

...

Keyword(s):

Lipid Metabolism ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Indel Detection

Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing

BMC Genomics ◽

10.1186/1471-2164-14-425 ◽

2013 ◽

Vol 14 (1) ◽

pp. 425 ◽

Cited By ~ 32

Author(s):

Shanrong Zhao ◽

Kurt Prenger ◽

Lance Smith ◽

Thomas Messina ◽

Hongtao Fan ◽

...

Keyword(s):

Cloud Computing ◽

Data Analysis ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Data Analysis

Fast and inexpensive whole-genome sequencing library preparation from intact yeast cells

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkaa009 ◽

2020 ◽

Vol 11 (1) ◽

pp. 1-12

Author(s):

Sibylle C Vonesch ◽

Shengdi Li ◽

Chelsea Szu Tu ◽

Bianca P Hennig ◽

Nikolay Dobrev ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genomic DNA ◽

Large Scale ◽

Massively Parallel Sequencing ◽

Yeast Cells ◽

Whole Genome ◽

High Quality ◽

Rapid Preparation ◽

Genomic DNA Isolation

Abstract: Through the increase in the capacity of sequencing machines, massively parallel sequencing of thousands of samples in a single run is now possible. With the improved throughput and resulting drop in the price of sequencing, the cost and time for preparation of sequencing libraries have become the major bottleneck in large-scale experiments. Methods using a hyperactive variant of the Tn5 transposase efficiently generate libraries starting from cDNA or genomic DNA in a few hours and are highly scalable. For genome sequencing, however, the time and effort spent on genomic DNA isolation limit the practicability of sequencing large numbers of samples. Here, we describe a highly scalable method for preparing high-quality whole-genome sequencing libraries directly from Saccharomyces cerevisiae cultures in less than 3 h at 34 cents per sample. We skip the rate-limiting step of genomic DNA extraction by directly tagmenting lysed yeast spheroplasts and add a nucleosome release step prior to enrichment PCR to improve the evenness of genomic coverage. Resulting libraries do not show any GC bias and are comparable in quality to libraries processed from genomic DNA with a commercially available Tn5-based kit. We use our protocol to investigate CRISPR/Cas9 on- and off-target edits and reliably detect edited variants and shared polymorphisms between strains. Our protocol enables rapid preparation of unbiased, high-quality, sequencing-ready indexed libraries for hundreds of yeast strains in a single day at a low price. By adjusting individual steps of our workflow, we expect that our protocol can be adapted to other organisms.

Detection of structural mosaicism from targeted and whole-genome sequencing data

10.1101/062620 ◽

2016 ◽

Author(s):

Daniel A. King ◽

Alejandro Sifrim ◽

Tomas W. Fitzgerald ◽

Raheleh Rahbari ◽

Emma Hobson ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Developmental Disorders ◽

Large Fraction ◽

Clinical Diagnostics ◽

Next Generation Sequencing Data ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data

Abstract: Structural mosaic abnormalities are large post-zygotic mutations present in a subset of cells and have been implicated in developmental disorders and cancer. Such mutations have conventionally been assessed in clinical diagnostics using cytogenetic or microarray testing. Modern disease studies rely heavily on exome sequencing, yet an adequate method for detecting structural mosaicism from targeted sequencing data is lacking. Here, we present a method, called MrMosaic, to detect structural mosaic abnormalities using deviations in allele fraction and read coverage from next-generation sequencing data. Whole-exome sequencing (WES) and whole-genome sequencing (WGS) simulations were used to calculate detection performance across a range of mosaic event sizes, types, clonalities, and sequencing depths. The tool was applied to 4,911 patients with undiagnosed developmental disorders, and 11 events in 9 patients were detected. In 8 of 11 cases, mosaicism was observed in saliva but not blood, suggesting that assaying blood alone would miss a large fraction, possibly more than 50%, of mosaic diagnostic chromosomal rearrangements.
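
The two signals named in this abstract, deviation in allele fraction and deviation in read coverage, can be illustrated with a small sketch. This is not MrMosaic's algorithm or API; the function names, the 0.5 reference fraction for heterozygous sites, and the example read counts are assumptions chosen only to show what each signal measures.

```python
import math
from statistics import mean
from typing import Sequence


def allele_fraction_deviation(
    alt_reads: Sequence[int], total_reads: Sequence[int]
) -> float:
    """Mean absolute deviation of the alt-allele fraction from 0.5.

    Assumes every site passed in is germline heterozygous, so a
    non-mosaic segment is expected to sit near 0.5.
    """
    fractions = [a / t for a, t in zip(alt_reads, total_reads) if t > 0]
    if not fractions:
        return 0.0
    return mean(abs(f - 0.5) for f in fractions)


def log2_coverage_ratio(observed_depth: float, expected_depth: float) -> float:
    """Log2 ratio of observed to expected depth; 0 means no copy change."""
    return math.log2(observed_depth / expected_depth)


# Made-up counts for one candidate segment, for illustration only.
af_dev = allele_fraction_deviation([30, 70], [100, 100])
cov_ratio = log2_coverage_ratio(60.0, 30.0)
```

Read together, a shift in allele fraction with little change in coverage would point toward a copy-neutral event, while shifts in both would point toward a gain or loss; a real caller would also need segmentation and statistical testing, which this sketch omits.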
