Needlestack: an ultra-sensitive variant caller for multi-sample next generation sequencing data

DOI: 10.1101/639377; 2019; Cited By ~ 2

Author(s): Tiffany M. Delhomme, Patrice H. Avogbe, Aurélie Gabriel, Nicolas Alcala, Noemie Leblay, ...

Keyword(s): Next Generation Sequencing; Error Rate; Somatic Mutations; Next Generation Sequencing Data; Sequencing Error; Next Generation; Sequencing Error Rate; Main Challenge; A Genome; Generation Sequencing

Abstract The emergence of next-generation sequencing (NGS) has revolutionized the way of reaching a genome sequence, with the promise of potentially providing a comprehensive characterization of DNA variations. Nevertheless, detecting somatic mutations is still a difficult problem, in particular when trying to identify low abundance mutations such as subclonal mutations, tumour-derived alterations in body fluids or somatic mutations from histologically normal tissue. The main challenge is to precisely distinguish between sequencing artefacts and true mutations, particularly when the latter are so rare that they reach abundance levels similar to those of artefacts. Here, we present needlestack, a highly sensitive variant caller, which learns the level of systematic sequencing errors directly from the data in order to call mutations accurately. Needlestack is based on the idea that the sequencing error rate can be dynamically estimated by analyzing multiple samples together. We show that the sequencing error rate varies across alterations, illustrating the need to estimate it precisely. We evaluate the performance of needlestack for various types of variations, and we show that needlestack is robust across positions and outperforms existing state-of-the-art methods for low abundance mutations. Needlestack, along with its source code, is freely available on the GitHub platform: https://github.com/IARCbioinfo/needlestack.

Needlestack: an ultra-sensitive variant caller for multi-sample next generation sequencing data

NAR Genomics and Bioinformatics; DOI: 10.1093/nargab/lqaa021; 2020; Vol 2 (2); Cited By ~ 1

Author(s): Tiffany M Delhomme, Patrice H Avogbe, Aurélie A G Gabriel, Nicolas Alcala, Noemie Leblay, ...

Keyword(s): Next Generation Sequencing; Error Rate; Somatic Mutations; Next Generation Sequencing Data; Sequencing Error; Next Generation; Sequencing Error Rate; Main Challenge; A Genome; Generation Sequencing

Abstract The emergence of next-generation sequencing (NGS) has revolutionized the way of reaching a genome sequence, with the promise of potentially providing a comprehensive characterization of DNA variations. Nevertheless, detecting somatic mutations is still a difficult problem, in particular when trying to identify low abundance mutations, such as subclonal mutations, tumour-derived alterations in body fluids or somatic mutations from histologically normal tissue. The main challenge is to precisely distinguish between sequencing artefacts and true mutations, particularly when the latter are so rare that they reach abundance levels similar to those of artefacts. Here, we present needlestack, a highly sensitive variant caller, which learns the level of systematic sequencing errors directly from the data in order to call mutations accurately. Needlestack is based on the idea that the sequencing error rate can be dynamically estimated by analysing multiple samples together. We show that the sequencing error rate varies across alterations, illustrating the need to estimate it precisely. We evaluate the performance of needlestack for various types of variations, and we show that needlestack is robust across positions and outperforms existing state-of-the-art methods for low abundance mutations. Needlestack, along with its source code, is freely available on the GitHub platform: https://github.com/IARCbioinfo/needlestack.
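
The idea of estimating the sequencing error rate from multiple samples analysed together can be illustrated with a minimal Python sketch. This is not the needlestack implementation: the median-based error estimate, the binomial tail test, and the names `flag_outlier_samples`, `alpha` and `min_rate` are assumptions chosen only for illustration. For one position and one candidate alteration, the sketch treats the typical alternate-read fraction across samples as the error rate and flags samples whose alternate-read count is very unlikely under that rate.

```python
import math


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p); assumes 0 < p < 1."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        # Work in log space so large depths do not overflow.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def flag_outlier_samples(alt_counts, depths, alpha=1e-6, min_rate=1e-6):
    """Estimate a shared error rate and flag samples that exceed it.

    alt_counts[i] and depths[i] are the alternate-read count and the
    (positive) read depth of sample i at a single position.
    The median is an illustrative robust estimator, and min_rate is a
    hypothetical floor that keeps the rate strictly positive.
    """
    rates = [a / d for a, d in zip(alt_counts, depths)]
    error_rate = max(median(rates), min_rate)
    flagged = [i for i, (a, d) in enumerate(zip(alt_counts, depths))
               if binomial_upper_tail(a, d, error_rate) < alpha]
    return error_rate, flagged


# Eight samples at depth 1000; only sample 6 carries many alternate reads.
error_rate, flagged = flag_outlier_samples([1, 2, 1, 0, 2, 1, 30, 1], [1000] * 8)
print(error_rate, flagged)
```

Because the error rate is estimated separately for each position and alteration, a site with a high background of artefacts needs more supporting reads before a sample is flagged than a site with a low background.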

Lacer: accurate base quality score recalibration for improving variant calling from next-generation sequencing data in any organism

DOI: 10.1101/130732; 2017

Author(s): Jade C.S. Chung, Swaine L. Chen

Keyword(s): Next Generation Sequencing; Variant Calling; Quality Score; Identification Accuracy; Next Generation Sequencing Data; Sequencing Error; Next Generation; Sequencing Data; Base Quality Score; Generation Sequencing

Abstract Next-generation sequencing data is accompanied by quality scores that quantify sequencing error. Inaccuracies in these quality scores propagate through all subsequent analyses; thus base quality score recalibration is a standard step in many next-generation sequencing workflows, resulting in improved variant calls. Current base quality score recalibration algorithms rely on the assumption that sequencing errors are already known; for human resequencing data, relatively complete variant databases facilitate this. However, because existing databases are still incomplete, recalibration is still inaccurate; and most organisms do not have variant databases, exacerbating inaccuracy for non-human data. To overcome these logical and practical problems, we introduce Lacer, which recalibrates base quality scores without assuming knowledge of correct and incorrect bases and without requiring knowledge of common variants. Lacer is the first logically sound, fully general, and truly accurate base recalibrator. Lacer enhances variant identification accuracy for resequencing data from humans as well as other organisms (which are not accessible to current recalibrators), simultaneously improving and extending the benefits of base quality score recalibration to nearly all ongoing sequencing projects. Lacer is available at: https://github.com/swainechen/lacer.
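
For readers unfamiliar with what a base quality score quantifies, the standard Phred scale relates a score Q to an error probability p by Q = -10 log10(p). The short Python sketch below shows the conversion and how a reported score can be compared with an empirically observed error rate. It does not reproduce Lacer's algorithm; the function names and the mismatch counts are hypothetical, and the comparison assumes the true errors are known, which is precisely the assumption the abstract says Lacer avoids.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the error probability it encodes."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


def empirical_quality(mismatches, total):
    """Phred score implied by an observed error count among `total` bases."""
    if mismatches == 0:
        return math.inf  # no observed errors: the score is unbounded
    return error_prob_to_phred(mismatches / total)


# Hypothetical bin of bases that the sequencer reported as Q30 (p = 0.001),
# of which 10 in 1000 turned out to be wrong: the empirical quality is Q20.
reported = 30
observed = empirical_quality(10, 1000)
print(phred_to_error_prob(reported), observed, observed - reported)
```

A negative difference between the empirical and the reported score, as in this made-up bin, means the reported qualities are overconfident, which is the kind of miscalibration a recalibration step is meant to correct.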

JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data

Bioinformatics; DOI: 10.1093/bioinformatics/bts053; 2012; Vol 28 (7); pp. 907-913; Cited By ~ 126

Author(s): Andrew Roth, Jiarui Ding, Ryan Morin, Anamaria Crisan, Gavin Ha, ...

Keyword(s): Next Generation Sequencing; Probabilistic Model; Somatic Mutations; Next Generation Sequencing Data; Next Generation; Sequencing Data; Accurate Detection; Generation Sequencing

SequencErr: measuring and suppressing sequencer errors in next-generation sequencing data

Genome Biology; DOI: 10.1186/s13059-020-02254-2; 2021; Vol 22 (1)

Author(s): Eric M. Davis, Yu Sun, Yanling Liu, Pandurang Kolekar, Ying Shao, ...

Keyword(s): Next Generation Sequencing; Error Rate; Control Method; Error Rates; Computational Method; Next Generation Sequencing Data; Next Generation; Sequencing Data; Flow Cells; Generation Sequencing

Abstract Background: There is currently no method to precisely measure the errors that occur in the sequencing instrument/sequencer, which is critical for next-generation sequencing applications aimed at discovering the genetic makeup of heterogeneous cellular populations. Results: We propose a novel computational method, SequencErr, to address this challenge by measuring the base correspondence between overlapping regions in forward and reverse reads. An analysis of 3777 public datasets from 75 research institutions in 18 countries revealed the sequencer error rate to be ~10 per million (pm), and 1.4% of sequencers and 2.7% of flow cells have error rates >100 pm. At the flow cell level, error rates are elevated in the bottom surfaces, and >90% of HiSeq and NovaSeq flow cells have at least one outlier error-prone tile. By sequencing a common DNA library on different sequencers, we demonstrate that sequencers with high error rates have reduced overall sequencing accuracy, and removal of outlier error-prone tiles improves sequencing accuracy. We demonstrate that SequencErr can reveal novel insights relative to the popular quality control method FastQC and achieve a 10-fold lower error rate than popular error correction methods including Lighter and Musket. Conclusions: Our study reveals novel insights into the nature of DNA sequencing errors incurred on DNA sequencers. Our method can be used to assess, calibrate, and monitor sequencer accuracy, and to computationally suppress sequencer errors in existing datasets.
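
The notion of base correspondence between overlapping regions of forward and reverse reads can be sketched as follows. This is only an illustration under simplifying assumptions, not the SequencErr implementation: it assumes the overlap length of a read pair is already known, ignores indels and base qualities, and uses hypothetical function names. When both reads of a pair cover the same stretch of the fragment, the two base calls there should agree, so a disagreement indicates that at least one of the two calls is wrong.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-case ACGTN)."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_overlap_mismatches(forward_read, reverse_read, overlap):
    """Count discordant bases in the region covered by both reads.

    The last `overlap` bases of the forward read are compared with the
    first `overlap` bases of the reverse-complemented reverse read.
    Positions where either read has an N are skipped.
    Returns (mismatches, compared_positions).
    """
    fwd_part = forward_read[len(forward_read) - overlap:]
    rev_part = reverse_complement(reverse_read)[:overlap]
    compared = 0
    mismatches = 0
    for a, b in zip(fwd_part, rev_part):
        if a == "N" or b == "N":
            continue
        compared += 1
        mismatches += a != b
    return mismatches, compared


def errors_per_million(mismatches, compared):
    """Express a discordance count as a rate per million compared bases."""
    return 1e6 * mismatches / compared if compared else 0.0


# Toy 12-base fragment ACGTTGCAGGCT read with 8-base reads from both ends,
# so the two reads overlap by 4 bases.
print(count_overlap_mismatches("ACGTTGCA", "AGCCTGCA", 4))  # concordant pair
print(count_overlap_mismatches("ACGTTGCA", "AGCCTGCC", 4))  # one discordant base
```

Summing the two counts over all read pairs from a sequencer, flow cell, or tile gives a discordance rate per unit, which is the kind of quantity that can then be compared across instruments.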

Design and Statistical Analysis of Pooled Next Generation Sequencing for Rare Variants

Journal of Probability and Statistics; DOI: 10.1155/2012/524724; 2012; Vol 2012; pp. 1-19; Cited By ~ 3

Author(s): Tao Wang, Chang-Yun Lin, Yuanhao Zhang, Ruofeng Wen, Kenny Ye

Keyword(s): Next Generation Sequencing; Error Rate; Statistical Power; Rare Variants; Good Control; Testing Procedure; Sequencing Error; Type I; Next Generation; Generation Sequencing

Next generation sequencing (NGS) is a revolutionary technology for biomedical research. One highly cost-efficient application of NGS is to detect disease association based on pooled DNA samples. However, several key issues need to be addressed for pooled NGS. One of them is the high sequencing error rate and its high variability across genomic positions and experiment runs, which, if not well considered in the experimental design and analysis, could lead to either inflated false positive rates or loss in statistical power. Another important issue is how to test association of a group of rare variants. To address the first issue, we proposed a new blocked pooling design in which multiple pools of DNA samples from cases and controls are sequenced together on the same NGS functional units. To address the second issue, we proposed a testing procedure that does not require individual genotypes but instead takes advantage of multiple DNA pools. Through a simulation study, we demonstrated that our approach provides good control of the type I error rate and yields satisfactory power compared to the test based on individual genotypes. Our results also provide guidelines for designing an efficient pooled.

Estimating DNA polymorphism from next generation sequencing data with high error rate by dual sequencing applications

BMC Genomics; DOI: 10.1186/1471-2164-14-535; 2013; Vol 14 (1); pp. 535; Cited By ~ 9

Author(s): Ziwen He, Xinnian Li, Shaoping Ling, Yun-Xin Fu, Eric Hungate, ...

Keyword(s): Next Generation Sequencing; Error Rate; DNA Polymorphism; Next Generation Sequencing Data; Next Generation; High Error Rate; Sequencing Data; Generation Sequencing

Faculty Opinions recommendation of VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature; DOI: 10.3410/f.718272765.793499663; 2014

Author(s): Gary Bader, Mohamed Helmy

Keyword(s): Next Generation Sequencing; Network Analysis; Next Generation Sequencing Data; Cancer Genes; Next Generation; Sequencing Data; Generation Sequencing

Faculty Opinions recommendation of Bioinformatory-assisted analysis of next-generation sequencing data for precision medicine in pancreatic cancer.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature; DOI: 10.3410/f.727775566.793536095; 2017

Author(s): Steve Pereira

Keyword(s): Pancreatic Cancer; Next Generation Sequencing; Precision Medicine; Next Generation Sequencing Data; Next Generation; Sequencing Data; Assisted Analysis; Generation Sequencing

Myelodysplastic syndromes with no somatic mutations detected by next‐generation sequencing display similar features to myelodysplastic syndromes with detectable mutations

American Journal of Hematology; DOI: 10.1002/ajh.26325; 2021

Author(s): Sa A. Wang, Chi Young Ok, Annette S. Kim, Fabienne Lucas, Elizabeth A. Morgan, ...

Keyword(s): Next Generation Sequencing; Myelodysplastic Syndromes; Somatic Mutations; Next Generation; Generation Sequencing

NGSremix: A software tool for estimating pairwise relatedness between admixed individuals from next-generation sequencing data

G3 Genes|Genomes|Genetics; DOI: 10.1093/g3journal/jkab174; 2021

Author(s): Anne Krogh Nøhr, Kristian Hanghøj, Genis Garcia Erill, Zilong Li, Ida Moltke, ...

Keyword(s): Next Generation Sequencing; Genetic Research; Likelihood Estimation; Software Tool; Estimation Methods; Next Generation Sequencing Data; Next Generation; Sequencing Data; NGS Data; Generation Sequencing

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub: https://github.com/KHanghoj/NGSremix.
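
To see why genotypes called from low-depth data carry high uncertainty, and what a genotype likelihood expresses, consider the minimal Python sketch below. It uses a simple symmetric error model for a biallelic site, which is a common textbook formulation and an assumption here; it is not taken from NGSremix, and the function name and default error rate are hypothetical. For each possible genotype (0, 1 or 2 copies of the alternate allele), it computes the probability of the observed reference and alternate read counts and normalises the three values to sum to one.

```python
def genotype_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Normalised likelihoods of genotypes with 0, 1 and 2 alternate alleles.

    Assumes independent reads at a biallelic site and a symmetric per-base
    error rate; both are simplifying assumptions made for illustration.
    """
    likelihoods = []
    for alt_alleles in (0, 1, 2):
        frac = alt_alleles / 2
        # Probability that a single read shows the alternate allele.
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        likelihoods.append((1 - p_alt) ** ref_count * p_alt ** alt_count)
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# Depth 3 (two reference reads, one alternate read): the heterozygote is the
# most likely genotype, but the homozygous reference keeps noticeable weight.
print(genotype_likelihoods(2, 1))

# Depth 30 with the same allele balance: the heterozygote is near certain.
print(genotype_likelihoods(20, 10))
```

Methods that work on genotype likelihoods carry all three values forward instead of committing to the single most likely genotype, so the residual uncertainty at low depth is propagated into the downstream estimate.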
