PhredEM: A Phred-Score-Informed Genotype-Calling Approach for Next-Generation Sequencing Studies

bioRxiv ◽

10.1101/046136 ◽

2016 ◽

Author(s):

Peizhou Liao ◽

Glen A. Satten ◽

Yi-juan Hu

Keyword(s):

Logistic Regression ◽

Next Generation Sequencing ◽

Em Algorithm ◽

Error Rates ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Genotype Calling ◽

Sequencing Studies ◽

Generation Sequencing

Abstract. A fundamental challenge in analyzing next-generation sequencing data is to determine an individual’s genotype correctly as the accuracy of the inferred genotype is essential to downstream analyses. Some genotype callers, such as GATK and SAMtools, directly calculate the base-calling error rates from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic as a too-high threshold may lose data while a too-low threshold may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The algorithm, which we call PhredEM, uses the Expectation-Maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. We also develop a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that only loci estimated to be non-monomorphic require application of the EM algorithm. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project. The results demonstrate that PhredEM is an improved, robust and widely applicable genotype-calling approach for next-generation sequencing studies. The relevant software is freely available.
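The idea of computing error rates directly from phred scores, mentioned in the abstract above, can be illustrated with a minimal Python sketch. It uses the standard Phred definition (error probability = 10^(-Q/10)) and a generic symmetric-error likelihood at a biallelic site. It is not PhredEM's logistic regression model, nor the GATK or SAMtools implementation; the function names and the `(is_alt, phred)` read representation are hypothetical, and phred scores are assumed to be strictly positive.

```python
import math

def phred_to_error(q):
    """Standard Phred definition: error probability = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def genotype_log_likelihoods(reads):
    """reads: list of (is_alt, phred) pairs at one biallelic site (hypothetical format).

    Returns log-likelihoods for genotypes carrying 0, 1, and 2 alternate alleles.
    Assumes phred > 0 and that an error flips reference <-> alternate (a simplification).
    """
    logliks = []
    for g in (0, 1, 2):
        frac_alt = g / 2.0  # chance that a read samples an alternate chromosome
        total = 0.0
        for is_alt, q in reads:
            e = phred_to_error(q)
            # Probability of observing the alternate allele on this read
            p_alt = frac_alt * (1.0 - e) + (1.0 - frac_alt) * e
            total += math.log(p_alt if is_alt else 1.0 - p_alt)
        logliks.append(total)
    return logliks

def call_genotype(reads):
    """Return the genotype (0, 1, or 2 alternate alleles) with the highest likelihood."""
    ll = genotype_log_likelihoods(reads)
    return max(range(3), key=lambda g: ll[g])
```

Under these assumptions, `call_genotype` on ten Q30 reads split evenly between the two alleles favors the heterozygous genotype, because every other genotype must explain five reads as errors.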

A fully automated pipeline for quantitative genotype calling from next generation sequencing data in autopolyploids

BMC Bioinformatics ◽

10.1186/s12859-018-2433-6 ◽

2018 ◽

Vol 19 (1) ◽

Cited By ~ 15

Author(s):

Guilherme S. Pereira ◽

Antonio Augusto F. Garcia ◽

Gabriel R. A. Margarido

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Genotype Calling ◽

Automated Pipeline ◽

Generation Sequencing

SRPRISM (Single Read Paired Read Indel Substitution Minimizer): an efficient aligner for assemblies with explicit guarantees

GigaScience ◽

10.1093/gigascience/giaa023 ◽

2020 ◽

Vol 9 (4) ◽

Cited By ~ 2

Author(s):

Aleksandr Morgulis ◽

Richa Agarwala

Keyword(s):

Next Generation Sequencing ◽

Error Rates ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Mapping Tool ◽

Genome Assemblies ◽

Generation Sequencing ◽

Paired Read ◽

Benchmark Sets

Abstract. Background: Alignment of sequence reads generated by next-generation sequencing is an integral part of most pipelines analyzing next-generation sequencing data. A number of tools designed to quickly align a large volume of sequences are already available. However, most existing tools lack explicit guarantees about their output. They also do not support searching genome assemblies, such as the human genome assembly GRCh38, that include primary and alternate sequences and placement information for alternate sequences to primary sequences in the assembly. Findings: This paper describes SRPRISM (Single Read Paired Read Indel Substitution Minimizer), an alignment tool for aligning reads without splices. SRPRISM has features not available in most tools, such as (i) support for searching genome assemblies with alternate sequences, (ii) partial alignment of reads with a specified region of reads to be included in the alignment, (iii) choice of ranking schemes for alignments, and (iv) explicit criteria for search sensitivity. We compare the performance of SRPRISM to GEM, Kart, STAR, BWA-MEM, Bowtie2, Hobbes, and Yara using benchmark sets for paired and single reads of lengths 100 and 250 bp generated using DWGSIM. SRPRISM found the best results for most benchmark sets with error rate of up to ∼2.5% and GEM performed best for higher error rates. SRPRISM was also more sensitive than other tools even when sensitivity was reduced to improve run time performance. Conclusions: We present SRPRISM as a flexible read mapping tool that provides explicit guarantees on results.

Genotype calling from next-generation sequencing data using haplotype information of reads

Bioinformatics ◽

10.1093/bioinformatics/bts047 ◽

2012 ◽

Vol 28 (7) ◽

pp. 938-946 ◽

Cited By ~ 10

Author(s):

Degui Zhi ◽

Jihua Wu ◽

Nianjun Liu ◽

Kui Zhang

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Genotype Calling ◽

Haplotype Information ◽

Generation Sequencing

Comparing machine learning and logistic regression methods for predicting hypertension using a combination of gene expression and next-generation sequencing data

BMC Proceedings ◽

10.1186/s12919-016-0020-2 ◽

2016 ◽

Vol 10 (S7) ◽

Cited By ~ 10

Author(s):

Elizabeth Held ◽

Joshua Cape ◽

Nathan Tintle

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Logistic Regression ◽

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Regression Methods ◽

Generation Sequencing

SequencErr: measuring and suppressing sequencer errors in next-generation sequencing data

Genome Biology ◽

10.1186/s13059-020-02254-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Eric M. Davis ◽

Yu Sun ◽

Yanling Liu ◽

Pandurang Kolekar ◽

Ying Shao ◽

...

Keyword(s):

Next Generation Sequencing ◽

Error Rate ◽

Control Method ◽

Error Rates ◽

Computational Method ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Flow Cells ◽

Generation Sequencing

Abstract. Background: There is currently no method to precisely measure the errors that occur in the sequencing instrument/sequencer, which is critical for next-generation sequencing applications aimed at discovering the genetic makeup of heterogeneous cellular populations. Results: We propose a novel computational method, SequencErr, to address this challenge by measuring the base correspondence between overlapping regions in forward and reverse reads. An analysis of 3777 public datasets from 75 research institutions in 18 countries revealed the sequencer error rate to be ~10 per million (pm) and 1.4% of sequencers and 2.7% of flow cells have error rates >100 pm. At the flow cell level, error rates are elevated in the bottom surfaces and >90% of HiSeq and NovaSeq flow cells have at least one outlier error-prone tile. By sequencing a common DNA library on different sequencers, we demonstrate that sequencers with high error rates have reduced overall sequencing accuracy, and removal of outlier error-prone tiles improves sequencing accuracy. We demonstrate that SequencErr can reveal novel insights relative to the popular quality control method FastQC and achieve a 10-fold lower error rate than popular error correction methods including Lighter and Musket. Conclusions: Our study reveals novel insights into the nature of DNA sequencing errors incurred on DNA sequencers. Our method can be used to assess, calibrate, and monitor sequencer accuracy, and to computationally suppress sequencer errors in existing datasets.
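The core measurement described in the abstract above, base correspondence between the overlapping regions of forward and reverse reads, can be sketched in a few lines of Python. This is only an illustration of the counting idea and not the SequencErr implementation. The function name and input format are hypothetical, and the sketch assumes each pair has already been oriented to the same strand and trimmed to its overlap.

```python
def mismatches_per_million(overlaps):
    """Estimate a discordance rate, in mismatches per million compared bases.

    overlaps: iterable of (forward_bases, reverse_bases) strings (hypothetical format),
    assumed to be on the same strand and trimmed to the overlapping region.
    """
    compared = 0
    mismatched = 0
    for fwd, rev in overlaps:
        for a, b in zip(fwd, rev):
            if a == "N" or b == "N":
                continue  # skip positions where either read has no base call
            compared += 1
            if a != b:
                mismatched += 1
    if compared == 0:
        raise ValueError("no comparable bases")
    return 1e6 * mismatched / compared
```

A discordant position only shows that at least one of the two reads is wrong, so a count like this is a starting point for an error-rate estimate rather than the estimate itself.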

Genotype Calling and Haplotype Phasing from Next Generation Sequencing Data

Statistical Analysis of Next Generation Sequencing Data ◽

10.1007/978-3-319-07212-8_16 ◽

2014 ◽

pp. 315-333 ◽

Cited By ~ 1

Author(s):

Degui Zhi ◽

Kui Zhang

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Genotype Calling ◽

Haplotype Phasing ◽

Generation Sequencing

Faculty Opinions recommendation of VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718272765.793499663 ◽

2014 ◽

Author(s):

Gary Bader ◽

Mohamed Helmy

Keyword(s):

Next Generation Sequencing ◽

Network Analysis ◽

Next Generation Sequencing Data ◽

Cancer Genes ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Faculty Opinions recommendation of Bioinformatory-assisted analysis of next-generation sequencing data for precision medicine in pancreatic cancer.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727775566.793536095 ◽

2017 ◽

Author(s):

Steve Pereira

Keyword(s):

Pancreatic Cancer ◽

Next Generation Sequencing ◽

Precision Medicine ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Assisted Analysis ◽

Generation Sequencing

NGSremix: A software tool for estimating pairwise relatedness between admixed individuals from next-generation sequencing data

G3 Genes|Genomes|Genetics ◽

10.1093/g3journal/jkab174 ◽

2021 ◽

Author(s):

Anne Krogh Nøhr ◽

Kristian Hanghøj ◽

Genis Garcia Erill ◽

Zilong Li ◽

Ida Moltke ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genetic Research ◽

Likelihood Estimation ◽

Software Tool ◽

Estimation Methods ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Ngs Data ◽

Generation Sequencing

Abstract. Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below, we show that our method works well and clearly outperforms the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate, all of which perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C++ as multi-threaded software and is freely available on GitHub at https://github.com/KHanghoj/NGSremix.

recoup: flexible and versatile signal visualization from next generation sequencing

BMC Bioinformatics ◽

10.1186/s12859-020-03902-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Panagiotis Moulos

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Special Focus ◽

Next Generation ◽

Sequencing Data ◽

User Friendliness ◽

Computational Environment ◽

Level Data ◽

Data Signal ◽

Generation Sequencing

Abstract. Background: The relentless continuing emergence of new genomic sequencing protocols and the resulting generation of ever larger datasets continue to challenge the meaningful summarization and visualization of the underlying signal generated to answer important qualitative and quantitative biological questions. As a result, the need for novel software able to reliably produce quick, comprehensive, and easily repeatable genomic signal visualizations in a user-friendly manner is rapidly re-emerging. Results: recoup is a Bioconductor package for quick, flexible, versatile, and accurate visualization of genomic coverage profiles generated from Next Generation Sequencing data. Coupled with a database of precalculated genomic regions for multiple organisms, recoup offers processing mechanisms for quick, efficient, and multi-level data interrogation with minimal effort, while at the same time creating publication-quality visualizations. Special focus is given to plot reusability, reproducibility, and real-time exploration and formatting options, operations rarely supported in similar visualization tools in a profound way. recoup was assessed using several qualitative user metrics and found to balance the tradeoff between important package features, including speed, visualization quality, overall friendliness, and the reusability of the results with minimal additional calculations. Conclusion: While some existing solutions for the comprehensive visualization of NGS data signal offer satisfying results, they are often compromised regarding issues such as effortless tracking of processing and preparation steps under a common computational environment, visualization quality, and user friendliness. recoup is a unique package presenting a balanced tradeoff for a combination of assessment criteria while remaining fast and friendly.
