SiNPle: Fast and Sensitive Variant Calling for Deep Sequencing Data

Genes ◽  
2019 ◽  
Vol 10 (8) ◽  
pp. 561 ◽  
Author(s):  
Luca Ferretti ◽  
Chandana Tennakoon ◽  
Adrian Silesian ◽  
Graham Freimanis ◽  
Paolo Ribeca

Current high-throughput sequencing technologies can generate sequence data and provide information on the genetic composition of samples at very high coverage. Deep sequencing approaches enable the detection of rare variants in heterogeneous samples, such as viral quasi-species, but also have the undesired effect of amplifying sequencing errors and artefacts. Distinguishing real variants from such noise is not straightforward. Variant callers that can handle pooled samples can struggle at extremely high read depths, while at lower depths sensitivity is often sacrificed for specificity. In this paper, we propose SiNPle (Simplified Inference of Novel Polymorphisms from Large coveragE), a fast and effective variant-calling tool. SiNPle is based on a simplified Bayesian approach to compute the posterior probability that a variant is not generated by sequencing errors or PCR artefacts. The Bayesian model takes into consideration individual base qualities as well as their distribution, the baseline error rates during both the sequencing and the PCR stage, the prior distribution of variant frequencies and their strandedness. Our approach leads to an approximate but extremely fast computation of posterior probabilities even for very high coverage data, since the expression for the posterior distribution is a simple analytical formula in terms of summary statistics for the variants appearing at each site in the genome. These statistics can be used to filter putative SNPs and indels according to the required level of sensitivity. We tested SiNPle on several simulated and real-life viral datasets to show that it is faster and more sensitive than existing methods. The source code for SiNPle is freely available to download and compile, or as a Conda/Bioconda package.
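The simplified Bayesian idea described above can be illustrated with a toy calculation: combine the Phred-scaled error probabilities of the reads supporting a candidate allele with a prior on variant frequency. This sketch is only illustrative and is not SiNPle's actual formula, which also accounts for PCR-stage errors, quality-score distributions and strandedness.

```python
import math

def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def variant_posterior(alt_quals, prior=1e-3):
    """Toy posterior probability that an alternate allele is real rather
    than sequencing error, given the Phred qualities of the supporting
    reads and a flat per-site prior on variants. Illustrative only;
    NOT SiNPle's actual model."""
    # Log-likelihood of the supporting bases if every one is an error
    log_err = sum(math.log(phred_to_prob(q)) for q in alt_quals)
    # Log-likelihood if the variant is real (each supporting base correct)
    log_real = sum(math.log(1 - phred_to_prob(q)) for q in alt_quals)
    num = prior * math.exp(log_real)
    den = num + (1 - prior) * math.exp(log_err)
    return num / den

# Three supporting reads at Q30 overwhelm a low prior; a single Q10
# read does not.
post = variant_posterior([30, 30, 30])
```

Because the posterior is a closed-form function of simple per-site summary statistics, it stays cheap to evaluate even at very high coverage, which is the point the abstract makes.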

2015 ◽  
Author(s):  
Ivan Sovic ◽  
Mile Sikic ◽  
Andreas Wilm ◽  
Shannon Nicole Fenlon ◽  
Swaine Chen ◽  
...  

Exploiting the power of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. We present the first nanopore read mapper (GraphMap) that uses a read-funneling paradigm to robustly handle variable error rates and fast graph traversal to align long reads with speed and very high precision (>95%). Evaluation on MinION sequencing datasets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by at least 15-80%. GraphMap alignments are the first to demonstrate consensus calling with <1 error in 100,000 bases, variant calling on the human genome with 76% improvement in sensitivity over the next best mapper (BWA-MEM), precise detection of structural variants from 100bp to 4kbp in length and species- and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.


2021 ◽  
Vol 4 ◽  
Author(s):  
Saskia Oosterbroek ◽  
Karlijn Doorenspleet ◽  
Reindert Nijland ◽  
Lara Jansen

Sequencing of long amplicons is one of the major benefits of Nanopore technologies, as it allows for reads much longer than those produced by Illumina platforms. One of the major challenges in analysing these long Nanopore reads is their relatively high error rate. Sequencing errors are generally corrected by consensus generation and polishing. This remains a challenge for mixed samples, such as metabarcoding environmental DNA, bulk DNA, mixed amplicon PCRs and contaminated samples, because the sequence data must be clustered before consensus generation. To this end, we developed Decona (https://github.com/Saskia-Oosterbroek/decona), a command-line tool that creates consensus sequences from mixed (metabarcoding) samples using a single command. Decona uses the CD-HIT algorithm to cluster reads after demultiplexing (qcat) and filtering (NanoFilt). The sequences in each cluster are subsequently aligned (Minimap2), consensus sequences are generated (Racon) and finally polished (Medaka). Variant calling of the clusters (Medaka) is optional. With the integration of the BLAST+ application, Decona not only generates consensus sequences but can also produce BLAST output if desired. The program can be run on a laptop computer, making it suitable for use under field conditions. Amplicon data ranging from 300 to 7500 nucleotides was successfully processed by Decona, creating consensus sequences reaching over 99.9% read identity. This included fish datasets (environmental DNA from filtered water) from a curated aquarium, vertebrate datasets that were contaminated with human sequences, and the separation of sponge sequences from their countless microbial symbionts. Decona considerably simplifies and speeds up post-sequencing processing, providing consensus sequences and BLAST output through a single command. Classifying consensus sequences instead of raw sequences improves classification accuracy and drastically decreases the number of sequences that need to be classified. Overall, it is a user-friendly option for researchers with limited knowledge of script-based data processing.
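The cluster-then-consensus idea behind Decona can be sketched in a few lines: a toy greedy clusterer (standing in for CD-HIT) followed by per-position majority voting (standing in for Racon/Medaka polishing). Real tools handle indels, length variation and alignment far more robustly; the reads and threshold below are made up for illustration.

```python
from collections import Counter

def identity(a, b):
    """Fraction of matching positions between two reads (no indel handling)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(reads, threshold=0.8):
    """Greedy clustering by identity to each cluster's first read --
    a toy stand-in for CD-HIT."""
    clusters = []
    for r in reads:
        for c in clusters:
            if identity(r, c[0]) >= threshold:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def majority_consensus(cluster):
    """Per-position majority vote -- a toy stand-in for consensus
    generation and polishing."""
    length = max(len(r) for r in cluster)
    return "".join(
        Counter(r[i] for r in cluster if i < len(r)).most_common(1)[0][0]
        for i in range(length)
    )

# Two species mixed in one sample: clustering first keeps their
# consensus sequences from being blended together.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "TTTTGGGG", "TTTTGGGC"]
clusters = greedy_cluster(reads)
consensi = [majority_consensus(c) for c in clusters]
```

This also shows why classifying consensus sequences is cheaper than classifying raw reads: five noisy reads collapse to two sequences to BLAST.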


2014 ◽  
Vol 24 (11) ◽  
pp. 1734-1739 ◽  
Author(s):  
Jeffrey D. Wall ◽  
Ling Fung Tang ◽  
Brandon Zerbe ◽  
Mark N. Kvale ◽  
Pui-Yan Kwok ◽  
...  

2018 ◽  
Author(s):  
Roger Ros-Freixedes ◽  
Battagin Mara ◽  
Martin Johnsson ◽  
Gregor Gorjanc ◽  
Alan J Mileham ◽  
...  

Abstract Background Inherent sources of error and bias that affect the quality of sequence data include index hopping and bias towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data, because low-coverage data provides scant information and standard tools for processing sequence data were designed for high-coverage data. With the proliferation of cost-effective low-coverage sequencing there is a need to understand the impact of these errors and biases on the resulting genotype calls. Results We used a dataset of 26 pigs sequenced both at 2x with multiplexing and at 30x without multiplexing to show that index hopping and bias towards the reference allele due to alignment had little impact on genotype calls. However, pruning of alternative haplotypes supported by a number of reads below a predefined threshold, a default and desirable step for removing potential sequencing errors in high-coverage data, introduced an unexpected bias towards the reference allele when applied to low-coverage data. This bias reduced best-guess genotype concordance of low-coverage sequence data by 19.0 absolute percentage points. Conclusions We propose a simple pipeline to correct this bias and we recommend that users of low-coverage sequencing be wary of unexpected biases produced by tools designed for high-coverage sequencing.
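The pruning bias described above is easy to see with a little binomial arithmetic. The sketch below (not the paper's pipeline; depths and threshold are illustrative) computes the probability that a heterozygous site retains its alternate allele when haplotypes supported by fewer than a minimum number of reads are discarded.

```python
from math import comb

def alt_retained_prob(depth, min_support, alt_freq=0.5):
    """Probability that a heterozygous site keeps its alternate allele
    under a minimum-support pruning threshold, assuming reads sample
    the two alleles binomially. Illustrative only."""
    return sum(
        comb(depth, k) * alt_freq**k * (1 - alt_freq)**(depth - k)
        for k in range(min_support, depth + 1)
    )

# A 2-read threshold loses almost nothing at 30x...
high = alt_retained_prob(30, 2)
# ...but at 2x, three quarters of heterozygous sites lose the
# alternate allele, skewing calls towards the reference.
low = alt_retained_prob(2, 2)
```

This is the core of why a filter that is harmless, even desirable, on high-coverage data becomes a systematic reference bias at low coverage.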


Author(s):  
Bilgenur Baloğlu ◽  
Zhewei Chen ◽  
Vasco Elbrecht ◽  
Thomas Braukmann ◽  
Shanna MacDonald ◽  
...  

Abstract Metabarcoding has become a common approach to the rapid identification of the species composition in a mixed sample. The majority of studies use established short-read high-throughput sequencing platforms. The Oxford Nanopore MinION™, a portable sequencing platform, represents a low-cost alternative allowing researchers to generate sequence data in the field. However, a major drawback is the high raw read error rate, which can range from 10% to 22%. To test whether the MinION™ represents a viable alternative to other sequencing platforms, we used rolling circle amplification (RCA) to generate full-length consensus DNA barcodes (658 bp of cytochrome c oxidase I, COI) for a bulk mock sample of 50 aquatic invertebrate species. By applying two different laboratory protocols, we generated two MinION™ runs that were used to build consensus sequences. We also developed a novel Python pipeline, ASHURE, for processing, consensus building, clustering, and taxonomic assignment of the resulting reads. We were able to show that it is possible to reduce error rates to a median accuracy of up to 99.3% for long RCA fragments (>45 barcodes). Our pipeline successfully identified all 50 species in the mock community and exhibited sensitivity and accuracy comparable to MiSeq. The use of RCA was integral for increasing consensus accuracy, but it was also the most time-consuming step during the laboratory workflow, and most RCA reads were skewed towards a shorter read length range, with a median RCA fragment length of up to 1262 bp. Our study demonstrates that Nanopore sequencing can be used for metabarcoding, but we recommend the exploration of other isothermal amplification procedures to improve consensus length.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Gwenna Breton ◽  
Anna C. V. Johansson ◽  
Per Sjödin ◽  
Carina M. Schlebusch ◽  
Mattias Jakobsson

Abstract Background Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. There is an abundance of sequencing technologies and bioinformatic tools, and the number of available genomes is increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its “Best Practices” bioinformatic pipelines. However, studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high-coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges arising from human diversity and stratification. Results We surveyed 29 studies using high-throughput sequencing data, and compared their strategies for data pre-processing and variant calling. We found that processing of data is very variable across studies and that the GATK “Best Practices” are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and in a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals. We compared the pipelines in terms of count of called variants and overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated with the number of variants called.
Conclusions We conclude that applying the GATK “Best Practices” pipeline, including their recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for a coverage of > 30X if identifying most variants is important, and working with large sample sizes at the variant calling stage, also for underrepresented individuals and populations.
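Comparing callsets by variant count and overlap, as the study above does, boils down to set operations over variant records. A minimal sketch, with sites represented as hypothetical (chrom, pos, ref, alt) tuples, analogous to the numbers a tool like bcftools isec reports:

```python
def callset_overlap(a, b):
    """Summarise two callsets as (shared, only-in-a, only-in-b) counts.
    Variants are hashable records, e.g. (chrom, pos, ref, alt) tuples;
    a simplified illustration, not the paper's comparison code."""
    a, b = set(a), set(b)
    return len(a & b), len(a - b), len(b - a)

# Made-up callsets from two pipeline variants
p1 = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
p2 = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")}
shared, only_p1, only_p2 = callset_overlap(p1, p2)
```

Filtering both callsets before the comparison, as in the study, shrinks the pipeline-specific sets and makes the shared core dominate.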


Author(s):  
Shatha Alosaimi ◽  
Noëlle van Biljon ◽  
Denis Awany ◽  
Prisca K Thami ◽  
Joel Defo ◽  
...  

Abstract Current variant calling (VC) approaches have been designed to leverage populations of long-range haplotypes and were benchmarked using populations of European descent, whereas most genetic diversity is found in non-European populations, such as those from Africa. When working with these genetically diverse populations, VC tools may produce false positive and false negative results, which may lead to misleading conclusions about the prioritisation of mutations and the clinical relevance and actionability of genes. The most prominent question is which tool or pipeline has a high rate of sensitivity and precision when analysing African data at either low or high sequence coverage, given the high genetic diversity and heterogeneity of these data. Here, a total of 100 synthetic whole genome sequencing (WGS) samples, mimicking the genetic profiles of African and European subjects at different coverage levels (high/low), were generated to assess the performance of nine different VC tools on these contrasting datasets. The performance of these tools was assessed in terms of false positive and false negative call rates by comparing the simulated golden variants to the variants identified by each VC tool. Combining our results on sensitivity and positive predictive value (PPV), VarDict [PPV = 0.999 and Matthews correlation coefficient (MCC) = 0.832] and BCFtools (PPV = 0.999 and MCC = 0.813) perform best when using African population data at both high and low coverage. Overall, current VC tools produce high false positive and false negative rates when analysing African compared with European data. This highlights the need for development of VC approaches with high sensitivity and precision tailored for populations characterized by high genetic variation and low linkage disequilibrium.
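The ranking metrics used in benchmarks like the one above (sensitivity, PPV, MCC) all follow from a confusion matrix of calls against a truth set. A sketch with made-up counts, not the study's actual numbers:

```python
import math

def vc_metrics(tp, fp, fn, tn):
    """Sensitivity, positive predictive value, and Matthews correlation
    coefficient from a variant-calling confusion matrix (true/false
    positives and negatives against a golden truth set)."""
    sens = tp / (tp + fn)                      # fraction of true variants called
    ppv = tp / (tp + fp)                       # fraction of calls that are true
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, ppv, mcc

# Illustrative counts: near-perfect precision but many missed variants,
# the pattern the study reports for diverse, low-LD genomes.
sens, ppv, mcc = vc_metrics(tp=900, fp=1, fn=300, tn=10000)
```

MCC is the useful single number here because, unlike PPV alone, it also punishes the false negatives that inflate with genetic diversity.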


Author(s):  
Xiaoyu He ◽  
Shanyu Chen ◽  
Ruilin Li ◽  
Xinyin Han ◽  
Zhipeng He ◽  
...  

Abstract Next-generation sequencing (NGS) technology has revolutionised human cancer research, particularly via the detection of genomic variants, thanks to its ultra-high-throughput sequencing and increasing affordability. However, the inundation of rich cancer genomics data has resulted in significant challenges in its exploration and translation into biological insights. One of the difficulties in cancer genome sequencing is software selection. Currently, multiple tools are widely used to process NGS data in four stages: raw sequence data pre-processing and quality control (QC), sequence alignment, variant calling, and annotation and visualisation. However, the differences between these NGS tools, including their installation, merits, drawbacks and applications, have not been fully appreciated. Therefore, a systematic review of the functionality and performance of NGS tools is required to provide cancer researchers with guidance on software and strategy selection. Another challenge is the multidimensional QC of sequencing data, because QC can not only report varied sequence data characteristics but also reveal deviations in diverse features, and is essential for a meaningful and successful study. However, monitoring of QC metrics in specific steps, including alignment and variant calling, is neglected in certain pipelines, such as the ‘Best Practices Workflows’ in GATK. In this review, we investigated the most widely used software for the fundamental analysis and QC of cancer genome sequencing data and provided instructions for selecting the most appropriate software and pipelines to ensure precise and efficient conclusions. We further discussed the prospects and new research directions for cancer genomics.


2021 ◽  
Vol 12 ◽  
Author(s):  
Zachary Stephens ◽  
Dragana Milosevic ◽  
Benjamin Kipp ◽  
Stefan Grebe ◽  
Ravishankar K. Iyer ◽  
...  

Long read sequencing technologies have the potential to accurately detect and phase variation in genomic regions that are difficult to fully characterize with conventional short read methods. These difficult-to-sequence regions include several clinically relevant genes with highly homologous pseudogenes, many of which are prone to gene conversions or other types of complex structural rearrangements. We present PB-Motif, a new method for identifying rearrangements between two highly homologous genomic regions using PacBio long reads. PB-Motif leverages clustering and filtering techniques to efficiently report rearrangements in the presence of sequencing errors and other systematic artifacts. Supporting reads for each high-confidence rearrangement can then be used for copy number estimation and phased variant calling. First, we demonstrate PB-Motif's accuracy with simulated sequence rearrangements of PMS2 and its pseudogene PMS2CL, using simulated reads sweeping over a range of sequencing error rates. We then apply PB-Motif to 26 clinical samples, characterizing CYP21A2 and its pseudogene CYP21A1P as part of a diagnostic assay for congenital adrenal hyperplasia. We successfully identify damaging variation and patient carrier status concordant with clinical diagnoses obtained from multiplex ligation-dependent probe amplification (MLPA) and Sanger sequencing. The source code is available at: github.com/zstephens/pb-motif.

