Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines

Mapping Intimacies ◽

10.1101/023754 ◽

2015 ◽

Cited By ~ 63

Author(s):

John G. Cleary ◽

Ross Braithwaite ◽

Kurt Gaastra ◽

Brian S Hilbush ◽

Stuart Inglis ◽

...

Keyword(s):

Gold Standard ◽

High Throughput Sequencing ◽

Variant Calling ◽

False Positives ◽

Sequencing Data ◽

Variant Call ◽

Performance Benchmarking ◽

High Throughput Sequencing Data ◽

Accurate Performance ◽

Standard Output

To evaluate and compare the performance of variant calling methods and their confidence scores, comparisons between a test call set and a “gold standard” need to be carried out. Unfortunately, these comparisons are not straightforward with the current Variant Call Files (VCF), which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, resulting in misleading results. A variant caller is inherently a classification method designed to score putative variants with confidence scores that could permit controlling the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) are efficient metrics to evaluate a test call set versus a gold standard. However, in the case of VCF data this also requires a special accounting to deal with discrepant representations. We developed a novel algorithm for comparing variant call sets that deals with complex call representation discrepancies and, through a dynamic programming method, minimizes false positives and negatives globally across the entire call sets for accurate performance evaluation of VCFs.
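The representation problem described in this abstract can be illustrated with a small sketch. It is not the authors' dynamic programming algorithm; it only shows why comparing VCF records field by field can mislabel equivalent calls. The function name `apply_variants`, the toy reference sequence, and the 0-based positions are assumptions made for illustration.

```python
def apply_variants(ref, variants):
    """Replay (pos, ref_allele, alt_allele) edits onto ref (pos is 0-based)."""
    pieces = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError("REF allele does not match the reference")
        pieces.append(ref[cursor:pos])   # unchanged sequence before the variant
        pieces.append(alt_allele)        # substituted allele
        cursor = pos + len(ref_allele)
    pieces.append(ref[cursor:])
    return "".join(pieces)


# The same 2-bp deletion in a CA repeat, written at two different positions.
reference = "GCACACAT"
left_aligned = [(0, "GCA", "G")]
right_shifted = [(4, "ACA", "A")]

# One MNP versus the same change written as two adjacent SNVs.
mnp_reference = "ACGT"
as_mnp = [(1, "CG", "TA")]
as_snvs = [(1, "C", "T"), (2, "G", "A")]

same_deletion = apply_variants(reference, left_aligned) == apply_variants(reference, right_shifted)
same_mnp = apply_variants(mnp_reference, as_mnp) == apply_variants(mnp_reference, as_snvs)
```

In both pairs the records differ while the resulting haplotype is identical, so a naive record match would count one true call as both a false positive and a false negative.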


Generalizable characteristics of false-positive bacterial variant calls

Microbial Genomics ◽

10.1099/mgen.0.000615 ◽

2021 ◽

Vol 7 (8) ◽

Author(s):

Stephen J. Bush

Keyword(s):

False Positive ◽

Variant Calling ◽

Critical Issue ◽

False Positives ◽

True Positive ◽

Sequencing Data ◽

Variant Call ◽

Post Process ◽

Disproportionate Number ◽

Illumina Sequencing Data

Minimizing false positives is a critical issue when variant calling as no method is without error. It is common practice to post-process a variant-call file (VCF) using hard filter criteria intended to discriminate true-positive (TP) from false-positive (FP) calls. These are applied on the simple principle that certain characteristics are disproportionately represented among the set of FP calls and that a user-chosen threshold can maximize the number detected. To provide guidance on this issue, this study empirically characterized all false SNP and indel calls made using real Illumina sequencing data from six disparate species and 166 variant-calling pipelines (the combination of 14 read aligners with up to 13 different variant callers, plus four ‘all-in-one’ pipelines). We did not seek to optimize filter thresholds but instead to draw attention to those filters of greatest efficacy and the pipelines to which they may most usefully be applied. In this respect, this study acts as a coda to our previous benchmarking evaluation of bacterial variant callers, and provides general recommendations for effective practice. The results suggest that, of the pipelines analysed in this study, the most straightforward way of minimizing false positives would simply be to use Snippy. We also find that a disproportionate number of false calls, irrespective of the variant-calling pipeline, are located in the vicinity of indels, and highlight this as an issue for future development.
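A hard filter of the kind discussed in this abstract can be sketched as a simple threshold test. The field names (`qual`, `depth`) and the threshold values are hypothetical placeholders, not values recommended by the study, which explicitly did not seek to optimize thresholds.

```python
def passes_hard_filters(call, min_qual=30.0, min_depth=10):
    """Return True if a call meets every (hypothetical) hard-filter threshold."""
    return call["qual"] >= min_qual and call["depth"] >= min_depth


# Toy call records; in practice these values would be parsed from a VCF.
calls = [
    {"id": "snp1", "qual": 55.0, "depth": 40},
    {"id": "snp2", "qual": 12.0, "depth": 35},   # fails the quality threshold
    {"id": "indel1", "qual": 60.0, "depth": 4},  # fails the depth threshold
]

kept = [call["id"] for call in calls if passes_hard_filters(call)]
```

Each threshold trades false positives against lost true positives, which is why the choice of filter and cut-off depends on the pipeline being post-processed.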


iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data

BMC Systems Biology ◽

10.1186/1752-0509-7-s6-s8 ◽

2013 ◽

Vol 7 (Suppl 6) ◽

pp. S8 ◽

Cited By ~ 21

Author(s):

Takahiro Mimori ◽

Naoki Nariai ◽

Kaname Kojima ◽

Mamoru Takahashi ◽

Akira Ono ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Variant Calling ◽

Sequencing Data ◽

Structural Variant ◽

High Throughput Sequencing Data


Bazam: A rapid method for read extraction and realignment of high throughput sequencing data

10.1101/433003 ◽

2018 ◽

Cited By ~ 1

Author(s):

Simon P Sadedin ◽

Alicia Oshlack

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Variant Calling ◽

Genomic Data ◽

Selective Extraction ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Time Required ◽

Genomic Regions ◽

Reference Genomes

Abstract Background As costs of high throughput sequencing have fallen, we are seeing vast quantities of short read genomic data being generated. Often, the data is exchanged and stored as aligned reads, which provides high compression and convenient access for many analyses. However, aligned data becomes outdated as new reference genomes and alignment methods become available. Moreover, some applications cannot utilise pre-aligned reads at all, necessitating conversion back to raw format (FASTQ) before they can be used. In both cases, the process of extraction and realignment is expensive and time consuming. Findings We describe Bazam, a tool that efficiently extracts the original paired FASTQ from reads stored in aligned form (BAM or CRAM format). Bazam extracts reads in a format that directly allows realignment with popular aligners with high concurrency. Through eliminating steps and increasing the accessible concurrency, Bazam facilitates up to a 90% reduction in the time required for realignment compared to standard methods. Bazam can support selective extraction of read pairs from focused genomic regions, further increasing efficiency for targeted analyses. Bazam is additionally suitable as a base for other applications that require efficient paired read information, such as quality control, structural variant calling and alignment comparison. Conclusions Bazam offers significant improvements for users needing to realign genomic data.


Twelve years of SAMtools and BCFtools

GigaScience ◽

10.1093/gigascience/giab008 ◽

2021 ◽

Vol 10 (2) ◽

Cited By ~ 2

Author(s):

Petr Danecek ◽

James K Bonfield ◽

Jennifer Liddle ◽

John Marshall ◽

Valeriu Ohan ◽

...

Keyword(s):

High Throughput Sequencing ◽

Source Code ◽

Variant Calling ◽

File Format ◽

Sequencing Data ◽

Software Projects ◽

Effect Analysis ◽

Commercial Use ◽

High Throughput Sequencing Data ◽

File Format Conversion

Abstract Background SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods. Findings The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines. Conclusion Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed >1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.


Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data

10.1101/001958 ◽

2014 ◽

Author(s):

John G Cleary ◽

Ross Braithwaite ◽

Kurt Gaastra ◽

Brian S Hilbush ◽

Stuart Inglis ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Cost Optimization ◽

SNP Array ◽

Variant Calling ◽

Ground Truth ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Mendelian Segregation ◽

High Throughput Sequencing Data

The analysis of whole-genome or exome sequencing data from trios and pedigrees has been successfully applied to the identification of disease-causing mutations. However, most methods used to identify and genotype genetic variants from next-generation sequencing data ignore the relationships between samples, resulting in significant Mendelian errors, false positives and negatives. Here we present a Bayesian network framework that jointly analyses data from all members of a pedigree simultaneously using Mendelian segregation priors, yet providing the ability to detect de novo mutations in offspring, and is scalable to large pedigrees. We evaluated our method by simulations and analysis of WGS data from a 17 individual, 3-generation CEPH pedigree sequenced to 50X average depth. Compared to singleton calling, our family caller produced more high quality variants and eliminated spurious calls as judged by common quality metrics such as Ti/Tv, Het/Hom ratios, and dbSNP/SNP array data concordance. We developed a ground truth dataset to further evaluate our calls by identifying recombination cross-overs in the pedigree and testing variants for consistency with the inferred phasing, and we show that our method significantly outperforms singleton and population variant calling in pedigrees. We identify all previously validated de novo mutations in NA12878, concurrent with a 7X precision improvement. Our results show that our method is scalable to large genomics and human disease studies and allows cost optimization by rational sequencing capacity distribution.
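The notion of a Mendelian error mentioned in this abstract can be made concrete with a minimal trio check. This is a deliberately simple sketch, not the Bayesian network framework the authors describe: it treats genotypes as certain, whereas the described method weighs them with segregation priors. The function name and the encoding of genotypes as pairs of allele indices are assumptions.

```python
def is_mendelian_consistent(child, mother, father):
    """Check whether a diploid child genotype can be formed by taking
    one allele from the mother and one allele from the father."""
    first, second = child
    return (first in mother and second in father) or (
        first in father and second in mother
    )


# Genotypes as pairs of allele indices (0 = reference, 1 = alternate).
consistent = is_mendelian_consistent(child=(0, 1), mother=(0, 0), father=(1, 1))
violation = is_mendelian_consistent(child=(1, 1), mother=(0, 0), father=(0, 1))
```

A violation flagged this way may be a genotyping error or a genuine de novo mutation, which is the ambiguity a joint pedigree model is meant to resolve.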


Accuracy and reproducibility of somatic point mutation calling in clinical-type targeted sequencing data

BMC Medical Genomics ◽

10.1186/s12920-020-00803-z ◽

2020 ◽

Vol 13 (1) ◽

Author(s):

Ali Karimnezhad ◽

Gareth A. Palidwor ◽

Kednapa Thavorn ◽

David J. Stewart ◽

Pearl A. Campbell ◽

...

Keyword(s):

False Positive ◽

High Throughput Sequencing ◽

Variant Calling ◽

High Sensitivity ◽

Ground Truth ◽

False Positives ◽

Targeted Sequencing ◽

Sequencing Data ◽

Sequencing Platform ◽

Ion Torrent PGM

Abstract Background Treating cancer depends in part on identifying the mutations driving each patient’s disease. Many clinical laboratories are adopting high-throughput sequencing for assaying patients’ tumours, applying targeted panels to formalin-fixed paraffin-embedded tumour tissues to detect clinically-relevant mutations. While there have been some benchmarking and best practices studies of this scenario, much variant calling work focuses on whole-genome or whole-exome studies, with fresh or fresh-frozen tissue. Thus, definitive guidance on best choices for sequencing platforms, sequencing strategies, and variant calling for clinical variant detection is still being developed. Methods Because ground truth for clinical specimens is rarely known, we used the well-characterized Coriell cell lines GM12878 and GM12877 to generate data. We prepared samples to mimic as closely as possible clinical biopsies, including formalin fixation and paraffin embedding. We evaluated two well-known targeted sequencing panels, Illumina’s TruSight 170 hybrid-capture panel and the amplification-based Oncomine Focus panel. Sequencing was performed on an Illumina NextSeq500 and an Ion Torrent PGM respectively. We performed multiple replicates of each assay, to test reproducibility. Finally, we applied four different freely-available somatic single-nucleotide variant (SNV) callers to the data, along with the vendor-recommended callers for each sequencing platform. Results We did not observe major differences in variant calling success within the regions that each panel covers, but there were substantial differences between callers. All had high sensitivity for true SNVs, but numerous and non-overlapping false positives. Overriding certain default parameters to make them consistent between callers substantially reduced discrepancies, but still resulted in high false positive rates. Intersecting results from multiple replicates or from different variant callers eliminated most false positives, while maintaining sensitivity. Conclusions Reproducibility and accuracy of targeted clinical sequencing results depend less on sequencing platform and panel than on variability between replicates and downstream bioinformatics. Differences in variant callers’ default parameters are a greater influence on algorithm disagreement than other differences between the algorithms. Contrary to typical clinical practice, we recommend employing multiple variant calling pipelines and/or analyzing replicate samples, as this greatly decreases false positive calls.
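The intersection strategy recommended in this abstract can be sketched as a support count over call sets. The function name `consensus_calls`, the `min_support` parameter, and the variant keys of the form (chromosome, position, ref, alt) are illustrative assumptions, not the study's implementation, and the coordinates below are toy values.

```python
from collections import Counter


def consensus_calls(call_sets, min_support=None):
    """Keep variants reported by at least min_support call sets.

    With min_support=None, a variant must appear in every call set,
    which is a strict intersection.
    """
    if min_support is None:
        min_support = len(call_sets)
    support = Counter(key for calls in call_sets for key in set(calls))
    return {key for key, count in support.items() if count >= min_support}


# Toy call sets from three hypothetical callers (or replicates).
caller_a = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"), ("chr2", 40, "C", "T")}
caller_b = {("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"), ("chr3", 7, "T", "G")}
caller_c = {("chr1", 100, "A", "T"), ("chr2", 40, "C", "T")}

strict = consensus_calls([caller_a, caller_b, caller_c])
majority = consensus_calls([caller_a, caller_b, caller_c], min_support=2)
```

Matching on exact keys assumes all callers represent a variant identically; in complex regions the keys would need normalizing first.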


SomVarIUS: somatic variant identification from unpaired tissue samples

Bioinformatics ◽

10.1093/bioinformatics/btv685 ◽

2015 ◽

Vol 32 (6) ◽

pp. 808-813 ◽

Cited By ~ 18

Author(s):

Kyle S. Smith ◽

Vinod K. Yadav ◽

Shanshan Pei ◽

Daniel A. Pollyea ◽

Craig T. Jordan ◽

...

Keyword(s):

High Throughput Sequencing ◽

Variant Calling ◽

Computational Method ◽

Supplementary Information ◽

Sequencing Data ◽

Somatic Variant ◽

Tissue Samples ◽

Normal Tissues ◽

High Throughput Sequencing Data ◽

Oncogenic Mutations

Abstract Motivation: Somatic variant calling typically requires paired tumor-normal tissue samples. Yet, paired normal tissues are not always available in clinical settings or for archival samples. Results: We present SomVarIUS, a computational method for detecting somatic variants using high throughput sequencing data from unpaired tissue samples. We evaluate the performance of the method using genomic data from synthetic and real tumor samples. SomVarIUS identifies somatic variants in exome-seq data of ∼150× coverage with at least 67.7% precision and 64.6% recall rates, when compared with paired-tissue somatic variant calls in real tumor samples. We demonstrate the utility of SomVarIUS by identifying somatic mutations in formalin-fixed samples, and tracking clonal dynamics of oncogenic mutations in targeted deep sequencing data from pre- and post-treatment leukemia samples. Availability and implementation: SomVarIUS is written in Python 2.7 and available at http://www.sjdlab.org/resources/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
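The precision and recall figures quoted in this abstract follow the standard definitions, which can be sketched as below. The function name, the variant keys, and the toy call sets are assumptions for illustration; the sketch uses modern Python rather than the Python 2.7 of SomVarIUS and does not reproduce its evaluation.

```python
def precision_recall(test_calls, truth_calls):
    """Compute precision and recall of a test call set against a truth set."""
    true_positives = len(test_calls & truth_calls)
    precision = true_positives / len(test_calls) if test_calls else 0.0
    recall = true_positives / len(truth_calls) if truth_calls else 0.0
    return precision, recall


# Toy example: paired-tissue calls play the role of the truth set.
truth = {"var_a", "var_b", "var_c", "var_d"}
unpaired = {"var_a", "var_b", "var_e"}

precision, recall = precision_recall(unpaired, truth)
```

Here two of the three unpaired calls are in the truth set, and two of the four truth variants are recovered.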


Faculty Opinions recommendation of Coalescent Inference Using Serially Sampled, High-Throughput Sequencing Data from Intrahost HIV Infection.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.726132071.793531014 ◽

2017 ◽

Author(s):

Sarah Rowland-Jones ◽

Sophie Andrews

Keyword(s):

HIV Infection ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Throughput Sequencing Data


BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution

Bioinformatics ◽

10.1093/bioinformatics/btu010 ◽

2014 ◽

Vol 30 (9) ◽

pp. 1214-1219 ◽

Cited By ~ 6

Author(s):

C. Ye ◽

C. Hsiao ◽

H. Corrada Bravo

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Blind Deconvolution ◽

Sequencing Data ◽

Base Calling ◽

High Throughput Sequencing Data


Improvement, identification, and target prediction for miRNAs in the porcine genome by using massive, public high-throughput sequencing data

Journal of Animal Science ◽

10.1093/jas/skab018 ◽

2021 ◽

Vol 99 (2) ◽

Author(s):

Yuhua Fu ◽

Pengyu Fan ◽

Lu Wang ◽

Ziqiang Shu ◽

Shilin Zhu ◽

...

Keyword(s):

High Throughput Sequencing ◽

Target Genes ◽

Target Prediction ◽

Large Data ◽

Sequencing Data ◽

Regulate Gene Expression ◽

High Throughput Sequencing Data ◽

Annotation Information ◽

Public Data ◽

Broad Variety

Abstract Despite the broad variety of available microRNA (miRNA) research tools and methods, their application to the identification, annotation, and target prediction of miRNAs in nonmodel organisms is still limited. In this study, we collected nearly all public sRNA-seq data to improve the annotation for known miRNAs and identify novel miRNAs that have not been annotated in pigs (Sus scrofa). We newly annotated 210 mature sequences in known miRNAs and found that 43 of the known miRNA precursors were problematic due to redundant/missing annotations or incorrect sequences. We also predicted 811 novel miRNAs with high confidence, which was twice the current number of known miRNAs for pigs in miRBase. In addition, we proposed a correlation-based strategy to predict target genes for miRNAs by using a large amount of sRNA-seq and RNA-seq data. We found that the correlation-based strategy provided additional evidence of expression compared with traditional target prediction methods. The correlation-based strategy also identified the regulatory pairs that were controlled by nonbinding sites with a particular pattern, which provided abundant complementarity for studying the mechanism of miRNAs that regulate gene expression. In summary, our study improved the annotation of known miRNAs, identified a large number of novel miRNAs, and predicted target genes for all pig miRNAs by using massive public data. This large data-based strategy is also applicable for other nonmodel organisms with incomplete annotation information.
