Joint analysis of matched tumor samples with varying tumor contents improves somatic variant calling in the absence of a germline sample

Mapping Intimacies ◽

10.1101/364943 ◽

2018 ◽

Cited By ~ 1

Author(s):

Rebecca F. Halperin ◽

Winnie S. Liang ◽

Sidharth Kulkarni ◽

Erica E. Tassone ◽

Jonathan Adkins ◽

...

Keyword(s):

Normal Tissue ◽

Variant Calling ◽

Joint Analysis ◽

Normal Sample ◽

Adjacent Normal Tissue ◽

Sequencing Data ◽

Somatic Variant ◽

Tissue Samples ◽

Germline Variants ◽

Two Samples

Abstract Archival tumor samples represent a potentially rich resource of annotated specimens for translational genomics research. However, standard variant calling approaches require a matched normal sample from the same individual, which is often not available in the retrospective setting, making it difficult to distinguish between true somatic variants and germline variants that are private to the individual. Archival sections often contain adjacent normal tissue, but this normal tissue can include infiltrating tumor cells. Comparative somatic variant callers are designed to exclude variants present in the normal sample, so a novel approach is required to leverage sequencing of adjacent normal tissue for somatic variant calling. Here we present LumosVar 2.0, a software package designed to jointly analyze multiple samples from the same patient. The approach is based on the concept that the allelic fraction of somatic variants, but not germline variants, would be reduced in samples with low tumor content. LumosVar 2.0 estimates allele-specific copy number and tumor sample fractions from the data, and uses the model to determine expected allelic fractions for somatic and germline variants and classify variants accordingly. To evaluate using LumosVar 2.0 to jointly call somatic variants with tumor and adjacent normal samples, we used a glioblastoma dataset with matched high tumor content, low tumor content, and germline exome sequencing data (to define true somatic variants) available for each patient. We show that both sensitivity and positive predictive value are improved by analyzing the high tumor and low tumor samples jointly compared to analyzing the samples individually or compared to in-silico pooling of the two samples. Finally, we applied this approach to a set of breast and prostate archival tumor samples for which normal samples were not available for germline sequencing, but tumor blocks containing adjacent normal tissue were available for sequencing. Joint analysis using LumosVar 2.0 detected several variants, including known cancer hotspot mutations, that were not detected by standard somatic variant calling tools using the adjacent normal as a reference. Together, these results demonstrate the potential utility of leveraging paired tissue samples to improve somatic variant calling when a constitutional DNA sample is not available.
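
The concept in the abstract above (somatic allelic fractions drop with tumor content, germline ones do not) can be illustrated with a textbook mixture formula. This is a minimal sketch and not LumosVar 2.0's actual model: the function names, the copy-neutral diploid assumption, and the example tumor fractions 0.8 and 0.2 are all assumptions introduced here for illustration.

```python
def expected_allelic_fraction(tumor_fraction, mutant_copies_tumor, total_copies_tumor,
                              mutant_copies_normal=0, total_copies_normal=2):
    """Expected variant allelic fraction in a mixture of tumor and normal cells.

    Simplified mixture formula (an assumption of this sketch):
    mutant allele copies divided by total allele copies, each weighted
    by the fraction of tumor and normal cells in the sample.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    normal_fraction = 1.0 - tumor_fraction
    mutant = (tumor_fraction * mutant_copies_tumor
              + normal_fraction * mutant_copies_normal)
    total = (tumor_fraction * total_copies_tumor
             + normal_fraction * total_copies_normal)
    return mutant / total


def somatic_af(tumor_fraction):
    # Clonal heterozygous somatic variant in a copy-neutral diploid region:
    # present on 1 of 2 copies in tumor cells, absent from normal cells.
    return expected_allelic_fraction(tumor_fraction, 1, 2, mutant_copies_normal=0)


def germline_af(tumor_fraction):
    # Heterozygous germline variant: 1 of 2 copies in every cell.
    return expected_allelic_fraction(tumor_fraction, 1, 2, mutant_copies_normal=1)


if __name__ == "__main__":
    # Hypothetical tumor fractions for a high- and a low-tumor-content sample.
    for label, f in (("high tumor content", 0.8), ("low tumor content", 0.2)):
        print(f"{label}: somatic AF = {somatic_af(f):.2f}, "
              f"germline AF = {germline_af(f):.2f}")
```

Under these assumptions the germline fraction stays at 0.5 in both samples while the somatic fraction falls with tumor content, which is the signal a joint analysis of the two samples can exploit.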

UNMASC: tumor-only variant calling with unmatched normal controls

NAR Cancer ◽

10.1093/narcan/zcab040 ◽

2021 ◽

Vol 3 (4) ◽

Author(s):

Paul Little ◽

Heejoon Jo ◽

Alan Hoyle ◽

Angela Mazul ◽

Xiaobei Zhao ◽

...

Keyword(s):

Decision Rules ◽

Point Mutations ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Normal Sample ◽

Sequencing Data ◽

Germline Variants ◽

Sequencing Errors ◽

Expert Review ◽

Normal Controls

Abstract Despite years of progress, mutation detection in cancer samples continues to require significant manual review as a final step. Expert review is particularly challenging in cases where tumors are sequenced without matched normal control DNA. Attempts have been made to call somatic point mutations without a matched normal sample by removing well-known germline variants, utilizing unmatched normal controls, and constructing decision rules to classify sequencing errors and private germline variants. With budgetary constraints related to computational and sequencing costs, finding the appropriate number of controls is a crucial step to identifying somatic variants. Our approach utilizes public databases for canonical somatic variants as well as germline variants and leverages information gathered about nearby positions in the normal controls. Our cohort of targeted capture panel sequencing of tumor and normal samples with varying tumor types and demographics served as a benchmark for our tumor-only variant calling pipeline, allowing us to observe the relationship between our ability to correctly classify variants and the number of unmatched normals. With our benchmarked samples, approximately ten normal controls were needed to maintain 94% sensitivity, 99% specificity and 76% positive predictive value, far outperforming comparable methods. Our approach, called UNMASC, also serves as a supplement to traditional tumor with matched normal variant calling workflows and can potentially extend to other concerns arising from analyzing next generation sequencing data.
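
The three metrics quoted in the abstract above are standard confusion-matrix quantities. The sketch below shows how they are computed and why a 99% specificity can coexist with a much lower positive predictive value when true somatic variants are rare among candidates. The counts are hypothetical numbers chosen for illustration and are not data from the UNMASC study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity and positive predictive value.

    tp/fn: true somatic variants called / missed.
    fp/tn: non-somatic candidates wrongly called / correctly rejected.
    """
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("each metric needs a non-zero denominator")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts (not from the paper): 100 true somatic variants
    # among 3,100 candidates, so negatives outnumber positives 30 to 1.
    metrics = classification_metrics(tp=94, fp=30, tn=2970, fn=6)
    for name, value in metrics.items():
        print(f"{name}: {value:.1%}")
```

With these made-up counts only 1% of negatives are misclassified, yet those 30 false positives are enough to pull the positive predictive value down to roughly 76%.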

SomVarIUS: somatic variant identification from unpaired tissue samples

Bioinformatics ◽

10.1093/bioinformatics/btv685 ◽

2015 ◽

Vol 32 (6) ◽

pp. 808-813 ◽

Cited By ~ 18

Author(s):

Kyle S. Smith ◽

Vinod K. Yadav ◽

Shanshan Pei ◽

Daniel A. Pollyea ◽

Craig T. Jordan ◽

...

Keyword(s):

High Throughput Sequencing ◽

Variant Calling ◽

Computational Method ◽

Supplementary Information ◽

Sequencing Data ◽

Somatic Variant ◽

Tissue Samples ◽

Normal Tissues ◽

High Throughput Sequencing Data ◽

Oncogenic Mutations

Abstract Motivation: Somatic variant calling typically requires paired tumor-normal tissue samples. Yet, paired normal tissues are not always available in clinical settings or for archival samples. Results: We present SomVarIUS, a computational method for detecting somatic variants using high throughput sequencing data from unpaired tissue samples. We evaluate the performance of the method using genomic data from synthetic and real tumor samples. SomVarIUS identifies somatic variants in exome-seq data of ∼150× coverage with at least 67.7% precision and 64.6% recall rates, when compared with paired-tissue somatic variant calls in real tumor samples. We demonstrate the utility of SomVarIUS by identifying somatic mutations in formalin-fixed samples, and tracking clonal dynamics of oncogenic mutations in targeted deep sequencing data from pre- and post-treatment leukemia samples. Availability and implementation: SomVarIUS is written in Python 2.7 and available at http://www.sjdlab.org/resources/ Supplementary information: Supplementary data are available at Bioinformatics online.

Halvade somatic: Somatic variant calling with Apache Spark

GigaScience ◽

10.1093/gigascience/giab094 ◽

2022 ◽

Vol 11 (1) ◽

Author(s):

Dries Decap ◽

Louise de Schaetzen van Brienen ◽

Maarten Larmuseau ◽

Pascal Costanza ◽

Charlotte Herzeel ◽

...

Keyword(s):

Best Practices ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Computing Time ◽

Variant Calling ◽

Apache Spark ◽

Normal Sample ◽

Whole Genome ◽

Sequencing Data ◽

Somatic Variant

Abstract Background: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample. Findings: We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud. Conclusions: To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.
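
The runtime figures in the abstract above can be turned into speedup and scaling-efficiency numbers with a few lines of arithmetic. This is only a sketch using the rounded hours quoted in the abstract; the helper name `speedup` and the efficiency definition (speedup divided by node count) are choices made here, not something taken from the paper. Recomputing from the rounded runtimes gives a multi-node speedup of about 14.3, close to the reported 14.4, presumably because the published hours are themselves rounded.

```python
def speedup(baseline_hours, improved_hours):
    """Ratio of baseline runtime to improved runtime."""
    if baseline_hours <= 0 or improved_hours <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_hours / improved_hours


# Runtimes as quoted in the abstract (hours).
ORIGINAL_PIPELINE_H = 84.5   # original pipeline, single 36-core node
SINGLE_NODE_H = 19.5         # Halvade Somatic, single 36-core node
MULTI_NODE_H = 1.36          # Halvade Somatic, 16 nodes
NODES = 16

single_node_speedup = speedup(ORIGINAL_PIPELINE_H, SINGLE_NODE_H)
multi_node_speedup = speedup(SINGLE_NODE_H, MULTI_NODE_H)
overall_speedup = speedup(ORIGINAL_PIPELINE_H, MULTI_NODE_H)
# Scaling efficiency: fraction of ideal linear scaling achieved on 16 nodes.
scaling_efficiency = multi_node_speedup / NODES

if __name__ == "__main__":
    print(f"single-node speedup: {single_node_speedup:.1f}x")
    print(f"additional multi-node speedup: {multi_node_speedup:.1f}x")
    print(f"overall speedup: {overall_speedup:.1f}x")
    print(f"scaling efficiency on {NODES} nodes: {scaling_efficiency:.0%}")
```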

A unified haplotype-based method for accurate and comprehensive variant calling

10.1101/456103 ◽

2018 ◽

Cited By ~ 3

Author(s):

Daniel P Cooke ◽

David C Wedge ◽

Gerton Lunter

Keyword(s):

De Novo ◽

Variant Calling ◽

Normal Sample ◽

Sequencing Data ◽

Somatic Variation ◽

Data Set ◽

Small Complex ◽

Physical Linkage ◽

Germline Variation ◽

Almost All

Haplotype-based variant callers, which consider physical linkage between variant sites, are currently among the best tools for germline variation discovery and genotyping from short-read sequencing data. However, almost all such tools were designed specifically for detecting common germline variation in diploid populations, and give sub-optimal results in other scenarios. Here we present Octopus, a versatile haplotype-based variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. We show that Octopus accurately calls de novo mutations in parent-offspring trios and germline variants in individuals, including SNVs, indels, and small complex replacements such as microinversions. In addition, using a carefully designed synthetic-tumour data set derived from clean sequencing data from a sample with known germline haplotypes, and observed mutations in a large cohort of tumour samples, we show that Octopus accurately characterizes germline and somatic variation in tumours, both with and without a paired normal sample. Sequencing reads and prior information are combined to phase called genotypes of arbitrary ploidy, including those with somatic mutations. Octopus also outputs realigned evidence BAMs to aid validation and interpretation.

TNscope: Accurate Detection of Somatic Mutations with Haplotype-based Variant Candidate Detection and Machine Learning Filtering

10.1101/250647 ◽

2018 ◽

Cited By ~ 11

Author(s):

Donald Freed ◽

Renke Pan ◽

Rafael Aldana

Keyword(s):

Machine Learning ◽

Normal Tissue ◽

Structural Variation ◽

Somatic Mutations ◽

Variant Calling ◽

Somatic Variant ◽

Accurate Detection ◽

Machine Learning Model ◽

Early Engineering ◽

Somatic Mutation Calling

Abstract Detection of somatic mutations in tumor samples is important in the clinic, where treatment decisions are increasingly based upon molecular diagnostics. However, accurate detection of these mutations is difficult, due in part to intra-tumor heterogeneity, contamination of the tumor sample with normal tissue and pervasive structural variation. Here, we describe Sentieon TNscope, a haplotype-based somatic variant caller with increased accuracy relative to existing methods. An early engineering version of TNscope was used in our submission to the most recent ICGC-DREAM Somatic Mutation calling challenge. In that challenge, TNscope was the leader in accuracy for SNVs, indels and SVs. To further improve variant calling accuracy, we combined the improvements in the variant caller with machine learning. We benchmarked TNscope using in-silico mixtures of well-characterized Genome in a Bottle (GIAB) samples. TNscope displays higher accuracy than the other benchmarked tools and the accuracy is substantially improved by the machine learning model.

SomatoSim: precision simulation of somatic single nucleotide variants

BMC Bioinformatics ◽

10.1186/s12859-021-04024-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Marwan A. Hawari ◽

Celine S. Hong ◽

Leslie G. Biesecker

Keyword(s):

High Throughput Sequencing ◽

Variant Calling ◽

Simulated Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Somatic Variant ◽

Simulation Tools ◽

Gold Standard Dataset ◽

High Level

Abstract Background: Somatic single nucleotide variants have gained increased attention because of their role in cancer development and the widespread use of high-throughput sequencing techniques. The necessity to accurately identify these variants in sequencing data has led to a proliferation of somatic variant calling tools. Additionally, the use of simulated data to assess the performance of these tools has become common practice, as there is no gold standard dataset for benchmarking performance. However, many existing somatic variant simulation tools are limited because they rely on generating entirely synthetic reads derived from a reference genome or because they do not allow for the precise customizability that would enable a more focused understanding of single nucleotide variant calling performance. Results: SomatoSim is a tool that lets users simulate somatic single nucleotide variants in sequence alignment map (SAM/BAM) files with full control of the specific variant positions, number of variants, variant allele fractions, depth of coverage, read quality, and base quality, among other parameters. SomatoSim accomplishes this through a three-stage process: variant selection, where candidate positions are selected for simulation; variant simulation, where reads are selected and mutated; and variant evaluation, where SomatoSim summarizes the simulation results. Conclusions: SomatoSim is a user-friendly tool that offers a high level of customizability for simulating somatic single nucleotide variants. SomatoSim is available at https://github.com/BieseckerLab/SomatoSim.

Standard operating procedure for somatic variant refinement of tumor sequencing data

10.1101/266262 ◽

2018 ◽

Cited By ~ 1

Author(s):

Erica K. Barnell ◽

Peter Ronning ◽

Katie M. Campbell ◽

Kilannin Krysiak ◽

Benjamin J. Ainscough ◽

...

Keyword(s):

Massively Parallel Sequencing ◽

Variant Calling ◽

Standard Operating Procedure ◽

Sequencing Data ◽

Optimal Method ◽

Somatic Variant ◽

Standard Operating ◽

Variant Detection ◽

Manual Review

Abstract Purpose: Manual review of aligned sequencing reads is required to develop a high-quality list of somatic variants from massively parallel sequencing (MPS) data. Despite widespread use in analyzing MPS data, there has been little attempt to describe methods for manual review, resulting in high inter- and intra-lab variability in somatic variant detection and characterization of tumors. Methods: Open source software was used to develop an optimal method for manual review setup. We also developed a systematic approach to visually inspect each variant during manual review. Results: We present a standard operating procedure for somatic variant refinement for use by manual reviewers. The approach is enhanced through representative examples of 4 different manual review categories that indicate a reviewer’s confidence in the somatic variant call and 19 annotation tags that contextualize commonly observed sequencing patterns during manual review. Representative examples provide detailed instructions on how to classify variants during manual review to rectify lack of confidence in automated somatic variant detection. Conclusion: Standardization of somatic variant refinement through systematization of manual review will improve the consistency and reproducibility of identifying true somatic variants after automated variant calling.

SMuRF: Portable and accurate ensemble-based somatic variant calling

10.1101/270413 ◽

2018 ◽

Cited By ~ 2

Author(s):

Weitai Huang ◽

Yu Amanda Guo ◽

Karthik Muthukumar ◽

Probhonjon Baruah ◽

Meimei Chang ◽

...

Keyword(s):

Point Mutations ◽

Variant Calling ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Somatic Variant ◽

Level Data ◽

Machine Learning Approach ◽

Cancer Types ◽

User Friendly ◽

Improved Accuracy

Abstract Summary: SMuRF is an ensemble method for prediction of somatic point mutations (SNVs) and small insertions/deletions (indels) in cancer genomes. The method integrates predictions and auxiliary features from different somatic mutation callers using a Random Forest machine learning approach. SMuRF is trained on community-curated tumor whole genome sequencing data, is robust across cancer types, and achieves improved accuracy for both SNV and indel predictions of genome and exome-level data. The software is user-friendly and portable by design, operating as an add-on to the community-developed bcbio-nextgen somatic variant calling pipeline.

UVC: universality-based calling of small variants using pseudo-neural networks

10.1101/2020.08.23.263749 ◽

2020 ◽

Author(s):

Xiaofei Zhao ◽

Allison Hu ◽

Sizhen Wang ◽

Xiaoyue Wang

Keyword(s):

Neural Network ◽

State Of The Art ◽

Variant Calling ◽

The State ◽

Training Data ◽

Normal Sample ◽

Sequencing Data ◽

Damage Repair ◽

Biological Insight ◽

Sensitivity Specificity

Abstract We describe UVC (https://github.com/genetronhealth/uvc), an open-source method for calling small somatic variants. UVC is aware of both unique molecular identifiers (UMIs) and the tumor-matched normal sample. UVC utilizes the following power-law universality that we discovered: allele fraction is inversely proportional to the cube root of the variant-calling error rate. Moreover, UVC utilizes a pseudo-neural network (PNN). A PNN is similar to a deep neural network but does not require any training data. UVC outperformed Mageri and smCounter2, the state-of-the-art UMI-aware variant callers, on the tumor-only datasets used for publishing these two variant callers. Also, UVC outperformed Mutect2 and Strelka2, the state-of-the-art variant callers for tumor-normal pairs, on the Genome-in-a-Bottle somatic truth sets. UVC outperformed Mutect2 and Strelka2 on 21 in silico mixtures simulating 21 combinations of tumor purity and normal purity. Performance is measured by using the sensitivity-specificity trade-off for all called variants. The improved variant calls generated by UVC from previously published UMI-based sequencing data are able to provide additional biological insight about DNA damage repair. The versatility and robustness of UVC make it a useful tool for variant calling in clinical settings.
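
The power-law relation stated in the abstract above can be read as: allele fraction is proportional to the error rate raised to the power -1/3, so the error rate scales with the inverse cube of the allele fraction. The sketch below works only with ratios, because the proportionality constant is not given in the abstract. The function names are hypothetical and the code does not reflect how UVC itself applies the relation.

```python
import math


def relative_error_rate(allele_fraction, reference_allele_fraction):
    """Error rate at allele_fraction relative to a reference allele fraction.

    Assumes error_rate is proportional to allele_fraction ** -3, which is
    the abstract's relation (allele fraction proportional to
    error_rate ** (-1/3)) solved for the error rate.
    """
    if allele_fraction <= 0 or reference_allele_fraction <= 0:
        raise ValueError("allele fractions must be positive")
    return (reference_allele_fraction / allele_fraction) ** 3


def phred_gain(allele_fraction, reference_allele_fraction):
    """Change in Phred-scaled quality (-10 * log10 of the error ratio)."""
    ratio = relative_error_rate(allele_fraction, reference_allele_fraction)
    return -10.0 * math.log10(ratio)


if __name__ == "__main__":
    # Doubling the allele fraction from 1% to 2% under this relation.
    print(f"relative error rate: {relative_error_rate(0.02, 0.01):.3f}")
    print(f"Phred gain: {phred_gain(0.02, 0.01):.2f}")
```

Under this reading, doubling the allele fraction cuts the expected error rate by a factor of eight, which corresponds to about 9 Phred units.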

DNA methylation markers for diagnosis and prognosis of common cancers

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1703577114 ◽

2017 ◽

Vol 114 (28) ◽

pp. 7414-7419 ◽

Cited By ~ 149

Author(s):

Xiaoke Hao ◽

Huiyan Luo ◽

Michal Krawczyk ◽

Wei Wei ◽

Wenqiu Wang ◽

...

Keyword(s):

Colorectal Cancer ◽

Dna Methylation ◽

Cancer Biology ◽

Normal Tissue ◽

Diagnostic Methods ◽

Adjacent Normal Tissue ◽

Tissue Samples ◽

Diagnosis And Prognosis ◽

Cancer Metastases ◽

Colorectal Cancer Metastases

The ability to identify a specific cancer using minimally invasive biopsy holds great promise for improving the diagnosis, treatment selection, and prediction of prognosis in cancer. Using whole-genome methylation data from The Cancer Genome Atlas (TCGA) and machine learning methods, we evaluated the utility of DNA methylation for differentiating tumor tissue and normal tissue for four common cancers (breast, colon, liver, and lung). We identified cancer markers in a training cohort of 1,619 tumor samples and 173 matched adjacent normal tissue samples. We replicated our findings in a separate TCGA cohort of 791 tumor samples and 93 matched adjacent normal tissue samples, as well as an independent Chinese cohort of 394 tumor samples and 324 matched adjacent normal tissue samples. The DNA methylation analysis could predict cancer versus normal tissue with more than 95% accuracy in these three cohorts, demonstrating accuracy comparable to typical diagnostic methods. This analysis also correctly identified 29 of 30 colorectal cancer metastases to the liver and 32 of 34 colorectal cancer metastases to the lung. We also found that methylation patterns can predict prognosis and survival. We correlated differential methylation of CpG sites predictive of cancer with expression of associated genes known to be important in cancer biology, showing decreased expression with increased methylation, as expected. We verified gene expression profiles in a mouse model of hepatocellular carcinoma. Taken together, these findings demonstrate the utility of methylation biomarkers for the molecular characterization of cancer, with implications for diagnosis and prognosis.
