Predicting clone genotypes from tumor bulk sequencing of multiple samples

Mapping Intimacies ◽

10.1101/341180 ◽

2018 ◽

Author(s):

Sayaka Miura ◽

Karen Gomez ◽

Oscar Murillo ◽

Louise A Huuki ◽

Tracy Vu ◽

...

Keyword(s):

Computational Methods ◽

Prior Information ◽

Supplementary Information ◽

Cell Populations ◽

Absolute Accuracy ◽

Sequencing Data ◽

Single Nucleotide ◽

Supplementary Material ◽

Multiple Samples ◽

Genotype Inference

Abstract Motivation Analyses of data generated from bulk sequencing of tumors have revealed extensive genomic heterogeneity within patients. Many computational methods have been developed to enable the inference of genotypes of tumor cell populations (clones) from bulk sequencing data. However, the relative and absolute accuracy of available computational methods in estimating clone counts and clone genotypes is not yet known. Results We have assessed the performance of nine methods, including eight previously-published and one new method (CloneFinder), by analyzing computer simulated datasets. CloneFinder, LICHeE, CITUP, and cloneHD inferred clone genotypes with low error (<5% per clone) for a majority of datasets in which the tumor samples contained evolutionarily-related clones. Computational methods did not perform well for datasets in which tumor samples contained mixtures of clones from different clonal lineages. Generally, the number of clones was underestimated by cloneHD and overestimated by PhyloWGS, and BayClone2, Canopy, and Clomial required prior information regarding the number of clones. AncesTree and Canopy did not produce results for a large number of datasets. Conclusions Deconvolution of clone genotypes from single nucleotide variant (SNV) frequency differences among tumor samples remains challenging, so there is a need to develop more accurate computational methods and robust software for clone genotype inference. Availability and Implementation CloneFinder is implemented in Python and is available from https://github.com/gstecher/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.
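
The deconvolution problem described in this abstract can be illustrated with a small forward model. The sketch below is not CloneFinder or any of the benchmarked tools; it assumes diploid cells, heterozygous SNVs and no copy-number changes, and the function names and the error definition (fraction of mismatched sites per clone) are hypothetical choices made for illustration only.

```python
def expected_snv_frequencies(clone_genotypes, clone_fractions):
    """Expected variant allele frequency of each SNV in one bulk sample.

    clone_genotypes: dict mapping clone name -> list of 0/1 flags (1 = SNV present).
    clone_fractions: dict mapping clone name -> fraction of cells in the sample.
    Assumption: diploid, heterozygous SNVs, no copy-number changes, so a clone
    carrying an SNV contributes half of its cell fraction to the frequency.
    """
    n_snvs = len(next(iter(clone_genotypes.values())))
    freqs = [0.0] * n_snvs
    for clone, genotype in clone_genotypes.items():
        fraction = clone_fractions.get(clone, 0.0)
        for i, has_snv in enumerate(genotype):
            freqs[i] += 0.5 * fraction * has_snv
    return freqs


def genotype_error(inferred, true):
    """Fraction of sites at which an inferred clone genotype differs from the truth."""
    if len(inferred) != len(true):
        raise ValueError("genotypes must cover the same SNV sites")
    mismatches = sum(a != b for a, b in zip(inferred, true))
    return mismatches / len(true)


# Three evolutionarily related clones: B descends from A, C descends from B.
genotypes = {"A": [1, 0, 0], "B": [1, 1, 0], "C": [1, 1, 1]}
fractions = {"A": 0.5, "B": 0.3, "C": 0.2}
sample_frequencies = expected_snv_frequencies(genotypes, fractions)
```

Inference methods work in the opposite direction: they start from frequencies such as `sample_frequencies`, observed in several samples, and try to recover both `genotypes` and `fractions`.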

neoepiscope improves neoepitope prediction with multivariant phasing

Bioinformatics ◽

10.1093/bioinformatics/btz653 ◽

2019 ◽

Vol 36 (3) ◽

pp. 713-720 ◽

Cited By ~ 5

Author(s):

Mary A Wood ◽

Austin Nguyen ◽

Adam J Struck ◽

Kyle Ellrott ◽

Abhinav Nellore ◽

...

Keyword(s):

False Negative ◽

Supplementary Information ◽

Supplementary File ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Somatic Variant ◽

Negative Results ◽

Multiple Datasets ◽

False Negative Results

Abstract Motivation The vast majority of tools for neoepitope prediction from DNA sequencing of complementary tumor and normal patient samples do not consider germline context or the potential for the co-occurrence of two or more somatic variants on the same mRNA transcript. Without consideration of these phenomena, existing approaches are likely to produce both false-positive and false-negative results, resulting in an inaccurate and incomplete picture of the cancer neoepitope landscape. We developed neoepiscope chiefly to address this issue for single nucleotide variants (SNVs) and insertions/deletions (indels). Results Herein, we illustrate how germline and somatic variant phasing affects neoepitope prediction across multiple datasets. We estimate that up to ∼5% of neoepitopes arising from SNVs and indels may require variant phasing for their accurate assessment. neoepiscope is performant, flexible and supports several major histocompatibility complex binding affinity prediction tools. Availability and implementation neoepiscope is available on GitHub at https://github.com/pdxgx/neoepiscope under the MIT license. Scripts for reproducing results described in the text are available at https://github.com/pdxgx/neoepiscope-paper under the MIT license. Additional data from this study, including summaries of variant phasing incidence and benchmarking wallclock times, are available in Supplementary Files 1, 2 and 3. Supplementary File 1 contains Supplementary Table 1, Supplementary Figures 1 and 2, and descriptions of Supplementary Tables 2–8. Supplementary File 2 contains Supplementary Tables 2–6 and 8. Supplementary File 3 contains Supplementary Table 7. 
Raw sequencing data used for the analyses in this manuscript are available from the Sequence Read Archive under accessions PRJNA278450, PRJNA312948, PRJNA307199, PRJNA343789, PRJNA357321, PRJNA293912, PRJNA369259, PRJNA305077, PRJNA306070, PRJNA82745 and PRJNA324705; from the European Genome-phenome Archive under accessions EGAD00001004352 and EGAD00001002731; and by direct request to the authors. Supplementary information Supplementary data are available at Bioinformatics online.
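
Why phasing matters can be shown with a toy case in which two SNVs fall in the same codon. This is a minimal sketch, not neoepiscope's implementation: the helper names are hypothetical, the codon table is a four-entry subset of the standard genetic code, and real tools work on full transcripts and phased VCF records.

```python
# Subset of the standard genetic code, enough for this example only.
CODON_TABLE = {"AAA": "K", "GAA": "E", "ACA": "T", "GCA": "A"}


def apply_variants(sequence, variants):
    """Apply SNVs given as (position, ref_base, alt_base) to a coding sequence."""
    bases = list(sequence)
    for position, ref_base, alt_base in variants:
        if bases[position] != ref_base:
            raise ValueError(f"reference mismatch at position {position}")
        bases[position] = alt_base
    return "".join(bases)


def translate(sequence):
    """Translate a coding sequence codon by codon using CODON_TABLE."""
    return "".join(CODON_TABLE[sequence[i:i + 3]] for i in range(0, len(sequence), 3))


reference = "AAA"        # lysine
snv_1 = (0, "A", "G")    # alone: GAA, glutamate
snv_2 = (1, "A", "C")    # alone: ACA, threonine

unphased = {translate(apply_variants(reference, [v])) for v in (snv_1, snv_2)}
phased = translate(apply_variants(reference, [snv_1, snv_2]))  # GCA, alanine
```

Treating the two variants independently predicts peptides containing E or T, while the transcript that carries both encodes A, so neither independent prediction matches the residue actually produced.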

AlphaFamImpute: high-accuracy imputation in full-sib families from genotype-by-sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa499 ◽

2020 ◽

Vol 36 (15) ◽

pp. 4369-4371

Author(s):

Andrew Whalen ◽

Gregor Gorjanc ◽

John M Hickey

Keyword(s):

Snp Array ◽

High Accuracy ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide ◽

Genome Wide ◽

Sib Families ◽

Genotype By Sequencing ◽

Low Coverage ◽

Python Package

Abstract Summary AlphaFamImpute is an imputation package for calling, phasing and imputing genome-wide genotypes in outbred full-sib families from single nucleotide polymorphism (SNP) array and genotype-by-sequencing (GBS) data. GBS data are increasingly being used to genotype individuals, especially when SNP arrays do not exist for a population of interest. Low-coverage GBS produces data with a large number of missing or incorrect naïve genotype calls, which can be improved by identifying shared haplotype segments between full-sib individuals. Here, we present AlphaFamImpute, an algorithm specifically designed to exploit the genetic structure of full-sib families. It performs imputation using a two-step approach. In the first step, it phases and imputes parental genotypes based on the segregation states of their offspring (i.e. which pair of parental haplotypes the offspring inherited). In the second step, it phases and imputes the offspring genotypes by detecting which haplotype segments the offspring inherited from their parents. With a series of simulations, we find that AlphaFamImpute obtains high-accuracy genotypes, even when the parents are not genotyped and individuals are sequenced at <1x coverage. Availability and implementation AlphaFamImpute is available as a Python package from the AlphaGenes website http://www.AlphaGenes.roslin.ed.ac.uk/AlphaFamImpute. Supplementary information Supplementary data are available at Bioinformatics online.
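
The uncertainty of naïve genotype calls at low coverage can be made concrete with a standard binomial read model. This sketch is not AlphaFamImpute's algorithm; the function name, the flat prior over genotypes and the default `error_rate` of 0.01 are assumptions for illustration.

```python
from math import comb


def genotype_posteriors(ref_reads, alt_reads, error_rate=0.01):
    """Posterior probabilities of genotypes 0, 1 and 2 (alternative-allele dosage).

    Assumptions: reads are independent, each read shows the alternative allele
    with probability error_rate, 0.5 or 1 - error_rate for genotypes 0, 1, 2,
    the prior over genotypes is flat, and 0 < error_rate < 1.
    """
    n_reads = ref_reads + alt_reads
    p_alt_by_genotype = [error_rate, 0.5, 1.0 - error_rate]
    likelihoods = [
        comb(n_reads, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for p in p_alt_by_genotype
    ]
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


one_read = genotype_posteriors(ref_reads=0, alt_reads=1)
ten_reads = genotype_posteriors(ref_reads=10, alt_reads=0)
```

With a single alternative read, the heterozygous genotype keeps about a third of the posterior mass, which is the kind of ambiguity that family-based imputation resolves by borrowing information from shared haplotype segments.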

FQSqueezer: k-mer-based compression of sequencing data

10.1101/559807 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz

Keyword(s):

Data Compression ◽

State Of The Art ◽

Genomic Data ◽

General Purpose ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Partial Matching ◽

Supplementary Material ◽

Better Than

Abstract Motivation The amount of genomic data that needs to be stored is huge. Therefore it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world. The compression ratios are often tens of percent better than offered by the state-of-the-art tools. Availability and Implementation https://github.com/refresh-bio/[email protected] Supplementary information Supplementary data are available at publisher’s Web site.
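
The context-modelling idea behind prediction by partial matching can be sketched as an adaptive order-k model that predicts each base from the k bases before it. This is not FQSqueezer's algorithm: it has no escape mechanism and no arithmetic coder, it uses simple add-one smoothing, and the function name and parameters are hypothetical. It only estimates how many bits an ideal coder would spend under such a model.

```python
import math
from collections import defaultdict


def adaptive_cost_bits(sequence, order, alphabet="ACGT"):
    """Estimated coding cost in bits under an adaptive order-`order` context model.

    Each symbol is predicted from counts of what followed the same context
    earlier in the sequence, with add-one smoothing over the alphabet.
    """
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    for i, symbol in enumerate(sequence):
        context = sequence[max(0, i - order):i]
        context_counts = counts[context]
        seen = sum(context_counts.values())
        probability = (context_counts[symbol] + 1) / (seen + len(alphabet))
        bits -= math.log2(probability)
        context_counts[symbol] += 1  # update the model after coding the symbol
    return bits


repetitive_read = "ACGT" * 50
cost_order_0 = adaptive_cost_bits(repetitive_read, order=0)
cost_order_2 = adaptive_cost_bits(repetitive_read, order=2)
```

On a repetitive sequence the order-2 model becomes nearly deterministic and needs far fewer bits than the order-0 model, which is why longer contexts (k-mers) pay off on redundant sequencing data.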

Bivartect: accurate and memory-saving breakpoint detection by direct read comparison

Bioinformatics ◽

10.1093/bioinformatics/btaa059 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2725-2730

Author(s):

Keisuke Shimmura ◽

Yuki Kato ◽

Yukio Kawahara

Keyword(s):

Genome Editing ◽

High Throughput Sequencing ◽

Variant Calling ◽

Simulated Data ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Node ◽

Single Nucleotide ◽

Target Sites

Abstract Motivation Genetic variant calling with high-throughput sequencing data has been recognized as a useful tool for better understanding of disease mechanisms and detection of potential off-target sites in genome editing. Since most of the variant calling algorithms rely on initial mapping onto a reference genome and tend to predict many variant candidates, variant calling remains challenging in terms of predicting variants with low false positives. Results Here we present Bivartect, a simple yet versatile variant caller based on direct comparison of short sequence reads between normal and mutated samples. Bivartect can detect not only single nucleotide variants but also insertions/deletions, inversions and their complexes. Bivartect achieves high predictive performance with an elaborate memory-saving mechanism, which allows Bivartect to run on a computer with a single node for analyzing small omics data. Tests with simulated benchmark and real genome-editing data indicate that Bivartect was comparable to state-of-the-art variant callers in positive predictive value for detection of single nucleotide variants, even though it yielded a substantially smaller number of candidates. These results suggest that Bivartect, a reference-free approach, will contribute to the identification of germline mutations as well as off-target sites introduced during genome editing with high accuracy. Availability and implementation Bivartect is implemented in C++ and available along with in silico simulated data at https://github.com/ykat0/bivartect. Supplementary information Supplementary data are available at Bioinformatics online.
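
Reference-free comparison of two read sets can be illustrated with a k-mer set difference. The abstract does not describe Bivartect's actual algorithm or data structures, so this is a swapped-in, simplified technique under stated assumptions: error-free reads, a single strand (no reverse complements), and hypothetical function names.

```python
def kmers(reads, k):
    """Set of all length-k substrings found in a collection of reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def sample_specific_kmers(normal_reads, mutated_reads, k):
    """k-mers present in the mutated sample but absent from the normal sample."""
    return kmers(mutated_reads, k) - kmers(normal_reads, k)


normal = ["ACGTACGT"]
mutated = ["ACGAACGT"]  # single-base change at index 3 (T -> A)
candidates = sample_specific_kmers(normal, mutated, k=4)
```

Every k-mer in `candidates` overlaps the changed base, so the sample-specific k-mers localize a candidate breakpoint without mapping either sample to a reference genome.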

ReQTL – an allele-level measure of variation-expression genomic relationships

10.1101/464206 ◽

2018 ◽

Author(s):

Liam Spurr ◽

Nawaf Alomran ◽

Piotr Słowiński ◽

Muzi Li ◽

Pavlos Bousounis ◽

...

Keyword(s):

Gene Expression ◽

Genetic Variation ◽

Rna Sequencing ◽

High Performance ◽

Supplementary Information ◽

Phenotypic Trait ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Genomic Relationships ◽

Supplementary Material

Motivation By testing for association of DNA genotypes with gene expression levels, expression quantitative trait locus (eQTL) analyses have been instrumental in understanding how thousands of single nucleotide variants (SNVs) may affect gene expression. As compared to DNA genotypes, RNA genetic variation represents a phenotypic trait that reflects the actual allele content of the studied system. RNA genetic variation can be measured at expressed genome regions, and differs from the DNA genotype in sites subjected to regulatory forces. Therefore, assessment of correlation between RNA genetic variation and gene expression can reveal regulatory genomic relationships in addition to eQTLs. Results We introduce ReQTL, an eQTL modification which replaces the DNA allele count with the variant allele frequency (VAF) at expressed SNV loci in the transcriptome. We exemplify the method on sets of RNA-sequencing data from human tissues obtained through the Genotype-Tissue Expression Project (GTEx) and demonstrate that ReQTL analyses show consistently high performance and sufficient power to identify both previously known and novel molecular associations. The majority of the SNVs implicated in significant cis-ReQTLs identified by our analysis were previously reported as significant cis-eQTL loci. Notably, trans ReQTL loci in our data were substantially enriched in RNA-editing sites. In summary, ReQTL analyses are computationally feasible and do not require matched DNA data, hence they have a high potential to facilitate the discovery of novel molecular interactions through exploration of the increasingly accessible RNA-sequencing datasets. Availability and implementation Sample scripts used in our ReQTL analyses are available with the Supplementary Material (ReQTL_sample_code). Contact [email protected] or [email protected] Supplementary Information Re_QTL_Supplementary_Data.zip
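
The core quantity in this abstract, the VAF at an expressed SNV locus, and its association with expression can be sketched as follows. This is not the authors' sample code: the function names are hypothetical, the read counts and expression values are made up for illustration, and a plain Pearson correlation stands in for whatever association test the ReQTL scripts apply.

```python
from math import sqrt


def variant_allele_frequency(ref_reads, alt_reads):
    """VAF at one site: alternative-allele reads over all reads covering the site."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# One SNV locus in three samples: (reference reads, alternative reads) from RNA-seq.
read_counts = [(20, 0), (10, 10), (0, 20)]
gene_expression = [1.0, 2.0, 3.0]  # illustrative expression values per sample

vafs = [variant_allele_frequency(ref, alt) for ref, alt in read_counts]
association = pearson_r(vafs, gene_expression)
```

Unlike a DNA allele count restricted to 0, 1 or 2, the VAF is continuous and is computed from the RNA-sequencing reads alone, which is why no matched DNA data are needed.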

Meltos: multi-sample tumor phylogeny reconstruction for structural variants

Bioinformatics ◽

10.1093/bioinformatics/btz737 ◽

2019 ◽

Cited By ~ 2

Author(s):

Camir Ricketts ◽

Daniel Seidman ◽

Victoria Popic ◽

Fereydoun Hormozdiari ◽

Serafim Batzoglou ◽

...

Keyword(s):

Supplementary Information ◽

Whole Genome Sequencing Data ◽

Structural Variants ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

High Confidence ◽

Cancer Data ◽

Lineage Tree ◽

Multiple Samples ◽

Tumor Phylogeny

Abstract Motivation We propose Meltos, a novel computational framework to address the challenging problem of building tumor phylogeny trees using somatic structural variants (SVs) among multiple samples. Meltos leverages the tumor phylogeny tree built on somatic single nucleotide variants (SNVs) to identify high confidence SVs and produce a comprehensive tumor lineage tree, using a novel optimization formulation. While we do not assume the evolutionary progression of SVs is necessarily the same as SNVs, we show that a tumor phylogeny tree using high-quality somatic SNVs can act as a guide for calling and assigning somatic SVs on a tree. Meltos utilizes multiple genomic read signals for potential SV breakpoints in whole genome sequencing data and proposes a probabilistic formulation for estimating variant allele fractions (VAFs) of SV events. Results In order to assess the ability of Meltos to correctly refine SNV trees with SV information, we tested Meltos on two simulated datasets, each with five genomes. We also assessed Meltos on two real cancer datasets. We tested Meltos on multiple samples from a liposarcoma tumor and on a multi-sample breast cancer dataset (Yates et al., 2015), where the authors provide validated structural variation events together with deep, targeted sequencing for a collection of somatic SNVs. We show Meltos has the ability to place high confidence validated SV calls on a refined tumor phylogeny tree. We also showed the flexibility of Meltos to either estimate VAFs directly from genomic data or to use copy number corrected estimates. Availability and implementation Meltos is available at https://github.com/ih-lab/Meltos. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

BAMixChecker: an automated checkup tool for matched sample pairs in NGS cohort

Bioinformatics ◽

10.1093/bioinformatics/btz479 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4806-4808 ◽

Cited By ~ 2

Author(s):

Hein Chun ◽

Sangwoo Kim

Keyword(s):

Genomic Analysis ◽

Supplementary Information ◽

Nucleotide Polymorphisms ◽

Rna Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

Frequent Problem ◽

Generation Sequencing ◽

User Intervention ◽

Genotype Concordance

Abstract Summary Mislabeling in the process of next generation sequencing is a frequent problem that can cause an entire genomic analysis to fail, and a regular cohort-level checkup is needed to ensure that it has not occurred. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches from a given BAM file cohort with minimal user intervention. BAMixChecker uses a flexible, data-specific set of single-nucleotide polymorphisms and detects orphan (unpaired) and swapped (mispaired) samples based on genotype-concordance score and entropy-based file name analysis. BAMixChecker shows ∼100% accuracy in real WES, RNA-Seq and targeted sequencing data cohorts, even for small panels (<50 genes). BAMixChecker provides an HTML-style report that graphically outlines the sample matching status in tables and heatmaps, with which users can quickly inspect any mismatch events. Availability and implementation BAMixChecker is available at https://github.com/heinc1010/BAMixChecker Supplementary information Supplementary data are available at Bioinformatics online.
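
A genotype-concordance score of the kind mentioned in this abstract can be sketched as the fraction of shared SNP sites at which two samples carry the same genotype. This is an illustrative simplification, not BAMixChecker's scoring: the function names, the genotype encoding and the example sites are assumptions, and the entropy-based file name analysis is not covered.

```python
def genotype_concordance(genotypes_a, genotypes_b):
    """Fraction of SNP sites genotyped in both samples that have identical genotypes.

    Each argument maps a site, e.g. ("chr1", 12345), to a genotype string such
    as "0/0", "0/1" or "1/1". Returns None when the samples share no sites.
    """
    shared_sites = [site for site in genotypes_a if site in genotypes_b]
    if not shared_sites:
        return None
    matches = sum(genotypes_a[site] == genotypes_b[site] for site in shared_sites)
    return matches / len(shared_sites)


def best_match(sample_name, cohort):
    """Name of the other sample in the cohort with the highest concordance."""
    scores = {
        other: genotype_concordance(cohort[sample_name], genotypes)
        for other, genotypes in cohort.items()
        if other != sample_name
    }
    scored = {name: score for name, score in scores.items() if score is not None}
    return max(scored, key=scored.get) if scored else None


cohort = {
    "P1_tumor": {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr2", 50): "0/0"},
    "P1_normal": {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr2", 50): "0/0"},
    "P2_normal": {("chr1", 100): "0/0", ("chr1", 200): "0/1", ("chr2", 50): "0/0"},
}
```

Samples from the same individual are expected to score close to 1, so a sample whose best match is unexpected, or that has no high-scoring partner at all, is a candidate swap or orphan.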

Accuracy of somatic variant detection in multiregional tumor sequencing data

10.1101/655605 ◽

2019 ◽

Cited By ~ 1

Author(s):

Harald Detering ◽

Laura Tomás ◽

Tamara Prieto ◽

David Posada

Keyword(s):

Variant Calling ◽

Sequencing Data ◽

Single Nucleotide ◽

Somatic Variant ◽

Work Sample ◽

Tumor Sequencing ◽

Variant Detection ◽

Correction Step ◽

Multiple Samples ◽

Better Than

Abstract Multiregional bulk sequencing data is necessary to characterize intratumor genetic heterogeneity. Novel somatic variant calling approaches aim to address the particular characteristics of multiregional data, but it remains unclear to what extent they improve on single-sample strategies. Here we compared the performance of 16 single-nucleotide variant calling approaches on multiregional sequencing data under different scenarios with in-silico and real sequencing reads, including varying sequencing coverage and increasing levels of spatial clonal admixture. Under the conditions simulated, methods that use information across multiple samples do not necessarily perform better than some of the standard calling methods that work sample by sample. Nonetheless, our results indicate that under difficult conditions, Mutect2 in multisample mode, in combination with a correction step, seems to perform best. Our analysis provides data-driven guidance for users and developers of somatic variant calling tools.
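
Comparisons of variant callers against simulated truth sets are commonly summarized with precision, recall and F1. The abstract does not state which metrics the study used, so the sketch below is a generic illustration; the function name and the variant tuples are made up.

```python
def precision_recall_f1(called_variants, true_variants):
    """Precision, recall and F1 of a call set against a truth set.

    Variants are hashable records, here (chromosome, position, ref, alt) tuples.
    Empty call or truth sets yield 0.0 for the affected metric.
    """
    called = set(called_variants)
    truth = set(true_variants)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


truth_set = {
    ("chr1", 100, "A", "T"),
    ("chr1", 200, "G", "C"),
    ("chr3", 10, "T", "G"),
    ("chr3", 20, "C", "A"),
}
call_set = {
    ("chr1", 100, "A", "T"),
    ("chr1", 200, "G", "C"),
    ("chr2", 50, "C", "T"),  # false positive
}
precision, recall, f1 = precision_recall_f1(call_set, truth_set)
```

In a multiregional setting the same computation can be repeated per sample or on the union of calls across samples, which makes single-sample and multi-sample callers directly comparable.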

VISOR: a versatile haplotype-aware structural variant simulator for short- and long-read sequencing

Bioinformatics ◽

10.1093/bioinformatics/btz719 ◽

2019 ◽

Author(s):

Davide Bolognini ◽

Ashley Sanders ◽

Jan O Korbel ◽

Alberto Magi ◽

Vladimir Benes ◽

...

Keyword(s):

Single Cell ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Cancer Heterogeneity ◽

Long Reads ◽

Long Read ◽

Complex Structural ◽

Error Profiles

Abstract Summary VISOR is a tool for haplotype-specific simulations of simple and complex structural variants (SVs). The method is applicable to haploid, diploid or higher ploidy simulations for bulk or single-cell sequencing data. SVs are implanted into FASTA haplotypes at single-basepair resolution, optionally with nearby single-nucleotide variants. Short or long reads are drawn at random from these haplotypes using standard error profiles. Double- or single-stranded data can be simulated and VISOR supports the generation of haplotype-tagged BAM files. The tool further includes methods to interactively visualize simulated variants in single-stranded data. The versatility of VISOR is unmatched by comparable tools and it lays the foundation to simulate haplotype-resolved cancer heterogeneity data in bulk or at single-cell resolution. Availability and implementation VISOR is implemented in Python 3.6, open-source and freely available at https://github.com/davidebolo1993/VISOR. Documentation is available at https://davidebolo1993.github.io/visordoc/. Supplementary information Supplementary data are available at Bioinformatics online.

Moss enables high sensitivity single-nucleotide variant calling from multiple bulk DNA tumor samples

Nature Communications ◽

10.1038/s41467-021-22466-9 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Chuanyi Zhang ◽

Mohammed El-Kebir ◽

Idoia Ochoa

Keyword(s):

Cancer Genomics ◽

Low Frequency ◽

Variant Calling ◽

High Sensitivity ◽

Single Sample ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Additional Time ◽

Single Nucleotide ◽

Multiple Samples

Abstract Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision in a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor’s mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss’ improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics.
