Custom workflows to improve joint variant calling from multiple related tumour samples: FreeBayesSomatic and Strelka2Pass

Bioinformatics ◽

10.1093/bioinformatics/btab606 ◽

2021 ◽

Author(s):

S Hollizeck ◽

S Q Wong ◽

B Solomon ◽

D Chandranada ◽

S-J Dawson

Keyword(s):

Somatic Mutations ◽

Source Code ◽

Variant Calling ◽

Supplementary Information ◽

Normal Sample ◽

Supplementary Data ◽

Sequencing Data ◽

Normal Pair ◽

Single Tumour ◽

Matched Normal Sample

Abstract Summary This work describes two novel workflows for variant calling that extend the widely used algorithms of Strelka2 and FreeBayes to call somatic mutations from multiple related tumour samples and one matched normal sample. We show that these workflows offer higher precision and recall than their single tumour-normal pair equivalents in both simulated and clinical sequencing data. Availability and Implementation Source code freely available at the following link: https://atlassian.petermac.org.au/bitbucket/projects/DAW/repos/multisamplevariantcalling and executable through Janis (https://github.com/PMCC-BioinformaticsCore/janis) under the GPLv3 licence. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Estimating tumor mutational burden from RNA-sequencing without a matched-normal sample

10.1101/2021.08.15.456379 ◽

2021 ◽

Author(s):

Rotem Katzir ◽

Keren Yizhak

Keyword(s):

Rna Sequencing ◽

Somatic Mutations ◽

Superior Performance ◽

Normal Sample ◽

Sequencing Data ◽

Mutational Signatures ◽

Driver Genes ◽

Mutational Burden ◽

Tumor Mutational Burden ◽

Matched Normal Sample

Detection of somatic point mutations using patients sequencing data has many clinical applications, including the identification of cancer driver genes, detection of mutational signatures, and estimation of tumor mutational burden (TMB). In a recent work we developed a tool for detection of somatic mutations using tumor RNA and matched-normal DNA. Here, we further extend it to detect somatic mutations from RNA sequencing data without a matched-normal sample. This is accomplished via a machine learning approach that classifies mutations as either somatic or germline based on various features. When applied to RNA-sequencing of >450 melanoma samples high precision and recall are achieved, and both mutational signatures and driver genes are correctly identified. Finally, we show that RNA-based TMB is significantly associated with patient survival, with similar or superior performance to DNA-based TMB. Our pipeline can be utilized in many future applications, analyzing novel and existing datasets where only RNA is available.

Download Full-text

KEC: unique sequence search by K-mer exclusion

Bioinformatics ◽

10.1093/bioinformatics/btab196 ◽

2021 ◽

Author(s):

Pavel Beran ◽

Dagmar Stehlíková ◽

Stephen P Cohen ◽

Vladislav Čurn

Keyword(s):

Amino Acid ◽

Nucleic Acid ◽

Source Code ◽

Unique Sequence ◽

Supplementary Information ◽

Supplementary Data ◽

Laptop Computers ◽

Sequence Search ◽

Target Sequences ◽

Cross Reference

Abstract Summary Searching for amino acid or nucleic acid sequences unique to one organism may be challenging depending on size of the available datasets. K-mer elimination by cross-reference (KEC) allows users to quickly and easily find unique sequences by providing target and non-target sequences. Due to its speed, it can be used for datasets of genomic size and can be run on desktop or laptop computers with modest specifications. Availability and implementation KEC is freely available for non-commercial purposes. Source code and executable binary files compiled for Linux, Mac and Windows can be downloaded from https://github.com/berybox/KEC. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

BioCommons: a robust java library for RNA structural bioinformatics

Bioinformatics ◽

10.1093/bioinformatics/btab069 ◽

2021 ◽

Author(s):

Tomasz Zok

Keyword(s):

Source Code ◽

Structural Bioinformatics ◽

Supplementary Information ◽

Supplementary Data ◽

Bioinformatic Tools ◽

Data Formats ◽

Central Repository ◽

Diverse Data ◽

2D And 3D ◽

Java Library

Abstract Motivation Biomolecular structures come in multiple representations and diverse data formats. Their incompatibility with the requirements of data analysis programs significantly hinders the analytics and the creation of new structure-oriented bioinformatic tools. Therefore, the need for robust libraries of data processing functions is still growing. Results BioCommons is an open-source, Java library for structural bioinformatics. It contains many functions working with the 2D and 3D structures of biomolecules, with a particular emphasis on RNA. Availability and implementation The library is available in Maven Central Repository and its source code is hosted on GitHub: https://github.com/tzok/BioCommons Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

A unified haplotype-based method for accurate and comprehensive variant calling

10.1101/456103 ◽

2018 ◽

Cited By ~ 3

Author(s):

Daniel P Cooke ◽

David C Wedge ◽

Gerton Lunter

Keyword(s):

De Novo ◽

Variant Calling ◽

Normal Sample ◽

Sequencing Data ◽

Somatic Variation ◽

Data Set ◽

Small Complex ◽

Physical Linkage ◽

Germline Variation ◽

Almost All

Haplotype-based variant callers, which consider physical linkage between variant sites, are currently among the best tools for germline variation discovery and genotyping from short-read sequencing data. However, almost all such tools were designed specifically for detecting common germline variation in diploid populations, and give sub-optimal results in other scenarios. Here we present Octopus, a versatile haplotype-based variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. We show that Octopus accurately calls de novo mutations in parent-offspring trios and germline variants in individuals, including SNVs, indels, and small complex replacements such as microinversions. In addition, using a carefully designed synthetic-tumour data set derived from clean sequencing data from a sample with known germline haplotypes, and observed mutations in large cohort of tumour samples, we show that Octopus accurately characterizes germline and somatic variation in tumours, both with and without a paired normal sample. Sequencing reads and prior information are combined to phase called genotypes of arbitrary ploidy, including those with somatic mutations. Octopus also outputs realigned evidence BAMs to aid validation and interpretation.

Download Full-text

SVIM-asm: Structural variant detection from haploid and diploid genome assemblies

10.1101/2020.10.27.356907 ◽

2020 ◽

Author(s):

David Heller ◽

Martin Vingron

Keyword(s):

Genetic Information ◽

Source Code ◽

Supplementary Information ◽

Supplementary Data ◽

Diploid Genome ◽

Insertions And Deletions ◽

Structural Variant ◽

Sequencing Technologies ◽

Variant Detection ◽

Genome Assemblies

AbstractMotivationWith the availability of new sequencing technologies, the generation of haplotype-resolved genome assemblies up to chromosome scale has become feasible. These assemblies capture the complete genetic information of both parental haplotypes, increase structural variant (SV) calling sensitivity and enable direct genotyping and phasing of SVs. Yet, existing SV callers are designed for haploid genome assemblies only, do not support genotyping or detect only a limited set of SV classes.ResultsWe introduce our method SVIM-asm for the detection and genotyping of six common classes of SVs from haploid and diploid genome assemblies. Compared against the only other existing SV caller for diploid assemblies, DipCall, SVIM-asm detects more SV classes and reached higher F1 scores for the detection of insertions and deletions on two recently published assemblies of the HG002 individual.Availability and ImplementationSVIM-asm has been implemented in Python and can be easily installed via bioconda. Its source code is available at github.com/eldariont/[email protected] informationSupplementary data are available online.

Download Full-text

ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions

Bioinformatics ◽

10.1093/bioinformatics/btz431 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4754-4756 ◽

Cited By ~ 29

Author(s):

Egor Dolzhenko ◽

Viraj Deshpande ◽

Felix Schlesinger ◽

Peter Krusche ◽

Roman Petrovski ◽

...

Keyword(s):

Tandem Repeat ◽

Broad Class ◽

Source Code ◽

Computational Method ◽

Supplementary Information ◽

Dna Repeats ◽

Supplementary Data ◽

Sequence Graph ◽

Version 2.0 ◽

Short Tandem

Abstract Summary We describe a novel computational method for genotyping repeats using sequence graphs. This method addresses the long-standing need to accurately genotype medically important loci containing repeats adjacent to other variants or imperfect DNA repeats such as polyalanine repeats. Here we introduce a new version of our repeat genotyping software, ExpansionHunter, that uses this method to perform targeted genotyping of a broad class of such loci. Availability and implementation ExpansionHunter is implemented in C++ and is available under the Apache License Version 2.0. The source code, documentation, and Linux/macOS binaries are available at https://github.com/Illumina/ExpansionHunter/. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Numt identification and removal with RtN!

Bioinformatics ◽

10.1093/bioinformatics/btaa642 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5115-5116 ◽

Cited By ~ 2

Author(s):

August E Woerner ◽

Jennifer Churchill Cihlar ◽

Utpal Smart ◽

Bruce Budowle

Keyword(s):

Mitochondrial Genome ◽

Massively Parallel Sequencing ◽

Sequence Similarity ◽

Variant Calling ◽

Supplementary Information ◽

Mitochondrial Genomes ◽

Sequencing Data ◽

Read Mapping ◽

Genome Data ◽

Mitochondrial Sequences

Abstract Motivation Assays in mitochondrial genomics rely on accurate read mapping and variant calling. However, there are known and unknown nuclear paralogs that have fundamentally different genetic properties than that of the mitochondrial genome. Such paralogs complicate the interpretation of mitochondrial genome data and confound variant calling. Results Remove the Numts! (RtN!) was developed to categorize reads from massively parallel sequencing data not based on the expected properties and sequence identities of paralogous nuclear encoded mitochondrial sequences, but instead using sequence similarity to a large database of publicly available mitochondrial genomes. RtN! removes low-level sequencing noise and mitochondrial paralogs while not impacting variant calling, while competing methods were shown to remove true variants from mitochondrial mixtures. Availability and implementation https://github.com/Ahhgust/RtN Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

LIONS: analysis suite for detecting and quantifying transposable element initiated transcription from RNA-seq

Bioinformatics ◽

10.1093/bioinformatics/btz130 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3839-3841 ◽

Cited By ~ 6

Author(s):

Artem Babaian ◽

I Richard Thompson ◽

Jake Lever ◽

Liane Gagnier ◽

Mohammad M Karimi ◽

...

Keyword(s):

Transposable Elements ◽

Transposable Element ◽

Test Data ◽

Source Code ◽

Supplementary Information ◽

Transcriptional Networks ◽

Supplementary Data ◽

Rna Seq ◽

Transcriptional Initiation ◽

Instruction Manual

Abstract Summary Transposable elements (TEs) influence the evolution of novel transcriptional networks yet the specific and meaningful interpretation of how TE-derived transcriptional initiation contributes to the transcriptome has been marred by computational and methodological deficiencies. We developed LIONS for the analysis of RNA-seq data to specifically detect and quantify TE-initiated transcripts. Availability and implementation Source code, container, test data and instruction manual are freely available at www.github.com/ababaian/LIONS. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

OpenBioLink: a benchmarking framework for large-scale biomedical link prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa274 ◽

2020 ◽

Vol 36 (13) ◽

pp. 4097-4098 ◽

Cited By ~ 3

Author(s):

Anna Breit ◽

Simon Ott ◽

Asan Agibetov ◽

Matthias Samwald

Keyword(s):

Link Prediction ◽

Large Scale ◽

Source Code ◽

Machine Learning Algorithms ◽

Knowledge Networks ◽

Supplementary Information ◽

Supplementary Data ◽

Biomedical Knowledge ◽

High Quality ◽

Baseline Evaluation

Abstract Summary Recently, novel machine-learning algorithms have shown potential for predicting undiscovered links in biomedical knowledge networks. However, dedicated benchmarks for measuring algorithmic progress have not yet emerged. With OpenBioLink, we introduce a large-scale, high-quality and highly challenging biomedical link prediction benchmark to transparently and reproducibly evaluate such algorithms. Furthermore, we present preliminary baseline evaluation results. Availability and implementation Source code and data are openly available at https://github.com/OpenBioLink/OpenBioLink. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Capybara: equivalence ClAss enumeration of coPhylogenY event-BAsed ReconciliAtions

Bioinformatics ◽

10.1093/bioinformatics/btaa498 ◽

2020 ◽

Vol 36 (14) ◽

pp. 4197-4199

Author(s):

Yishu Wang ◽

Arnaud Mary ◽

Marie-France Sagot ◽

Blerina Sinaimeri

Keyword(s):

Source Code ◽

Supplementary Information ◽

Optimal Solutions ◽

Supplementary Data ◽

Huge Number ◽

Tree Reconciliation ◽

Class Enumeration ◽

Suboptimal Solutions ◽

The Many ◽

Event Based

Abstract Motivation Phylogenetic tree reconciliation is the method of choice in analyzing host-symbiont systems. Despite the many reconciliation tools that have been proposed in the literature, two main issues remain unresolved: (i) listing suboptimal solutions (i.e. whose score is ‘close’ to the optimal ones) and (ii) listing only solutions that are biologically different ‘enough’. The first issue arises because the optimal solutions are not always the ones biologically most significant; providing many suboptimal solutions as alternatives for the optimal ones is thus very useful. The second one is related to the difficulty to analyze an often huge number of optimal solutions. In this article, we propose Capybara that addresses both of these problems in an efficient way. Furthermore, it includes a tool for visualizing the solutions that significantly helps the user in the process of analyzing the results. Availability and implementation The source code, documentation and binaries for all platforms are freely available at https://capybara-doc.readthedocs.io/. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text