Fusion detection and quantification by pseudoalignment

Alignment-free filtering for cfNA fusion fragments

Bioinformatics ◽

10.1093/bioinformatics/btz346 ◽

2019 ◽

Vol 35 (14) ◽

pp. i225-i232 ◽

Cited By ~ 2

Author(s):

Xiao Yang ◽

Yasushi Saito ◽

Arjun Rao ◽

Hyunsung John Kim ◽

Pranav Singh ◽

...

Keyword(s):

Nucleic Acid ◽

Cell Line ◽

De Novo ◽

High Sensitivity ◽

Detection Methods ◽

Rna Seq ◽

Sequencing Data ◽

Alignment Free ◽

Fusion Detection ◽

High Depth

Abstract Motivation: Cell-free nucleic acid (cfNA) sequencing data require improvements to existing fusion detection methods along multiple axes: high depth of sequencing, low allele fractions, short fragment lengths and specialized barcodes, such as unique molecular identifiers. Results: AF4 was developed to address these challenges. It uses a novel alignment-free k-mer-based method to detect candidate fusion fragments with high sensitivity and orders of magnitude faster than existing tools. Candidate fragments are then filtered using a max-cover criterion that significantly reduces spurious matches while retaining authentic fusion fragments. This efficient first stage reduces the data enough that the remaining fragments can be processed with commonly used criteria or with sophisticated filtering policies that may not scale to the raw reads. AF4 provides both targeted and de novo fusion detection modes. We demonstrate both modes in benchmark simulated and real RNA-seq data as well as clinical and cell-line cfNA data. Availability and implementation: AF4 is open sourced, licensed under Apache License 2.0, and is available at: https://github.com/grailbio/bio/tree/master/fusion.
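
The alignment-free idea in this abstract can be illustrated with a toy sketch: index the k-mers of each gene, then flag a fragment as a fusion candidate when its k-mers point to two different genes. This is not AF4's implementation. The function names, the toy sequences, the value of k, and the `min_hits` threshold are all assumptions made for illustration, and the max-cover filtering step is not reproduced here.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(transcripts, k):
    """Map each k-mer to the set of gene names whose sequence contains it."""
    index = defaultdict(set)
    for gene, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(gene)
    return index

def candidate_genes(fragment, index, k, min_hits=2):
    """Return genes supported by at least min_hits gene-unique k-mers."""
    hits = defaultdict(int)
    for km in kmers(fragment, k):
        genes = index.get(km, ())
        if len(genes) == 1:  # skip k-mers that are absent or shared between genes
            hits[next(iter(genes))] += 1
    return {gene for gene, n in hits.items() if n >= min_hits}

def is_fusion_candidate(fragment, index, k, min_hits=2):
    """A fragment is a candidate when two or more genes are supported."""
    return len(candidate_genes(fragment, index, k, min_hits)) >= 2

# Hypothetical toy data: two short "genes" and a fragment joining them.
K = 4
TRANSCRIPTS = {
    "GENE_A": "ACGTACGGTCATTGCA",
    "GENE_B": "TTGACCGGATAGCTAA",
}
INDEX = build_index(TRANSCRIPTS, K)
FUSED = TRANSCRIPTS["GENE_A"][:10] + TRANSCRIPTS["GENE_B"][6:]

print(candidate_genes(FUSED, INDEX, K))
print(is_fusion_candidate(TRANSCRIPTS["GENE_A"], INDEX, K))
```

In this sketch a fragment taken entirely from one gene supports only that gene and is rejected, while the joined fragment supports both. A real tool would also need to handle reverse complements, sequencing errors, and paralogous genes.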

LongGF: computational algorithm and software tool for fast and accurate detection of gene fusions by long-read transcriptome sequencing

BMC Genomics ◽

10.1186/s12864-020-07207-4 ◽

2020 ◽

Vol 21 (S11) ◽

Author(s):

Qian Liu ◽

Yu Hu ◽

Andres Stucky ◽

Li Fang ◽

Jiang F. Zhong ◽

...

Keyword(s):

Candidate Gene ◽

Gene Fusion ◽

Superior Performance ◽

Gene Fusions ◽

Rna Seq ◽

Cdna Sequencing ◽

Sequencing Data ◽

Mrna Sequencing ◽

Long Read ◽

Fusion Detection

Abstract Background: Long-read RNA-Seq techniques can generate reads that encompass a large proportion of, or entire, mRNA/cDNA molecules, so they are expected to address inherent limitations of short-read RNA-Seq techniques that typically generate < 150 bp reads. However, there is a general lack of software tools for gene fusion detection from long-read RNA-seq data that take into account the high basecalling error rates and the presence of alignment errors. Results: In this study, we developed a fast computational tool, LongGF, to efficiently detect candidate gene fusions from long-read RNA-seq data, including cDNA sequencing data and direct mRNA sequencing data. We evaluated LongGF on tens of simulated long-read RNA-seq datasets, and demonstrated its superior performance in gene fusion detection. We also tested LongGF on a Nanopore direct mRNA sequencing dataset and a PacBio sequencing dataset generated on a mixture of 10 cancer cell lines, and found that LongGF detected known gene fusions better than existing computational tools. Furthermore, we tested LongGF on a Nanopore cDNA sequencing dataset on acute myeloid leukemia, and pinpointed the exact location of a translocation (previously known at cytogenetic resolution) at base resolution, which was further validated by Sanger sequencing. Conclusions: In summary, LongGF will greatly facilitate the discovery of candidate gene fusion events from long-read RNA-Seq data, especially in cancer samples. LongGF is implemented in C++ and is available at https://github.com/WGLab/LongGF.

TranscriptomeReconstructoR, A Data-Driven Annotation of Complex Transcriptomes

10.21203/rs.3.rs-131404/v1 ◽

2020 ◽

Author(s):

Maxim Ivanov ◽

Albin Sandelin ◽

Sebastian Marquardt

Keyword(s):

De Novo ◽

Gene Annotation ◽

R Package ◽

Sequence Information ◽

Rna Seq ◽

Sequencing Data ◽

Gene Model ◽

Preparation Methods ◽

Downstream Analysis

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research. Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge on the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.

RNA-Seq analysis of glioma tumors to reveal targetable gene fusions.

Journal of Clinical Oncology ◽

10.1200/jco.2017.35.15_suppl.2019 ◽

2017 ◽

Vol 35 (15_suppl) ◽

pp. 2019-2019 ◽

Cited By ~ 4

Author(s):

Deepa Suresh Subramaniam ◽

Joanne Xiu ◽

Shwetal Mehta ◽

Zoran Gatalica ◽

Jeffrey Swensen ◽

...

Keyword(s):

Life Sciences ◽

Gene Fusions ◽

Astrocytic Tumors ◽

Fusion Genes ◽

Rna Seq ◽

Wild Type ◽

Oligodendroglial Tumors ◽

Grade Ii ◽

Glioma Tumors ◽

Grade Iii

Background: Fusions involving oncogenes have been reported in gliomas and may serve as novel therapeutic targets. We aim to use RNA-sequencing to interrogate a large cohort of gliomas for targetable genetic fusions. Methods: Gliomas were profiled using the ArcherDx FusionPlex Assay at a CLIA-certified lab (Caris Life Sciences) and 52 gene targets were analyzed. Fusions with preserved kinase domains were investigated. Results: Among 404 gliomas tested, 39 (9.7%) presented potentially targetable fusions, of which 24/226 (11%) of glioblastoma (GBM), 5/42 (12%) of anaplastic astrocytoma (AA), 2/25 (8%) of grade II astrocytoma and 3 of 7 (43%) of pilocytic astrocytoma (PA) harbored targetable fusions. In GBMs, 1 of 15 (6.7%) IDH-mutated tumors had a fusion while 22 of 175 (12.6%) IDH-wild type tumors had fusions. 46 oligodendroglial tumors were profiled and no fusions were seen, which was lower than the frequency of fusions in astrocytic tumors (34/300, p = 0.0236). The most frequent fusions seen involved FGFR3 (N = 12), including 10 FGFR3-TACC3 (1 AA, 6 GBM and 3 glioma NOS); 1 FGFR3-NBR1 (AA) and 1 FGFR3-BRAP (GBM). 11 fusions involving MET were seen, 10 in GBM and 1 in AA. The most common MET fusion was PTPRZ1-MET (1 in AA and 4 in GBM), followed by ST7-MET (N = 3, GBM), CAPZA2-MET (N = 2, GBM) and TPR-MET (N = 1, GBM). 8 NTRK fusions were seen; 1 involving NTRK1 (BCAN-NTRK1, PA), 6 NTRK2 (1 NOS1AP-NTRK2 in AA; GKAP1-NTRK2, KCTD8-NTRK2, TBC1D2-NTRK2 and SQSTM1-NTRK2, 1 each in GBM and 1 VCAN-NTRK2 in grade II astrocytoma) and 1 NTRK3 (EML4-NTRK3 in GBM). EGFR fusions (2 EGFR-SEPT14 and 1 EGFR-VWC2) were seen in 3 GBMs, BRAF in 3 (1 KIAA1549-BRAF, 1 LOC100093631-BRAF in PA and 1 ZSCAN23-BRAF in glioma NOS) and PDGFRA (RAB3IP-PDGFRA, in GBM) in 1. C11orf95-RELA fusions were seen in 2 of 3 grade III ependymomas but not in the 2 grade II ependymomas.
Conclusions: We report targetable fusion genes involving NTRK, MET, EGFR, FGFR3, BRAF and PDGFRA including novel fusions that have not been previously described in gliomas (e.g., EGFR-VWC2; FGFR3-NBR1). Fusions were seen in over 10% of astrocytic tumors, while none was seen in oligodendrogliomas. Identification of such kinase-associated fusion transcripts may allow us to exploit therapeutic opportunities with targeted therapies in gliomas.
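
The subgroup percentages in the Results of this abstract can be recomputed from the stated counts. The short sketch below does only that arithmetic; the dictionary layout and names are illustrative choices, and the reported p-value is not recomputed because the abstract does not name the statistical test used.

```python
# (fusion-positive, tested) counts as stated in the abstract above.
REPORTED = {
    "all gliomas": (39, 404),
    "glioblastoma": (24, 226),
    "anaplastic astrocytoma": (5, 42),
    "grade II astrocytoma": (2, 25),
    "pilocytic astrocytoma": (3, 7),
    "IDH-mutated GBM": (1, 15),
    "IDH-wild type GBM": (22, 175),
    "astrocytic tumors": (34, 300),
}

def percent(positive, tested):
    """Fusion-positive share of the tested tumors, in percent."""
    return 100.0 * positive / tested

RATES = {name: round(percent(p, n), 1) for name, (p, n) in REPORTED.items()}

for name, rate in RATES.items():
    print(f"{name}: {rate}%")

# The four astrocytic subgroups add up to the pooled astrocytic numerator.
ASTROCYTIC_SUBGROUPS = ["glioblastoma", "anaplastic astrocytoma",
                        "grade II astrocytoma", "pilocytic astrocytoma"]
ASTROCYTIC_POSITIVE = sum(REPORTED[g][0] for g in ASTROCYTIC_SUBGROUPS)
print("astrocytic fusion-positive total:", ASTROCYTIC_POSITIVE)
```

The recomputed values agree with the rounded figures quoted in the text, and the glioblastoma, AA, grade II and PA counts together account for the 34 fusion-positive astrocytic tumors.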

SimFuse: A Novel Fusion Simulator for RNA Sequencing (RNA-Seq) Data

BioMed Research International ◽

10.1155/2015/780519 ◽

2015 ◽

Vol 2015 ◽

pp. 1-5 ◽

Cited By ~ 2

Author(s):

Yuxiang Tan ◽

Yann Tambouret ◽

Stefano Monti

Keyword(s):

Sample Size ◽

Rna Sequencing ◽

High Throughput Sequencing ◽

Performance Metrics ◽

Simulated Data ◽

Real Data ◽

Rna Seq ◽

Sequencing Data ◽

Detection Algorithms ◽

Fusion Detection

The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.
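
The precision and recall measures mentioned in this abstract can be computed once a simulated dataset supplies the set of true fusions. The sketch below is a minimal illustration, not part of SimFuse: the function name is an assumption, and the gene pairs are placeholder examples (`GENE_X`/`GENE_Y` is a made-up false positive).

```python
def precision_recall(called, truth):
    """Return (precision, recall) for called fusions against known true fusions."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical example: four simulated fusions, three calls from a detector.
TRUTH = {("BCR", "ABL1"), ("EML4", "ALK"), ("TMPRSS2", "ERG"), ("FGFR3", "TACC3")}
CALLED = {("BCR", "ABL1"), ("EML4", "ALK"), ("GENE_X", "GENE_Y")}

PRECISION, RECALL = precision_recall(CALLED, TRUTH)
print(f"precision={PRECISION:.3f} recall={RECALL:.3f}")
```

Here two of the three calls are correct and two of the four true fusions are recovered. Repeating this over datasets with different numbers of supporting reads would give the kind of read-depth-specific evaluation the abstract describes.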

WIND (Workflow for pIRNAs aNd beyonD): a strategy for in-depth analysis of small RNA-seq data

F1000Research ◽

10.12688/f1000research.27868.1 ◽

2021 ◽

Vol 10 ◽

pp. 1

Author(s):

Konstantinos Geles ◽

Domenico Palumbo ◽

Assunta Sellitto ◽

Giorgio Giurato ◽

Eleonora Cianflone ◽

...

Keyword(s):

Small Rna ◽

Differential Expression Analysis ◽

Small Rna Sequencing ◽

Rna Seq ◽

Sequencing Data ◽

Transcript Quantification ◽

Annotation Track ◽

Depth Analysis ◽

Exploratory Data ◽

Downstream Analysis

Current bioinformatics workflows for PIWI-interacting RNA (piRNA) analysis focus primarily on germline-derived piRNAs and piRNA-clusters. Frequently, they suffer from outdated piRNA databases, questionable quantification methods, and lack of reproducibility. Often, pipelines specific to miRNA analysis are used for the piRNA research in silico. Furthermore, the absence of a well-established database for piRNA annotation, as for miRNA, leads to uniformity issues between studies and generates confusion for data analysts and biologists. For these reasons, we have developed WIND (Workflow for pIRNAs aNd beyonD), a bioinformatics workflow that addresses the crucial issue of piRNA annotation, thereby allowing a reliable analysis of small RNA sequencing data for the identification of piRNAs and other small non-coding RNAs (sncRNAs) that in the past have been incorrectly classified as piRNAs. WIND allows the creation of a comprehensive annotation track of sncRNAs combining information available in RNAcentral, with piRNA sequences from piRNABank, the first database dedicated to piRNA annotation. WIND was built with Docker containers for reproducibility and integrates widely used bioinformatics tools for sequence alignment and quantification. In addition, it includes Bioconductor packages for exploratory data and differential expression analysis. Moreover, WIND implements a "dual" approach for the evaluation of sncRNAs expression level quantifying the aligned reads to the annotated genome and carrying out an alignment-free transcript quantification using reads mapped to the transcriptome. Therefore, a broader range of piRNAs can be annotated, improving their quantification and easing the subsequent downstream analysis. WIND performance has been tested with several small RNA-seq datasets, demonstrating how our approach can be a useful and comprehensive resource to analyse piRNAs and other classes of sncRNAs.

OR28-04 Identification of Novel and Rare Receptor Tyrosine Kinase Fusions in Thyroid Fine Needle Aspirates

Journal of the Endocrine Society ◽

10.1210/jendso/bvaa046.2185 ◽

2020 ◽

Vol 4 (Supplement_1) ◽

Author(s):

Lori J Wirth ◽

Elizabeth G Grubbs ◽

Masha J Livhits ◽

Steven I Sherman ◽

Steven P Weitzman ◽

...

Keyword(s):

Tyrosine Kinases ◽

Kinase Inhibitors ◽

Chromosomal Rearrangements ◽

Kinase Domain ◽

Fusion Partner ◽

Gene Fusions ◽

Rna Seq ◽

Fine Needle Aspirates ◽

Genomic Markers ◽

Treatment Naïve

Abstract Introduction: Receptor tyrosine kinases (RTKs) initiate signaling cascades, including growth and differentiation. Activation can occur through chromosomal rearrangements that lead to gene fusions. RTK fusions are potential targets for small molecule inhibitors to treat advanced cancers. The original Afirma Xpression Atlas (XA) reported 761 selected variants and 130 fusion pairs in Bethesda III/IV Afirma Genomic Sequencing Classifier (GSC) suspicious or Bethesda V/VI nodules. The landscape of additional potentially actionable gene fusions has not been explored in treatment-naïve patients. Methods: Anonymized RNA-seq data from >37,000 Bethesda III-VI samples were examined with STAR-fusion to determine gene/gene fusions. All samples were examined for NTRK1, NTRK3, RET, ALK, and BRAF fusions, regardless of fusion partner. Fusions were evaluated for being in-frame, with an intact kinase domain at the 3’ end of the fusion pair. Fusion pairs not currently reported by XA and not reported in thyroid TCGA fusion data are denoted “additional”. All fusion pairs were searched for in the literature and public fusion databases. Results: Examining the Veracyte clinical database revealed 7 additional NTRK1/3 fusions, with 3 NTRK fusions observed more than once - SQSTM1/NTRK3, VIM/NTRK3, and EML4/NTRK3. One of the 7 NTRK fusions had not been previously reported. Eight additional ALK fusions were identified, with 4 observed more than once - ITSN2/ALK, PPP1R21/ALK, PDE8B/ALK, and NPAT/ALK. Five of these 8 ALK fusions had not been previously described. Seventeen additional RET fusions were identified, with 5 observed recurrently - KIAA1217/RET, AFAP1L2/RET, ACBD5/RET, SQSTM1/RET, and TFG/RET. Six of the 17 RET fusions had not been previously reported. Seventy-two additional BRAF fusions were identified, and 58 of them have not been previously reported. Eight of the 72 BRAF fusions were observed more than once.
Examining >50,000 Afirma samples, NTRK1, NTRK3, RET, ALK, or BRAF fusions were not identified among the Afirma GSC Benign, and were present in 3.2% of 16,594 Bethesda III/IV Afirma GSC Suspicious samples, and 8.0% of 1,692 Bethesda V/VI samples. Correlation with surgical histology is unknown. Conclusions: By examining a large cohort of patients with an unbiased, whole-transcriptome RNA-seq assay, we identified potentially actionable kinase fusions in thyroid nodules beyond those described in TCGA. All fusions described here are either novel and not previously reported, rarely reported in one or two case studies, or not described in thyroid cancers. Additional NTRK, ALK, RET and BRAF fusions were found, all of which may be targeted with specific kinase inhibitors currently available. Future studies may determine genotype-phenotype correlations regarding the natural history of these neoplasms. Because of the potential clinical implications of these genomic markers for patient management, all 104 fusions described here are now included among the 235 gene pairs reported by the expanded Afirma XA.

The landscape of gene fusions in hepatocellular carcinoma

10.1101/055376 ◽

2016 ◽

Author(s):

Chengpei Zhu ◽

Yanling Lv ◽

Liangcai Wu ◽

Jinxia Guan ◽

Xue Bai ◽

...

Keyword(s):

Hepatocellular Carcinoma ◽

Cancer Progression ◽

Gene Fusion ◽

Early Stage ◽

Pcr Analysis ◽

Gene Fusions ◽

Fusion Genes ◽

Rna Seq ◽

Rt Pcr ◽

Critical Function

Abstract Most hepatocellular carcinoma (HCC) patients are diagnosed at advanced stages and suffer limited treatment options. Challenges in early stage diagnosis may be due to the genetic complexity of HCC. Gene fusion plays a critical function in tumorigenesis and cancer progression in multiple cancers, yet the identities of fusion genes as potential diagnostic markers in HCC have not been investigated. Paired-end RNA sequencing was performed on noncancerous and cancerous lesions in two representative HBV-HCC patients. Potential fusion genes were identified by STAR-Fusion in STAR software and validated by four publicly available RNA-seq datasets. Fourteen pairs of frozen HBV-related HCC samples and adjacent non-tumor liver tissues were examined by RT-PCR analysis for gene fusion expression. We identified 2,354 different gene fusions in the two HBV-HCC patients. Validation analysis against the four RNA-seq datasets revealed only 1.8% (43/2,354) as recurrent fusions that were supported by public datasets. Comparison with four fusion databases demonstrated that three (HLA-DPB2-HLA-DRB1, CDH23-HLA-DPB1, and C15orf57-CBX3) out of 43 recurrent gene fusions were annotated as disease-related fusion events. Nineteen were novel recurrent fusions not previously annotated to diseases, including DCUN1D3-GSG1L and SERPINA5-SERPINA9. RT-PCR and Sanger sequencing of 14 pairs of HBV-related HCC samples confirmed expression of six of the new fusions, including RP11-476K15.1-CTD-2015H3.2. Our study provides new insights into gene fusions in HCC and could contribute to the development of anti-HCC therapy. RP11-476K15.1-CTD-2015H3.2 may serve as a new therapeutic biomarker in HCC.

WIND (Workflow for pIRNAs aNd beyonD): a strategy for in-depth analysis of small RNA-seq data

F1000Research ◽

10.12688/f1000research.27868.2 ◽

2021 ◽

Vol 10 ◽

pp. 1

Author(s):

Konstantinos Geles ◽

Domenico Palumbo ◽

Assunta Sellitto ◽

Giorgio Giurato ◽

Eleonora Cianflone ◽

...

Keyword(s):

Small Rna ◽

Differential Expression Analysis ◽

Small Rna Sequencing ◽

Rna Seq ◽

Sequencing Data ◽

Transcript Quantification ◽

Annotation Track ◽

Depth Analysis ◽

Exploratory Data ◽

Downstream Analysis

Current bioinformatics workflows for PIWI-interacting RNA (piRNA) analysis focus primarily on germline-derived piRNAs and piRNA-clusters. Frequently, they suffer from outdated piRNA databases, questionable quantification methods, and lack of reproducibility. Often, pipelines specific to miRNA analysis are used for the piRNA research in silico. Furthermore, the absence of a well-established database for piRNA annotation, as for miRNA, leads to uniformity issues between studies and generates confusion for data analysts and biologists. For these reasons, we have developed WIND (Workflow for pIRNAs aNd beyonD), a bioinformatics workflow that addresses the crucial issue of piRNA annotation, thereby allowing a reliable analysis of small RNA sequencing data for the identification of piRNAs and other small non-coding RNAs (sncRNAs) that in the past have been incorrectly classified as piRNAs. WIND allows the creation of a comprehensive annotation track of sncRNAs combining information available in RNAcentral, with piRNA sequences from piRNABank, the first database dedicated to piRNA annotation. WIND was built with Docker containers for reproducibility and integrates widely used bioinformatics tools for sequence alignment and quantification. In addition, it includes Bioconductor packages for exploratory data and differential expression analysis. Moreover, WIND implements a "dual" approach for the evaluation of sncRNAs expression level quantifying the aligned reads to the annotated genome and carrying out an alignment-free transcript quantification using reads mapped to the transcriptome. Therefore, a broader range of piRNAs can be annotated, improving their quantification and easing the subsequent downstream analysis. WIND performance has been tested with several small RNA-seq datasets, demonstrating how our approach can be a useful and comprehensive resource to analyse piRNAs and other classes of sncRNAs.

TranscriptomeReconstructoR: data-driven annotation of complex transcriptomes

10.1101/2020.12.10.418897 ◽

2020 ◽

Author(s):

Maxim Ivanov ◽

Albin Sandelin ◽

Sebastian Marquardt

Keyword(s):

De Novo ◽

Gene Annotation ◽

R Package ◽

Sequence Information ◽

Rna Seq ◽

Sequencing Data ◽

Gene Model ◽

Preparation Methods ◽

Downstream Analysis

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5’ and 3’ tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research. Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge on the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.
