Fusion detection and quantification by pseudoalignment

2017 ◽  
Author(s):  
Páll Melsted ◽  
Shannon Hateley ◽  
Isaac Charles Joseph ◽  
Harold Pimentel ◽  
Nicolas Bray ◽  
...  

RNA sequencing in cancer cells is a powerful technique to detect chromosomal rearrangements, allowing for de novo discovery of actively expressed fusion genes. Here we focus on the problem of detecting gene fusions from raw sequencing data, assembling the reads to define fusion transcripts and their associated breakpoints, and quantifying their abundances. Building on the pseudoalignment idea that simplifies and accelerates transcript quantification, we introduce a novel approach to fusion detection based on inspecting paired reads that cannot be pseudoaligned due to conflicting matches. The method and software, called pizzly, filters false positives, assembles new transcripts from the fusion reads, and reports candidate fusions. With pizzly, fusion detection from raw RNA-Seq reads can be performed in a matter of minutes, making the program suitable for the analysis of large cancer gene expression databases and for clinical use. pizzly is available at https://github.com/pmelsted/pizzly
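The central idea, flagging read pairs whose two mates match disjoint sets of transcripts, can be illustrated with a toy sketch. This is not pizzly's implementation: the k-mer index, the value of `K`, the example sequences, and the helper names `build_index`, `pseudoalign`, and `is_fusion_candidate` are all hypothetical, and both mates are assumed to be given in transcript orientation.

```python
from collections import defaultdict

K = 5  # toy k-mer length (assumption); real tools use much longer k-mers


def build_index(transcripts):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].add(name)
    return index


def pseudoalign(read, index):
    """Intersect the transcript sets of all indexed k-mers in the read.

    Returns None if no k-mer of the read is present in the index.
    """
    compatible = None
    for i in range(len(read) - K + 1):
        hits = index.get(read[i:i + K])
        if hits is None:
            continue  # skip k-mers absent from the index
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible


def is_fusion_candidate(mate1, mate2, index):
    """A pair is a candidate when each mate matches, but to disjoint transcripts."""
    s1, s2 = pseudoalign(mate1, index), pseudoalign(mate2, index)
    return bool(s1) and bool(s2) and not (s1 & s2)


# Hypothetical toy transcripts for two different genes.
transcripts = {
    "geneA_t1": "ACGTACGGTCAGTTCA",
    "geneB_t1": "TTGACCATGGCATCGA",
}
index = build_index(transcripts)

# Both mates from geneA: compatible, so not a fusion candidate.
concordant = is_fusion_candidate("ACGTACGG", "GTCAGTTCA", index)
# One mate from geneA, one from geneB: conflicting matches.
discordant = is_fusion_candidate("ACGTACGG", "CATGGCATC", index)
```

In this sketch a pair is reported only when both mates match something; pairs with an unmatched mate are ignored rather than treated as fusion evidence.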

2019 ◽  
Vol 35 (14) ◽  
pp. i225-i232 ◽  
Author(s):  
Xiao Yang ◽  
Yasushi Saito ◽  
Arjun Rao ◽  
Hyunsung John Kim ◽  
Pranav Singh ◽  
...  

Abstract Motivation: Cell-free nucleic acid (cfNA) sequencing data require improvements to existing fusion detection methods along multiple axes: high depth of sequencing, low allele fractions, short fragment lengths and specialized barcodes, such as unique molecular identifiers. Results: AF4 was developed to address these challenges. It uses a novel alignment-free k-mer-based method to detect candidate fusion fragments with high sensitivity and orders of magnitude faster than existing tools. Candidate fragments are then filtered using a max-cover criterion that significantly reduces spurious matches while retaining authentic fusion fragments. This efficient first stage reduces the data sufficiently that commonly used criteria can process the remaining information, or sophisticated filtering policies that may not scale to the raw reads can be used. AF4 provides both targeted and de novo fusion detection modes. We demonstrate both modes in benchmark simulated and real RNA-seq data as well as clinical and cell-line cfNA data. Availability and implementation: AF4 is open sourced, licensed under Apache License 2.0, and is available at: https://github.com/grailbio/bio/tree/master/fusion.
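As a rough illustration of alignment-free, k-mer-based candidate detection, the sketch below looks up a fragment's k-mers in per-gene k-mer sets and scores gene pairs by the fraction of the fragment they jointly cover. The abstract does not define the max-cover criterion, so the coverage rule here is only one plausible reading; `K`, `MIN_COVER`, the gene sequences, and all function names are hypothetical and do not come from AF4.

```python
from itertools import combinations

K = 4            # toy k-mer length (assumption)
MIN_COVER = 0.8  # hypothetical threshold on the covered fraction


def kmers(seq):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + K] for i in range(len(seq) - K + 1)}


def kmer_positions(fragment, gene_kmers):
    """For each gene, collect the fragment positions covered by its k-mers."""
    covered = {gene: set() for gene in gene_kmers}
    for i in range(len(fragment) - K + 1):
        kmer = fragment[i:i + K]
        for gene, known in gene_kmers.items():
            if kmer in known:
                covered[gene].update(range(i, i + K))
    return covered


def best_gene_pair(fragment, gene_kmers):
    """Return (pair, fraction) for the gene pair covering the most bases."""
    covered = kmer_positions(fragment, gene_kmers)
    best, best_frac = None, 0.0
    for g1, g2 in combinations(sorted(covered), 2):
        if not covered[g1] or not covered[g2]:
            continue  # both genes must contribute at least one k-mer
        frac = len(covered[g1] | covered[g2]) / len(fragment)
        if frac > best_frac:
            best, best_frac = (g1, g2), frac
    return best, best_frac


# Hypothetical gene sequences and a fragment spanning a G1/G2 junction.
gene_kmers = {
    "G1": kmers("AAACCCGGG"),
    "G2": kmers("TTTGGATCC"),
    "G3": kmers("TGCATGCA"),
}
pair, frac = best_gene_pair("AAACCCGGATCC", gene_kmers)
is_candidate = pair is not None and frac >= MIN_COVER

# A fragment that comes entirely from G1 yields no gene pair.
no_pair, no_frac = best_gene_pair("AAACCCGGG", gene_kmers)
```

Scanning every gene for every k-mer keeps the sketch short; a hash map from k-mer to gene would be the natural choice at scale.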


BMC Genomics ◽  
2020 ◽  
Vol 21 (S11) ◽  
Author(s):  
Qian Liu ◽  
Yu Hu ◽  
Andres Stucky ◽  
Li Fang ◽  
Jiang F. Zhong ◽  
...  

Abstract Background: Long-read RNA-Seq techniques can generate reads that encompass a large proportion of, or the entire, mRNA/cDNA molecule, so they are expected to address inherent limitations of short-read RNA-Seq techniques that typically generate < 150 bp reads. However, there is a general lack of software tools for gene fusion detection from long-read RNA-seq data that take into account the high basecalling error rates and the presence of alignment errors. Results: In this study, we developed a fast computational tool, LongGF, to efficiently detect candidate gene fusions from long-read RNA-seq data, including cDNA sequencing data and direct mRNA sequencing data. We evaluated LongGF on tens of simulated long-read RNA-seq datasets, and demonstrated its superior performance in gene fusion detection. We also tested LongGF on a Nanopore direct mRNA sequencing dataset and a PacBio sequencing dataset generated on a mixture of 10 cancer cell lines, and found that LongGF performed better than existing computational tools at detecting known gene fusions. Furthermore, we tested LongGF on a Nanopore cDNA sequencing dataset on acute myeloid leukemia, and pinpointed the exact location of a translocation (previously known at cytogenetic resolution) at base resolution, which was further validated by Sanger sequencing. Conclusions: In summary, LongGF will greatly facilitate the discovery of candidate gene fusion events from long-read RNA-Seq data, especially in cancer samples. LongGF is implemented in C++ and is available at https://github.com/WGLab/LongGF.


2020 ◽  
Author(s):  
Maxim Ivanov ◽  
Albin Sandelin ◽  
Sebastian Marquardt

Abstract Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research. Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge of the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.


2017 ◽  
Vol 35 (15_suppl) ◽  
pp. 2019-2019 ◽  
Author(s):  
Deepa Suresh Subramaniam ◽  
Joanne Xiu ◽  
Shwetal Mehta ◽  
Zoran Gatalica ◽  
Jeffrey Swensen ◽  
...  

Background: Fusions involving oncogenes have been reported in gliomas and may serve as novel therapeutic targets. We aim to use RNA-sequencing to interrogate a large cohort of gliomas for targetable genetic fusions. Methods: Gliomas were profiled using the ArcherDx FusionPlex Assay at a CLIA-certified lab (Caris Life Sciences) and 52 gene targets were analyzed. Fusions with preserved kinase domains were investigated. Results: Among 404 gliomas tested, 39 (9.7%) presented potentially targetable fusions, including 24 of 226 (11%) glioblastomas (GBM), 5 of 42 (12%) anaplastic astrocytomas (AA), 2 of 25 (8%) grade II astrocytomas and 3 of 7 (43%) pilocytic astrocytomas (PA). In GBMs, 1 of 15 (6.7%) IDH-mutated tumors had a fusion while 22 of 175 (12.6%) IDH-wild type tumors had fusions. 46 oligodendroglial tumors were profiled and no fusions were seen, which was lower than the frequency of fusions in astrocytic tumors (34/300, p = 0.0236). The most frequent fusions seen involved FGFR3 (N = 12), including 10 FGFR3-TACC3 (1 AA, 6 GBM and 3 glioma NOS); 1 FGFR3-NBR1 (AA) and 1 FGFR3-BRAP (GBM). 11 fusions involving MET were seen, 10 in GBM and 1 in AA. The most common MET fusion was PTPRZ1-MET (1 in AA and 4 in GBM), followed by ST7-MET (N = 3, GBM), CAPZA2-MET (N = 2, GBM) and TPR-MET (N = 1, GBM). 8 NTRK fusions were seen; 1 involving NTRK1 (BCAN-NTRK1, PA), 6 NTRK2 (1 NOS1AP-NTRK2 in AA; GKAP1-NTRK2, KCTD8-NTRK2, TBC1D2-NTRK2 and SQSTM1-NTRK2, 1 each in GBM and 1 VCAN-NTRK2 in grade II astrocytoma) and 1 NTRK3 (EML4-NTRK3 in GBM). EGFR fusions (2 EGFR-SEPT14 and 1 EGFR-VWC2) were seen in 3 GBMs, BRAF in 3 (1 KIAA1549-BRAF, 1 LOC100093631-BRAF in PA and 1 ZSCAN23-BRAF in glioma NOS) and PDGFRA (RAB3IP-PDGFRA, in GBM) in 1. C11orf95-RELA fusions were seen in 2 of 3 grade III ependymomas but not in the 2 grade II ependymomas.
Conclusions: We report targetable fusion genes involving NTRK, MET, EGFR, FGFR3, BRAF and PDGFRA, including novel fusions that have not been previously described in gliomas (e.g., EGFR-VWC2; FGFR3-NBR1). Fusions were seen in over 10% of astrocytic tumors, while none was seen in oligodendrogliomas. Identification of such kinase-associated fusion transcripts may allow us to exploit therapeutic opportunities with targeted therapies in gliomas.
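The reported proportions, and the comparison of oligodendroglial (0/46) with astrocytic (34/300) tumors, can be recomputed with a few lines of Python. The abstract does not say which statistical test produced its p-value; the sketch below assumes a two-sided Fisher's exact test, and the function names are illustrative only.

```python
from math import comb


def percent(k, n):
    """Percentage of k out of n, rounded to one decimal place."""
    return round(100 * k / n, 1)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are no more
    likely than the observed one, with the margins held fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)

    def prob(x):
        return comb(col1, x) * comb(total - col1, row1 - x) / denom

    lo = max(0, row1 - (total - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


overall = percent(39, 404)        # all gliomas with a targetable fusion
idh_mutant = percent(1, 15)       # IDH-mutated GBM
idh_wildtype = percent(22, 175)   # IDH-wild type GBM

# Rows: oligodendroglial, astrocytic; columns: fusion, no fusion.
p_value = fisher_two_sided(0, 46, 34, 300 - 34)
```

Because the test used in the abstract is an assumption here, the computed `p_value` need not reproduce the reported figure exactly.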


2015 ◽  
Vol 2015 ◽  
pp. 1-5 ◽  
Author(s):  
Yuxiang Tan ◽  
Yann Tambouret ◽  
Stefano Monti

The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.
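The accuracy measures mentioned above, precision and recall, follow directly from comparing a caller's predictions with the simulated ground truth. The sketch below is a minimal illustration, not part of SimFuse; the function name and the made-up `GENE_X`/`GENE_Y` call are assumptions chosen for the example.

```python
def precision_recall(predicted, truth):
    """Compute precision and recall for sets of predicted and true fusions."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical simulated truth set and caller output (gene pairs as tuples).
truth = {
    ("BCR", "ABL1"),
    ("EML4", "ALK"),
    ("TMPRSS2", "ERG"),
    ("FGFR3", "TACC3"),
}
predicted = {
    ("BCR", "ABL1"),
    ("EML4", "ALK"),
    ("GENE_X", "GENE_Y"),  # a false positive
}
precision, recall = precision_recall(predicted, truth)
```

Here two of three calls are correct and two of four true fusions are recovered; repeating the calculation across datasets with different numbers of supporting reads gives a performance curve per caller.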


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 1
Author(s):  
Konstantinos Geles ◽  
Domenico Palumbo ◽  
Assunta Sellitto ◽  
Giorgio Giurato ◽  
Eleonora Cianflone ◽  
...  

Current bioinformatics workflows for PIWI-interacting RNA (piRNA) analysis focus primarily on germline-derived piRNAs and piRNA-clusters. Frequently, they suffer from outdated piRNA databases, questionable quantification methods, and lack of reproducibility. Often, pipelines specific to miRNA analysis are used for the piRNA research in silico. Furthermore, the absence of a well-established database for piRNA annotation, as for miRNA, leads to uniformity issues between studies and generates confusion for data analysts and biologists. For these reasons, we have developed WIND (Workflow for pIRNAs aNd beyonD), a bioinformatics workflow that addresses the crucial issue of piRNA annotation, thereby allowing a reliable analysis of small RNA sequencing data for the identification of piRNAs and other small non-coding RNAs (sncRNAs) that in the past have been incorrectly classified as piRNAs. WIND allows the creation of a comprehensive annotation track of sncRNAs combining information available in RNAcentral, with piRNA sequences from piRNABank, the first database dedicated to piRNA annotation. WIND was built with Docker containers for reproducibility and integrates widely used bioinformatics tools for sequence alignment and quantification. In addition, it includes Bioconductor packages for exploratory data and differential expression analysis. Moreover, WIND implements a "dual" approach for the evaluation of sncRNAs expression level quantifying the aligned reads to the annotated genome and carrying out an alignment-free transcript quantification using reads mapped to the transcriptome. Therefore, a broader range of piRNAs can be annotated, improving their quantification and easing the subsequent downstream analysis. WIND performance has been tested with several small RNA-seq datasets, demonstrating how our approach can be a useful and comprehensive resource to analyse piRNAs and other classes of sncRNAs.


2020 ◽  
Vol 4 (Supplement_1) ◽  
Author(s):  
Lori J Wirth ◽  
Elizabeth G Grubbs ◽  
Masha J Livhits ◽  
Steven I Sherman ◽  
Steven P Weitzman ◽  
...  

Abstract Introduction: Receptor tyrosine kinases (RTKs) initiate signaling cascades, including growth and differentiation. Activation can occur through chromosomal rearrangements that lead to gene fusions. RTK fusions are potential targets for small molecule inhibitors to treat advanced cancers. The original Afirma Xpression Atlas (XA) reported 761 selected variants and 130 fusion pairs in Bethesda III/IV Afirma Genomic Sequencing Classifier (GSC) suspicious or Bethesda V/VI nodules. The landscape of additional potentially actionable gene fusions has not been explored in treatment-naïve patients. Methods: Anonymized RNA-seq data from >37,000 Bethesda III-VI samples were examined with STAR-Fusion to determine gene/gene fusions. All samples were examined for NTRK1, NTRK3, RET, ALK, and BRAF fusions, regardless of fusion partner. Fusions were evaluated for being in-frame, with an intact kinase domain at the 3’ end of the fusion pair. Fusion pairs not currently reported by XA and not reported in thyroid TCGA fusion data are denoted “additional”. All fusion pairs were searched for in the literature and public fusion databases. Results: Examining the Veracyte clinical database revealed 7 additional NTRK1/3 fusions, with 3 NTRK fusions observed more than once: SQSTM1/NTRK3, VIM/NTRK3, and EML4/NTRK3. One of the 7 NTRK fusions had not been previously reported. Eight additional ALK fusions were identified, with 4 observed more than once: ITSN2/ALK, PPP1R21/ALK, PDE8B/ALK, and NPAT/ALK. Five of these 8 ALK fusions had not been previously described. Seventeen additional RET fusions were identified, with 5 observed recurrently: KIAA1217/RET, AFAP1L2/RET, ACBD5/RET, SQSTM1/RET, and TFG/RET. Six of the 17 RET fusions had not been previously reported. Seventy-two additional BRAF fusions were identified, and 58 of them had not been previously reported. Eight of the 72 BRAF fusions were observed more than once.
Among >50,000 Afirma samples examined, NTRK1, NTRK3, RET, ALK, or BRAF fusions were not identified in Afirma GSC Benign samples, but were present in 3.2% of 16,594 Bethesda III/IV Afirma GSC Suspicious samples and 8.0% of 1,692 Bethesda V/VI samples. Correlation with surgical histology is unknown. Conclusions: By examining a large cohort of patients with an unbiased, whole-transcriptome RNA-seq assay, we identified potentially actionable kinase fusions in thyroid nodules beyond those described in TCGA. All fusions described here are either novel and not previously reported, rarely reported in one or two case studies, or not described in thyroid cancers. Additional NTRK, ALK, RET and BRAF fusions were found, all of which may be targeted with specific kinase inhibitors currently available. Future studies may determine genotype-phenotype correlations regarding the natural history of these neoplasms. Because of the potential clinical implications of these genomic markers for patient management, all 104 fusions described here are now included among the 235 gene pairs reported by the expanded Afirma XA.
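The per-gene counts given in the Results can be tallied to confirm that they add up to the 104 fusions cited in the Conclusions. The short sketch below only restates numbers from the abstract; the dictionary names are illustrative.

```python
# Additional fusions per kinase gene family, as stated in the Results.
additional = {"NTRK1/3": 7, "ALK": 8, "RET": 17, "BRAF": 72}

# Of those, the fusions described as not previously reported.
not_previously_reported = {"NTRK1/3": 1, "ALK": 5, "RET": 6, "BRAF": 58}

total_additional = sum(additional.values())
total_novel = sum(not_previously_reported.values())

# Share of each family's additional fusions that had not been reported before.
novel_share = {
    gene: not_previously_reported[gene] / additional[gene]
    for gene in additional
}
```

The total matches the 104 fusions mentioned in the Conclusions, and BRAF accounts for most of the fusions that had not been reported before.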


2016 ◽  
Author(s):  
Chengpei Zhu ◽  
Yanling Lv ◽  
Liangcai Wu ◽  
Jinxia Guan ◽  
Xue Bai ◽  
...  

Most hepatocellular carcinoma (HCC) patients are diagnosed at advanced stages and have limited treatment options. Challenges in early stage diagnosis may be due to the genetic complexity of HCC. Gene fusion plays a critical role in tumorigenesis and cancer progression in multiple cancers, yet the identities of fusion genes as potential diagnostic markers in HCC have not been investigated. Paired-end RNA sequencing was performed on noncancerous and cancerous lesions in two representative HBV-HCC patients. Potential fusion genes were identified by STAR-Fusion in STAR software and validated by four publicly available RNA-seq datasets. Fourteen pairs of frozen HBV-related HCC samples and adjacent non-tumor liver tissues were examined by RT-PCR analysis for gene fusion expression. We identified 2,354 different gene fusions in the two HBV-HCC patients. Validation analysis against the four RNA-seq datasets revealed only 1.8% (43/2,354) as recurrent fusions that were supported by public datasets. Comparison with four fusion databases demonstrated that three (HLA-DPB2-HLA-DRB1, CDH23-HLA-DPB1, and C15orf57-CBX3) out of 43 recurrent gene fusions were annotated as disease-related fusion events. Nineteen were novel recurrent fusions not previously annotated to diseases, including DCUN1D3-GSG1L and SERPINA5-SERPINA9. RT-PCR and Sanger sequencing of 14 pairs of HBV-related HCC samples confirmed expression of six of the new fusions, including RP11-476K15.1-CTD-2015H3.2. Our study provides new insights into gene fusions in HCC and could contribute to the development of anti-HCC therapy. RP11-476K15.1-CTD-2015H3.2 may serve as a new therapeutic biomarker in HCC.

