STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq

2017 ◽  
Author(s):  
Brian J. Haas ◽  
Alex Dobin ◽  
Nicolas Stransky ◽  
Bo Li ◽  
Xiao Yang ◽  
...  

Abstract Motivation Fusion genes created by genomic rearrangements can be potent drivers of tumorigenesis. However, accurate identification of functional fusion genes from genomic sequencing requires whole genome sequencing, since exonic sequencing alone is often insufficient. Transcriptome sequencing provides a direct, highly effective alternative for capturing molecular evidence of expressed fusions in the precision medicine pipeline, but current methods tend to be inefficient or insufficiently accurate, lacking in sensitivity or predicting large numbers of false positives. Here, we describe STAR-Fusion, a method that is both fast and accurate in identifying fusion transcripts from RNA-Seq data. Results We benchmarked STAR-Fusion’s fusion detection accuracy using both simulated and genuine Illumina paired-end RNA-Seq data, and show that it has superior performance compared to popular alternative fusion detection methods. Availability and implementation STAR-Fusion is implemented in Perl and freely available as open source software at http://star-fusion.github.io.

2021 ◽  
Author(s):  
Hamid Reza Mohebbi ◽  
Nurit Haspel

Gene fusion events, which are the result of two genes fused together to create a hybrid gene, were first described in cancer cells in the early 1980s. These events are relatively common in many cancers, including prostate, lymphoid, soft tissue, and breast. Recent advances in next-generation sequencing (NGS) provide a high volume of genomic data, including cancer genomes. The detection of possible gene fusions requires fast and accurate methods. However, current methods suffer from inefficiency, lack of sufficient accuracy, and a high false-positive rate. We present FDJD, an RNA-Seq fusion detection method that uses dimensionality reduction and parallel computing to speed up the computation. We convert the RNA categorical space into a compact binary array called binary fingerprints, which enables us to reduce memory usage and increase efficiency. The search for and detection of fusion candidates are done using the Jaccard distance. The detection of candidates is followed by refinement. We benchmarked our fusion prediction accuracy using both simulated and genuine RNA-Seq datasets. Genuine paired-end Illumina RNA-Seq data were obtained from 60 publicly available cancer cell line data sets. The results are compared against state-of-the-art methods such as STAR-Fusion, InFusion, and TopHat-Fusion. Our results show that FDJD exhibits superior accuracy compared to popular alternative fusion detection methods. We achieved 90% accuracy on simulated fusion transcript inputs, which is the highest among the compared methods, while maintaining comparable run time.
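To make the fingerprint idea concrete, the following is a minimal sketch of comparing two sequences by the Jaccard distance between binary fingerprints. The abstract does not say how the fingerprints are built, so hashing k-mers into a fixed-width bit array is an assumption, as are the function names `fingerprint` and `jaccard_distance` and the parameters `k` and `n_bits`.

```python
import zlib


def fingerprint(seq: str, k: int = 4, n_bits: int = 64) -> int:
    """Hash every k-mer of seq into a fixed-width bit array (stored as an int).

    Assumption: k-mer hashing is one plausible way to build a binary
    fingerprint; the abstract does not describe the actual construction.
    """
    bits = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bits |= 1 << (zlib.crc32(kmer.encode()) % n_bits)
    return bits


def jaccard_distance(a: int, b: int) -> float:
    """Return 1 - |A & B| / |A | B| over the set bits of two fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - bin(a & b).count("1") / union
```

A distance of 0 means the two fingerprints set exactly the same bits, and a distance of 1 means they share none. Because each fingerprint is a single integer, the comparison reduces to bitwise operations, which is where the memory and speed benefit of a compact binary representation would come from.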


Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 4655-4655
Author(s):  
Paul Kerbs ◽  
Aarif Mohamed Nazeer Batcha ◽  
Sebastian Vosberg ◽  
Dirk Metzler ◽  
Tobias Herold ◽  
...  

Accurate and complete genetic classification of AML is crucial for the prediction of clinical outcome and treatment stratification. Deciphering the spectrum of genetic abnormalities by polymerase chain reaction (PCR), karyotyping and fluorescence in situ hybridization (FISH) in routine diagnostics is the current gold standard; however, fusion genes might be missed by these assays. Recently, several methods have been developed to improve the detection of gene fusion transcripts based on RNA sequencing data, providing robust results. To test the detection power and assess the applicability of RNA-Seq-based methods in clinical diagnostics, we applied two different algorithms, namely FusionCatcher (Nicorici D et al., bioRxiv, 2014) and Arriba (Uhrig S et al., DKFZ, https://github.com/suhrig/arriba), to the transcriptomes of 895 well-characterized AML samples from three independently sequenced cohorts: AMLCG (Herold T et al., Haematologica, 2018, n=261), DKTK (Greif PA et al., Clin Cancer Res, 2018 and unpublished data, n=166), BeatAML (Tyner JW et al., Nature 2018, n=468) and publicly available healthy control samples (SRA studies: SRP018028, SRP047126, SRP050146, SRP105369, SRP115911, SRP133442, n=38). According to karyotyping, 31% (277/895) of samples harbored chromosomal aberrations putatively causing gene fusions (i.e. translocations, interstitial deletions, duplications, inversions, insertions). Analyses by FISH and/or PCR confirmed these rearrangements in 51.3% (142/277) of samples, whereas fusion detection by means of RNA-Seq showed evidence for fusion genes corresponding to these rearrangements in 60.3% (167/277) of samples. Chromosomal aberrations identified by karyotyping that are known to result in clinically relevant fusions (e.g. RUNX1-RUNX1T1, KMT2A fusions) were confirmed in most cases by FISH/PCR (AMLCG: n=27/27, DKTK: n=21/21, BeatAML: n=54/57) and by RNA-Seq-based methods (AMLCG: n=17/27, DKTK: n=21/21, BeatAML: n=56/57).
Of note, the AMLCG cohort was sequenced using the SENSE mRNA Library Prep Kit from Lexogen, which seems not to be optimal for fusion detection. Furthermore, 19 samples (AMLCG: n=12, DKTK: n=4, BeatAML: n=3) were found to harbor known pathogenic fusions, described in previous studies, which were not reported by routine diagnostics: NUP98-NSD1 (n=11); CBFB-MYH11, RUNX1-RUNX1T1 and DEK-NUP214 (n=2 each); RUNX1-CBFA2T2 and RUNX1-CBFA2T3 (n=1 each). Reanalysis of six of these samples by PCR confirmed three fusions which were initially missed by routine diagnostics. In general, the number of fusion events reported by RNA-Seq is high (on average 69 and 39 per sample as detected by FusionCatcher and Arriba, respectively), even after applying the built-in filters, indicating a high false positive rate. To robustly identify putative novel fusions, we developed a filtering pipeline and incorporated two new filtering steps. The promiscuity score (PS) of a fusion measures the number of additional distinct fusion partners which were detected in the respective cohort for the 5' and 3' gene. The fusion transcript score (FTS) measures the abundance of a fusion transcript relative to its 5' and 3' partner genes. PS and FTS of known, clinically relevant fusions confirmed by FISH/PCR were used to define cut-offs. To further maximize specificity while maintaining sensitivity, we excluded fusion events which we detected in publicly available healthy samples and subsequently filtered for overlapping calls from FusionCatcher and Arriba (Fig. 1A). Additionally, we obtained further evidence for a fusion event from an elevated transcription of the 3' fusion partner. In case of a fusion event, the transcription of the 3' partner gene likely comes under the control of the promoter of the 5' partner gene. This results in an elevated transcription of genes which are otherwise transcribed at low levels (Fig. 1B-C).
Thus, we identified five putatively novel recurrent fusion genes which were detected in two cohorts independently: NRIP1-MIR99AHG, LATS2-ZMYM2, ATP11A-ING1, MBP-SLC66A2, PRDM16-SKI (Fig. 1D-F). Although these events were called with high evidence, we aim at independent validation by complementary methods. In our study, we have demonstrated that the application of RNA-Seq to the detection of fusion genes is not only a valuable complement to routine diagnostics but also has the potential to discover novel, putatively pathogenic fusions. Disclosures No relevant conflicts of interest to declare.
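The two filtering scores described in this abstract can be sketched roughly as follows. The abstract gives only verbal definitions, so the exact formulas here are assumptions: PS is taken as the count of other distinct partners seen for either gene in the cohort (ignoring orientation), and FTS as fusion-supporting reads divided by the mean read count of the two partner genes. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def promiscuity_scores(calls):
    """Map each (gene5, gene3) call to an assumed promiscuity score (PS).

    PS here = number of *other* distinct partners seen in the cohort for the
    5' gene plus those seen for the 3' gene (assumed reading of the abstract).
    """
    calls = list(calls)
    partners = defaultdict(set)
    for gene5, gene3 in calls:
        partners[gene5].add(gene3)
        partners[gene3].add(gene5)
    return {
        (gene5, gene3): (len(partners[gene5]) - 1) + (len(partners[gene3]) - 1)
        for gene5, gene3 in set(calls)
    }


def fusion_transcript_score(fusion_reads, gene5_reads, gene3_reads):
    """Fusion reads relative to the mean partner-gene read count (assumed formula)."""
    denom = (gene5_reads + gene3_reads) / 2
    return fusion_reads / denom if denom else 0.0


def filter_calls(caller_a, caller_b, healthy, ps, max_ps):
    """Keep calls made by both callers, absent from healthy samples, with PS <= max_ps."""
    kept = (set(caller_a) & set(caller_b)) - set(healthy)
    return {fusion for fusion in kept if ps.get(fusion, 0) <= max_ps}
```

In this sketch the cut-off `max_ps` would be chosen from the scores of known, confirmed fusions, mirroring how the abstract says the cut-offs were defined; an FTS threshold could be applied in the same way.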


2017 ◽  
Vol 31 (2) ◽  
pp. 157 ◽  
Author(s):  
Jorge Mendoza ◽  
Oscar Francke

Mexican red-kneed tarantulas of the genus Brachypelma are regarded as some of the most desirable invertebrate pets, and although bred in captivity, they continue to be smuggled out of the wild in large numbers. Species are often difficult to identify based solely on morphology; therefore, prompt and accurate identification is required for adequate protection. Thus, we explored the applicability of using COI-based DNA barcoding as a complementary identification tool. Brachypelma smithi (F. O. Pickard-Cambridge, 1897) and Brachypelma hamorii Tesmoingt, Cleton & Verdez, 1997 are redescribed, and their morphological differences defined. Brachypelma annitha is proposed as a new synonym of B. smithi. The current distribution of red-kneed tarantulas shows that the Balsas River basin may act as a geographical barrier. Morphological and molecular evidence are concordant and together provide robust hypotheses for delimiting Mexican red-kneed tarantula species. DNA barcoding of these tarantulas is further shown to be useful for species-level identification and for potentially preventing black market trade in these spiders. As a Convention on International Trade in Endangered Species (CITES) listing does not protect habitat, or control wildlife management or human interactions with organisms, it is important to support environmental conservation activities to provide an alternative income for local communities and to avoid damage to wildlife populations.


2019 ◽  
Vol 35 (14) ◽  
pp. i225-i232 ◽  
Author(s):  
Xiao Yang ◽  
Yasushi Saito ◽  
Arjun Rao ◽  
Hyunsung John Kim ◽  
Pranav Singh ◽  
...  

Abstract Motivation Cell-free nucleic acid (cfNA) sequencing data require improvements to existing fusion detection methods along multiple axes: high depth of sequencing, low allele fractions, short fragment lengths and specialized barcodes, such as unique molecular identifiers. Results AF4 was developed to address these challenges. It uses a novel alignment-free kmer-based method to detect candidate fusion fragments with high sensitivity and orders of magnitude faster than existing tools. Candidate fragments are then filtered using a max-cover criterion that significantly reduces spurious matches while retaining authentic fusion fragments. This efficient first stage reduces the data sufficiently that commonly used criteria can process the remaining information, or sophisticated filtering policies that may not scale to the raw reads can be used. AF4 provides both targeted and de novo fusion detection modes. We demonstrate both modes in benchmark simulated and real RNA-seq data as well as clinical and cell-line cfNA data. Availability and implementation AF4 is open sourced, licensed under Apache License 2.0, and is available at: https://github.com/grailbio/bio/tree/master/fusion.
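As an illustration of alignment-free, k-mer-based candidate detection, the sketch below indexes the k-mers of known gene sequences and flags a fragment when k-mers from at least two genes cover enough of its bases. This is not AF4's implementation: the simple `min_cover` threshold merely stands in for the max-cover criterion, whose details the abstract does not give, and every name here is hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genes, k):
    """Map each k-mer to the set of gene names whose sequence contains it."""
    index = defaultdict(set)
    for name, seq in genes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def gene_coverage(fragment, index, k):
    """Count how many fragment bases are covered by k-mers of each gene."""
    covered = defaultdict(set)
    for i in range(len(fragment) - k + 1):
        for gene in index.get(fragment[i:i + k], ()):
            covered[gene].update(range(i, i + k))
    return {gene: len(positions) for gene, positions in covered.items()}


def candidate_fusion(fragment, index, k, min_cover):
    """Return a sorted tuple of genes if at least two each cover min_cover bases."""
    coverage = gene_coverage(fragment, index, k)
    hits = sorted(gene for gene, n in coverage.items() if n >= min_cover)
    return tuple(hits) if len(hits) >= 2 else None
```

Because no alignment is computed, each fragment costs only a dictionary lookup per k-mer, which is the general reason k-mer screening can be much faster than aligning every read before filtering.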


2017 ◽  
Author(s):  
Páll Melsted ◽  
Shannon Hateley ◽  
Isaac Charles Joseph ◽  
Harold Pimentel ◽  
Nicolas Bray ◽  
...  

RNA sequencing in cancer cells is a powerful technique to detect chromosomal rearrangements, allowing for de novo discovery of actively expressed fusion genes. Here we focus on the problem of detecting gene fusions from raw sequencing data, assembling the reads to define fusion transcripts and their associated breakpoints, and quantifying their abundances. Building on the pseudoalignment idea that simplifies and accelerates transcript quantification, we introduce a novel approach to fusion detection based on inspecting paired reads that cannot be pseudoaligned due to conflicting matches. The method and software, called pizzly, filters false positives, assembles new transcripts from the fusion reads, and reports candidate fusions. With pizzly, fusion detection from raw RNA-Seq reads can be performed in a matter of minutes, making the program suitable for the analysis of large cancer gene expression databases and for clinical use. pizzly is available at https://github.com/pmelsted/pizzly


2014 ◽  
Author(s):  
Michael J Axtell

Eukaryotes produce large numbers of small non-coding RNAs that act as specificity determinants for various gene-regulatory complexes. These include microRNAs (miRNAs), endogenous short interfering RNAs (siRNAs), and Piwi-associated RNAs (piRNAs). These RNAs can be discovered, annotated, and quantified using small RNA-seq, a variant RNA-seq method based on highly parallel sequencing. Alignment to a reference genome is a critical step in analysis of small RNA-seq data. Because of their small size (20-30 nts depending on the organism and sub-type) and tendency to originate from multi-gene families or repetitive regions, reads that align equally well to more than one genomic location are very common. Typical methods to deal with multi-mapped small RNA-seq reads sacrifice either precision or sensitivity. The tool 'butter' balances precision and sensitivity by placing multi-mapped reads using an iterative approach, where the decision between possible locations is dictated by the local densities of more confidently aligned reads. Butter displays superior performance relative to other small RNA-seq aligners. Treatment of multi-mapped small RNA-seq reads has substantial impacts on downstream analyses, including quantification of MIRNA paralogs, and discovery of endogenous siRNA loci. Butter is freely available under a GNU general public license.
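The density-guided placement idea can be sketched as below: each multi-mapped read goes to the candidate location whose window holds the most already-placed reads, and newly placed reads feed back into the densities on the next pass. This is a simplified illustration rather than butter's actual procedure; the fixed `window` size, the rule of leaving tied reads unplaced, and all names are assumptions.

```python
from collections import Counter


def place_multimappers(unique_positions, multi_reads, window=100):
    """Place multi-mapped reads by local density of confidently aligned reads.

    unique_positions: list of (chrom, pos) for uniquely aligned reads.
    multi_reads: {read_id: [(chrom, pos), ...]} with non-empty candidate lists.
    Returns {read_id: (chrom, pos)} for reads with an unambiguous best location.
    """
    density = Counter((chrom, pos // window) for chrom, pos in unique_positions)
    placed = {}
    progress = True
    while progress:  # repeat until a full pass places no further read
        progress = False
        for read_id, candidates in multi_reads.items():
            if read_id in placed:
                continue
            scores = [density[(chrom, pos // window)] for chrom, pos in candidates]
            best = max(scores)
            if scores.count(best) == 1:  # place only when one location clearly wins
                chrom, pos = candidates[scores.index(best)]
                placed[read_id] = (chrom, pos)
                density[(chrom, pos // window)] += 1
                progress = True
    return placed
```

Reads whose candidate locations remain tied after the final pass are left out of the result, which trades some sensitivity for precision; a real tool has to decide how to report such reads.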


2019 ◽  
Vol 11 (21) ◽  
pp. 2537 ◽  
Author(s):  
Dandan Ma ◽  
Yuan Yuan ◽  
Qi Wang

A hyperspectral image usually covers a large ground scene containing various materials with different spectral properties. When background information is estimated directly from all image pixels, complex spectral interactions and the inter- and intra-class differences among samples significantly reduce the accuracy of background evaluation and, in turn, degrade detection performance. To address this problem, this paper proposes a novel hyperspectral anomaly detection method based on a separability-aware sample cascade model. By identifying the separability of hyperspectral pixels, background samples are sifted out layer by layer according to their degree of separability from anomalies, which ensures the accuracy and distinctiveness of the background representation. First, as spatial structure is beneficial for recognizing targets, a new spectral–spatial feature extraction technique based on PCA and edge-preserving filtering is used. Second, depending on their separability as computed by sparse representation, samples are separated into different sets that effectively and completely reflect the various characteristics of the background across all the cascade layers. Meanwhile, some potential abnormal targets are removed at each selection step to avoid their effects on subsequent layers. Finally, taking the complementary properties of all the separability-aware layers into consideration, a simple multilayer anomaly detection strategy is adopted to obtain the final detection map. Extensive experimental results on five real-world hyperspectral images demonstrate our method’s superior performance. Compared with seven representative anomaly detection methods, our method substantially improves the average detection accuracy.


2017 ◽  
Author(s):  
Daniel Mapleson ◽  
Luca Venturini ◽  
Gemy Kaithakottil ◽  
David Swarbreck

Abstract Next generation sequencing (NGS) technologies enable rapid and cheap genome-wide transcriptome analysis, providing vital information about gene structure, transcript expression and alternative splicing. Key to this is the accurate identification of exon-exon junctions from RNA sequencing (RNA-seq) reads. A number of RNA-seq aligners capable of splitting reads across these splice junctions (SJs) have been developed; however, it has been shown that while they correctly identify most genuine SJs available in a given sample, they also often produce large numbers of incorrect SJs. Herein we describe the extent of this problem using popular RNA-seq mapping tools, and present a new method, called Portcullis, to rapidly filter false SJs from spliced alignments produced by any RNA-seq mapper capable of creating SAM/BAM files. We show that Portcullis distinguishes between genuine and false positive junctions to a high degree of accuracy across different species, samples, expression levels, error profiles and read lengths. Portcullis makes efficient use of memory and threading and, to our knowledge, is currently the only SJ prediction tool that reliably scales for use with large RNA-seq datasets and large, highly fragmented genomes, whilst delivering highly accurate SJs. Availability Portcullis is available under the GPLv3 license at: http://maplesond.github.io/portcullis/


BMC Genomics ◽  
2020 ◽  
Vol 21 (S11) ◽  
Author(s):  
Qian Liu ◽  
Yu Hu ◽  
Andres Stucky ◽  
Li Fang ◽  
Jiang F. Zhong ◽  
...  

Abstract Background Long-read RNA-Seq techniques can generate reads that encompass a large proportion or the entirety of mRNA/cDNA molecules, so they are expected to address inherent limitations of short-read RNA-Seq techniques that typically generate < 150 bp reads. However, there is a general lack of software tools for gene fusion detection from long-read RNA-seq data that take into account the high basecalling error rates and the presence of alignment errors. Results In this study, we developed a fast computational tool, LongGF, to efficiently detect candidate gene fusions from long-read RNA-seq data, including cDNA sequencing data and direct mRNA sequencing data. We evaluated LongGF on tens of simulated long-read RNA-seq datasets, and demonstrated its superior performance in gene fusion detection. We also tested LongGF on a Nanopore direct mRNA sequencing dataset and a PacBio sequencing dataset generated on a mixture of 10 cancer cell lines, and found that LongGF outperformed existing computational tools in detecting known gene fusions. Furthermore, we tested LongGF on a Nanopore cDNA sequencing dataset on acute myeloid leukemia, and pinpointed the exact location of a translocation (previously known at cytogenetic resolution) at base resolution, which was further validated by Sanger sequencing. Conclusions In summary, LongGF will greatly facilitate the discovery of candidate gene fusion events from long-read RNA-Seq data, especially in cancer samples. LongGF is implemented in C++ and is available at https://github.com/WGLab/LongGF.


2019 ◽  
Vol 36 (7) ◽  
pp. 2256-2257
Author(s):  
Readman Chiu ◽  
Ka Ming Nip ◽  
Inanc Birol

Abstract Summary Presence or absence of gene fusions is one of the most important diagnostic markers in many cancer types. Consequently, fusion detection methods using various genomics data types, such as RNA sequencing (RNA-seq), are valuable tools for research and clinical applications. While information-rich RNA-seq data have proven to be instrumental in the discovery of a number of hallmark fusion events, bioinformatics tools to detect fusions still have room for improvement. Here, we present Fusion-Bloom, a fusion detection method that leverages recent developments in de novo transcriptome assembly and assembly-based structural variant calling technologies (RNA-Bloom and PAVFinder, respectively). We benchmarked Fusion-Bloom against the performance of five other state-of-the-art fusion detection tools using multiple datasets. Overall, we observed Fusion-Bloom to display a good balance between detection sensitivity and specificity. We expect the tool to find applications in translational research and clinical genomics pipelines. Availability and implementation Fusion-Bloom is implemented as a UNIX Make utility, available at https://github.com/bcgsc/pavfinder and released under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/. Supplementary information Supplementary data are available at Bioinformatics online.

