Systematic evaluation of differential splicing tools for RNA-seq studies

Arfa Mehmood; Asta Laiho; Mikko S Venäläinen; Aidan J McGlinchey; Ning Wang; Laura L Elo

doi:10.1093/bib/bbz126

Briefings in Bioinformatics; doi:10.1093/bib/bbz126; 2019; Vol 21 (6); pp. 2052-2065; Cited By ~ 9

Author(s): Arfa Mehmood; Asta Laiho; Mikko S Venäläinen; Aidan J McGlinchey; Ning Wang; ...

Keyword(s): Biological Process; Functional Enrichment; Systematic Evaluation; Data Sets; Rna Seq; Differential Splicing; False Discovery; Analysis Tools; Event Based; Better Than

Abstract: Differential splicing (DS) is a post-transcriptional biological process with critical, wide-ranging effects on a plethora of cellular activities and disease processes. To date, a number of computational approaches have been developed to identify and quantify differentially spliced genes from RNA-seq data, but a comprehensive intercomparison and appraisal of these approaches is currently lacking. In this study, we systematically evaluated 10 DS analysis tools for consistency and reproducibility, precision, recall and false discovery rate, agreement upon reported differentially spliced genes and functional enrichment. The tools were selected to represent three methodological categories: exon-based (DEXSeq, edgeR, JunctionSeq, limma), isoform-based (cuffdiff2, DiffSplice) and event-based methods (dSpliceType, MAJIQ, rMATS, SUPPA). All the exon-based methods and two event-based methods (MAJIQ and rMATS) scored well on the selected measures. Of the 10 tools tested, the exon-based methods generally performed better than the isoform-based and event-based methods. However, the tools performed strikingly differently across different data sets or numbers of samples.
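
The precision, recall and false discovery rate named in the abstract can be illustrated with a minimal sketch, assuming a known set of truly differentially spliced genes is available (for example from simulated data). The function name `ds_metrics` and the gene identifiers are hypothetical; this is not the evaluation code used in the study.

```python
def ds_metrics(called, truth):
    """Precision, recall and FDR of a set of called genes against a known truth set."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and truly differentially spliced
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # true genes the tool missed
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    fdr = fp / (tp + fp) if called else 0.0
    return {"precision": precision, "recall": recall, "fdr": fdr}


# Hypothetical gene identifiers, for illustration only.
metrics = ds_metrics(called=["g1", "g2", "g3", "g4"], truth=["g1", "g2", "g5"])
```

With sets as input, agreement between two tools can be examined in the same way by treating one tool's calls as the reference.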

A survey on identification and quantification of alternative polyadenylation sites from RNA-seq data

Briefings in Bioinformatics; doi:10.1093/bib/bbz068; 2019; Vol 21 (4); pp. 1261-1276; Cited By ~ 7

Author(s): Moliang Chen; Guoli Ji; Hongjuan Fu; Qianmin Lin; Congting Ye; ...

Keyword(s): Gene Expression Regulation; Simulated Data; Alternative Polyadenylation; Transcriptome Profiling; Systematic Evaluation; Data Sets; Rna Seq; Comprehensive Overview; Computational Approaches; The Status

Abstract: Alternative polyadenylation (APA) has been implicated in post-transcriptional regulation through its effects on mRNA abundance, stability, localization and translation, and it contributes considerably to transcriptome diversity and gene expression regulation. RNA-seq has become a routine approach for transcriptome profiling, generating unprecedented data that could be used to identify and quantify APA site usage. A number of computational approaches for identifying APA sites and/or dynamic APA events from RNA-seq data have emerged in the literature, which provide valuable yet preliminary results that should be refined to yield credible guidelines for the scientific community. In this review, we provided a comprehensive overview of the status of currently available computational approaches. We also conducted objective benchmarking analysis using RNA-seq data sets from different species (human, mouse and Arabidopsis) and simulated data sets to present a systematic evaluation of 11 representative methods. Our benchmarking study showed that the overall performance of all tools investigated is moderate, reflecting that there is still a lot of scope to improve the prediction of APA sites or dynamic APA events from RNA-seq data. In particular, prediction results from individual tools differ considerably, and only a limited number of predicted APA sites or genes are common among different tools. Accordingly, we attempted to give some advice on how to assess the reliability of the obtained results. We also proposed practical recommendations on the appropriate method for diverse scenarios and discussed implications and future directions relevant to profiling APA from RNA-seq data.

A large-sample crisis? Exaggerated false positives by popular differential expression methods

doi:10.1101/2021.08.25.457733; 2021; Cited By ~ 1

Author(s): Yumei Li; Xinzhou Ge; Fanglue Peng; Wei Li; Jingyi Jessica Li

Keyword(s): Parametric Method; Population Level; Nonparametric Test; Rna Seq; False Discovery Rates; False Discovery; Wilcoxon Rank Sum Test; Permutation Analysis; Better Than; Non Parametric

Abstract: We report a surprising phenomenon about identifying differentially expressed genes (DEGs) from population-level RNA-seq data: two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates (FDRs). Via permutation analysis on an immunotherapy RNA-seq dataset, we observed that DESeq2 and edgeR identified even more DEGs after samples' condition labels were randomly permuted. Motivated by this, we evaluated six DEG identification methods (DESeq2, edgeR, limma-voom, NOISeq, dearseq, and the Wilcoxon rank-sum test) on population-level RNA-seq datasets. We found that FDR control often failed for the three popular parametric methods (DESeq2, edgeR, and limma-voom) and for the new non-parametric method dearseq. In particular, the actual FDRs of DESeq2 and edgeR sometimes exceeded 20% when the target FDR threshold was only 5%. Although NOISeq, a non-parametric method used by GTEx, controlled the FDR better than the other four methods did, its power was much lower than that of the Wilcoxon rank-sum test, a classic non-parametric test that consistently controlled the FDR and achieved good power in our evaluation. Based on these results, for population-level RNA-seq studies, we recommend the Wilcoxon rank-sum test.
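
The permutation check described in the abstract can be sketched as follows. This is a simplified illustration, not the authors' code: the function names are hypothetical, the rank-sum p-value uses the normal approximation without a tie correction, and genes are counted at a raw p-value cutoff rather than after FDR adjustment.

```python
import math
import random


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mid = (i + j) / 2 + 1  # tied values share the average 1-based rank
        for k in range(i, j + 1):
            ranks[k] = mid
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    var = n1 * n2 * (n1 + n2 + 1) / 12
    z = (w - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def count_calls(expr, labels, alpha=0.05):
    """Count genes (rows of expr) with raw p < alpha for a 0/1 labelling of samples."""
    calls = 0
    for values in expr:
        x = [v for v, lab in zip(values, labels) if lab == 0]
        y = [v for v, lab in zip(values, labels) if lab == 1]
        if rank_sum_pvalue(x, y) < alpha:
            calls += 1
    return calls


# Toy data: one gene that separates the two conditions and one that does not.
expr = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [1, 3, 5, 7, 9, 2, 4, 6, 8, 10],
]
labels = [0] * 5 + [1] * 5
observed = count_calls(expr, labels)

# Permute the condition labels and count again; a well-calibrated method
# should call few genes on average once the labels carry no signal.
permuted = labels[:]
random.Random(0).shuffle(permuted)
after_permutation = count_calls(expr, permuted)
```

Repeating the shuffle many times and comparing the distribution of `after_permutation` with `observed` gives the kind of sanity check the abstract describes.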

BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty

doi:10.1101/750018; 2019

Author(s): Simone Tiberi; Mark D Robinson

Keyword(s): Gene Expression; Latent Variables; Biological Process; Single Gene; Transcript Level; Bioconductor Package; Rna Seq; Differential Splicing; Bayesian Hierarchical; Splicing Patterns

Abstract: Alternative splicing is a biological process during gene expression that allows a single gene to code for multiple proteins. However, splicing patterns can be altered in some conditions or diseases. Here, we present BANDITS, an R/Bioconductor package to perform differential splicing analysis, at both the gene and transcript level, based on RNA-seq data. BANDITS uses a Bayesian hierarchical structure to explicitly model the variability between samples, and treats the transcript allocation of reads as latent variables. We perform an extensive benchmark across both simulated and experimental RNA-seq datasets, where BANDITS has extremely favorable performance with respect to the competitors considered.

Comment on TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions by Kim et al.

doi:10.1101/000851; 2013; Cited By ~ 8

Author(s): Alexander Dobin; Thomas R Gingeras

Keyword(s): Simulated Data; Splice Junction; Gene Fusions; Rna Seq; Incorrect Choice; False Discovery; Mapping Parameters; Junction Detection; Low Sensitivity; Better Than

In the recent paper by Kim et al. (Genome Biology, 2013. 14(4): p. R36), the accuracy of TopHat2 was compared to that of other RNA-seq aligners. In this comment we re-examine the most important analyses from this paper and identify several deficiencies that significantly diminished the performance of some of the aligners, including incorrect choice of mapping parameters, unfair comparison metrics, and unrealistic simulated data. Using STAR (Dobin et al., Bioinformatics, 2013. 29(1): p. 15-21) as an exemplar, we demonstrate that correcting these deficiencies makes its accuracy equal to or better than that of TopHat2. Furthermore, this exercise highlighted some serious issues with the TopHat2 algorithms, such as poor recall of alignments with a moderate (>3) number of mismatches, low sensitivity and high false discovery rate for splice junction detection, loss of precision for the realignment algorithm, and a large number of false chimeric alignments.

Characterization of kinase gene expression and splicing profile in prostate cancer with RNA-Seq data

doi:10.1101/061085; 2016

Author(s): Huijuan Feng; Tingting Li; Xuegong Zhang

Keyword(s): Gene Expression; Prostate Cancer; Alternative Splicing; Differential Expression; Functional Enrichment; Cancer Development; Rna Seq; Differential Splicing; Kinase Gene; Isoform Switching

Abstract. Background: Alternative splicing is a ubiquitous post-transcriptional process in most eukaryotic genes. Aberrant splicing isoforms and abnormal isoform ratios can contribute to cancer development. Kinase genes are key regulators of various cellular processes. Many kinases are found to be oncogenic and have been intensively investigated in the study of cancer and drugs. RNA-Seq provides a powerful technology for genome-wide study of alternative splicing in cancer, beyond conventional gene expression profiling, but this potential has not yet been fully demonstrated. Methods: Here we characterized the transcriptome profile of prostate cancer using RNA-Seq data from the viewpoints of both differential expression and differential splicing, with an emphasis on kinase genes and their splicing variations. We built a pipeline to conduct differential expression and differential splicing analysis. Further functional enrichment analysis was performed to explore the functional interpretation of the genes. With a focus on kinase genes, we performed kinase domain analysis to identify functionally important candidate kinase genes in prostate cancer. We further calculated the expression level of isoforms to explore the function of isoform switching of kinase genes in prostate cancer. Results: We identified distinct gene groups from differential expression and splicing analysis, which suggested that alternative splicing adds another level to gene expression regulation. Enriched GO terms of differentially expressed and spliced kinase genes were found to play different roles in regulation of cellular metabolism. Functional analysis of differentially spliced kinase genes showed that differentially spliced exons of these genes are significantly enriched in protein kinase domains. Among them, we found that the gene CDK5 has isoform switching between prostate cancer and benign tissues, which may affect cancer development by changing androgen receptor (AR) phosphorylation. The observation was validated in another RNA-Seq dataset of prostate cancer cell lines. Conclusions: Our work characterized the expression and splicing profile of kinase genes in prostate cancer and proposed a hypothetical model on isoform switching of CDK5 and AR phosphorylation in prostate cancer. These findings bring new understanding to the role of alternatively spliced kinases in prostate cancer and demonstrate the use of RNA-Seq data in studying alternative splicing in cancer.

Faculty Opinions recommendation of A systematic evaluation of single cell RNA-seq analysis pipelines.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature; doi:10.3410/f.736736304.793566365; 2019

Author(s): Hans-Rudolf Hotz

Keyword(s): Single Cell; Systematic Evaluation; Rna Seq

Identification of Candidate Genetic Markers and a Novel 4-genes Diagnostic Model in Osteoarthritis through Integrating Multiple Microarray Data

Combinatorial Chemistry & High Throughput Screening; doi:10.2174/1386207323666200428120310; 2020; Vol 23 (8); pp. 805-813

Author(s): Ai Jiang; Peng Xu; Zhenda Zhao; Qizhao Tan; Shang Sun; ...

Keyword(s): Signaling Pathway; Microarray Data; Differential Expression Analysis; Enrichment Analysis; Mapk Signaling; Functional Enrichment; Joint Disease; Support Vector; Diagnostic Model; Data Sets

Background: Osteoarthritis (OA) is a joint disease that leads to a high disability rate and a low quality of life. With the development of modern molecular biology techniques, some key genes and diagnostic markers have been reported. However, the etiology and pathogenesis of OA are still unknown. Objective: To develop a gene signature in OA. Method: In this study, five microarray data sets were integrated to conduct a comprehensive network and pathway analysis of the biological functions of OA-related genes, which can provide valuable information for further exploring the etiology and pathogenesis of OA. Results and Discussion: Differential expression analysis identified 180 genes with significantly differential expression in OA. Functional enrichment analysis showed that the up-regulated genes were associated with rheumatoid arthritis (p < 0.01). Down-regulated genes were involved in the biological process of negative regulation of kinase activity and in signaling pathways such as the MAPK signaling pathway (p < 0.001) and the IL-17 signaling pathway (p < 0.001). In addition, an OA-specific protein-protein interaction (PPI) network was constructed based on the differentially expressed genes. The analysis of network topological attributes showed that the differentially up-regulated VEGFA, MYC, ATF3 and JUN genes were hub genes of the network, which may influence the occurrence and development of OA through regulating cell cycle or apoptosis, and were potential biomarkers of OA. Finally, the support vector machine (SVM) method was used to establish the diagnosis model of OA, which had excellent predictive power in internal and external data sets (AUC > 0.9), high predictive performance across different chip platforms (AUC > 0.9) and effective performance in blood samples (AUC > 0.8). Conclusion: The 4-genes diagnostic model may be of great help to the early diagnosis and prediction of OA.

Development of genic KASP SNP markers from RNA-Seq data for map-based cloning and marker-assisted selection in maize

BMC Plant Biology; doi:10.1186/s12870-021-02932-8; 2021; Vol 21 (1)

Author(s): Zhengjie Chen; Dengguo Tang; Jixing Ni; Peng Li; Le Wang; ...

Keyword(s): Marker Assisted Selection; Inbred Lines; Average Density; Snp Markers; Data Sets; Rna Seq; Specific Pcr; Maize Inbred Lines; Allele Specific; Allele Specific Pcr

Abstract. Background: Maize is one of the most important field crops in the world. Most of the key agronomic traits, including yield traits and plant architecture traits, are quantitative. Fine mapping of genes/quantitative trait loci (QTL) influencing a key trait is essential for marker-assisted selection (MAS) in maize breeding. However, SNP markers with high density and high polymorphism are lacking, especially kompetitive allele specific PCR (KASP) SNP markers that can be used for automatic genotyping. To date, a large volume of sequencing data has been produced by next-generation sequencing technology, which provides a good pool of SNP loci for the development of SNP markers. In this study, we carried out a multi-step screening method to identify KASP SNP markers based on the RNA-Seq data sets of 368 maize inbred lines. Results: A total of 2,948,985 SNPs were identified in the high-throughput RNA-Seq data sets, with an average density of 1.4 SNP/kb. Of these, 71,311 KASP SNP markers (average density of 34 KASP SNP/Mb) were developed based on strict criteria (unique genomic region, bi-allelic, polymorphism information content (PIC) value ≥0.4, and conserved primer sequences) and were mapped to 16,161 genes. These 16,161 genes were annotated to 52 gene ontology (GO) terms, including most primary and secondary metabolic pathways. Subsequently, 50 KASP SNP markers with PIC values ranging from 0.14 to 0.5 in the 368 RNA-Seq data sets, and with polymorphism between the maize inbred lines 1212 and B73 in in silico analysis, were selected to experimentally validate the accuracy and polymorphism of the SNPs; 46 SNPs (92.00%) showed polymorphism between the maize inbred lines 1212 and B73. Moreover, these 46 polymorphic SNPs were used to genotype another 20 maize inbred lines, with all 46 SNPs showing polymorphism in the 20 lines; the PIC value of each SNP ranged from 0.11 to 0.50, with an average of 0.35. The results suggested that the KASP SNP markers developed in this study were accurate and polymorphic. Conclusions: These high-density polymorphic KASP SNP markers will be a valuable resource for map-based cloning of QTL/genes and marker-assisted selection in maize. Furthermore, the method used to develop SNP markers in maize can also be applied in other species.
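
The PIC filter in the screening criteria can be illustrated with a short sketch. The abstract does not state which PIC formula was used; the code below assumes the simplified form PIC = 1 − Σ p_i², under which a bi-allelic marker reaches at most 0.5 (matching the upper end of the reported range). The function names and allele frequencies are hypothetical.

```python
def pic(allele_freqs):
    """Simplified polymorphism information content: 1 - sum of squared allele frequencies.

    Assumption: the abstract does not give the formula; this form is chosen
    because its bi-allelic maximum (0.5) matches the reported range.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


def passes_pic(allele_freqs, threshold=0.4):
    """Apply the PIC >= 0.4 screening criterion mentioned in the abstract."""
    return pic(allele_freqs) >= threshold


# Hypothetical bi-allelic markers: (major, minor) allele frequencies.
balanced = pic([0.5, 0.5])   # most informative bi-allelic case
skewed = pic([0.8, 0.2])     # less informative marker
```

Under this assumed formula, a bi-allelic marker passes the 0.4 threshold only when its minor allele frequency is fairly high (roughly 0.28 or above).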

MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

BMC Bioinformatics; doi:10.1186/s12859-021-04288-0; 2021; Vol 22 (1)

Author(s): Yance Feng; Lei M. Li

Keyword(s): Biological Significance; Housekeeping Genes; R Package; Data Sets; Statistical Regression; Rna Seq; Least Trimmed Squares; Standard Data; Wide Range; Multiple References

Abstract. Background: Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results: We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The assessment of normalization quality emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by a single-cell data set of the cell cycle. MUREN is implemented as an R package. The code, under the GPL-3 license, is available on GitHub (github.com/hippo-yf/MUREN) and on conda (anaconda.org/hippo-yf/r-muren). Conclusions: MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
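
To give a sense of why least trimmed squares (LTS) is robust, here is a minimal sketch of LTS for the simplest possible model, a single location (shift) parameter. It is not MUREN's implementation: the function name, the default subset size and the toy log-ratios are assumptions made for illustration. For a location model, the exact LTS estimate is the mean of the contiguous block of h sorted values with the smallest sum of squared deviations.

```python
def lts_location(values, h=None):
    """Least trimmed squares estimate of a location (shift) parameter.

    Keeps the h values that fit best and ignores the rest. By default
    h = n // 2 + 1, a common choice for maximal robustness (an assumption
    here, not a setting taken from MUREN).
    """
    xs = sorted(values)
    n = len(xs)
    if h is None:
        h = n // 2 + 1
    best_ss, best_mean = None, None
    # In one dimension the optimal h-subset is a contiguous window of sorted values.
    for start in range(n - h + 1):
        window = xs[start:start + h]
        mean = sum(window) / h
        ss = sum((v - mean) ** 2 for v in window)
        if best_ss is None or ss < best_ss:
            best_ss, best_mean = ss, mean
    return best_mean


# Toy log-ratios between a sample and a reference: most genes share a shift
# of 1.0, while a minority are strongly up-regulated.
log_ratios = [1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 6.0, 7.0]
robust_shift = lts_location(log_ratios)
plain_mean = sum(log_ratios) / len(log_ratios)
```

The ordinary mean is pulled toward the up-regulated genes, whereas the trimmed estimate stays with the majority, which is the behavior the abstract relates to housekeeping genes and to preserving asymmetric differentiation.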

A Novel LSTM Model with Interaction Dual Attention for Radar Echo Extrapolation

Remote Sensing; doi:10.3390/rs13020164; 2021; Vol 13 (2); pp. 164

Author(s): Chuyao Luo; Xutao Li; Yongliang Wen; Yunming Ye; Xiaofeng Zhang

Keyword(s): Short Term Memory; Weather Forecast; Vital Role; Data Sets; Short Term; Learning Techniques; Radar Echo; Hidden States; Better Than

Precipitation nowcasting is an important task in operational weather forecasting, and radar echo map extrapolation plays a vital role in it. Recently, deep learning techniques such as Convolutional Recurrent Neural Network (ConvRNN) models have been designed to solve the task. These models, although they perform much better than conventional optical-flow-based approaches, suffer from a common problem of underestimating the high echo value parts. The drawback is fatal to precipitation nowcasting, as these parts often correspond to heavy rains that may cause natural disasters. In this paper, we propose a novel interaction dual attention long short-term memory (IDA-LSTM) model to address the drawback. In the method, an interaction framework is developed for the ConvRNN unit to fully exploit the short-term context information by constructing a series of coupled convolutions on the input and hidden states. Moreover, a dual attention mechanism on channels and positions is developed to recall the forgotten information in the long term. Comprehensive experiments have been conducted on the CIKM AnalytiCup 2017 data sets, and the results show the effectiveness of IDA-LSTM in addressing the underestimation drawback. The extrapolation performance of IDA-LSTM is superior to that of the state-of-the-art methods.
