A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs

10.1101/378539 ◽

2018 ◽

Cited By ~ 3

Author(s):

Charlotte Soneson ◽

Michael I Love ◽

Rob Patro ◽

Shobbir Hussain ◽

Dheeraj Malhotra ◽

...

Keyword(s):

Splice Junction ◽

Transcript Level ◽

Transcript Abundance ◽

Genomic Region ◽

Rna Seq ◽

Poor Agreement ◽

Abundance Estimates ◽

Small Set ◽

Good Agreement

Most methods for statistical analysis of RNA-seq data take a matrix of abundance estimates for some type of genomic features as their input, and consequently the quality of any obtained results is directly dependent on the quality of these abundances. Here, we present the junction coverage compatibility (JCC) score, which provides a way to evaluate the reliability of transcript-level abundance estimates as well as the accuracy of transcript annotation catalogs. It works by comparing the observed number of reads spanning each annotated splice junction in a genomic region to the predicted number of junction-spanning reads, inferred from the estimated transcript abundances and the genomic coordinates of the corresponding annotated transcripts. We show that while most genes show good agreement between the observed and predicted junction coverages, there is a small set of genes that do not. Genes with poor agreement are found regardless of the method used to estimate transcript abundances, and the corresponding transcript abundances should be treated with care in any downstream analyses.
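The comparison described in this abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the score's formula, so the sketch assumes a simplified version in which the predicted coverage of a junction is the summed abundance of the transcripts containing it, the predictions are rescaled to the observed total, and the score is the summed absolute discrepancy divided by the observed total. The function names and the `(start, end)` junction representation are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a junction is the (start, end) of an intron.
Junction = Tuple[int, int]


def predicted_junction_coverage(
    transcript_junctions: Dict[str, List[Junction]],
    transcript_abundance: Dict[str, float],
) -> Dict[Junction, float]:
    """Sum the estimated abundance of every transcript containing each junction.

    Simplifying assumption: read length and coverage biases are ignored.
    """
    predicted: Dict[Junction, float] = {}
    for tx, junctions in transcript_junctions.items():
        for junction in junctions:
            predicted[junction] = predicted.get(junction, 0.0) + transcript_abundance.get(tx, 0.0)
    return predicted


def compatibility_score(
    observed: Dict[Junction, float],
    predicted: Dict[Junction, float],
) -> float:
    """Scaled absolute discrepancy between observed and predicted coverage.

    0 means perfect agreement; larger values mean poorer agreement.
    Returns NaN when either side has no coverage at all.
    """
    junctions = set(observed) | set(predicted)
    observed_total = sum(observed.get(j, 0.0) for j in junctions)
    predicted_total = sum(predicted.get(j, 0.0) for j in junctions)
    if observed_total == 0 or predicted_total == 0:
        return float("nan")
    # Rescale predictions so both sides sum to the same total.
    scale = observed_total / predicted_total
    discrepancy = sum(
        abs(scale * predicted.get(j, 0.0) - observed.get(j, 0.0)) for j in junctions
    )
    return discrepancy / observed_total
```

In this sketch a gene whose observed junction counts are proportional to the predictions scores 0, and the score grows as reads shift toward junctions that the abundance estimates do not explain.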


A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs

Life Science Alliance ◽

10.26508/lsa.201800175 ◽

2019 ◽

Vol 2 (1) ◽

pp. e201800175 ◽

Cited By ~ 10

Author(s):

Charlotte Soneson ◽

Michael I Love ◽

Rob Patro ◽

Shobbir Hussain ◽

Dheeraj Malhotra ◽

...

Keyword(s):

Splice Junction ◽

Transcript Level ◽

Transcript Abundance ◽

Genomic Region ◽

Rna Seq ◽

Poor Agreement ◽

Abundance Estimates ◽

Small Set ◽

Good Agreement

Most methods for statistical analysis of RNA-seq data take a matrix of abundance estimates for some type of genomic features as their input, and consequently the quality of any obtained results is directly dependent on the quality of these abundances. Here, we present the junction coverage compatibility score, which provides a way to evaluate the reliability of transcript-level abundance estimates and the accuracy of transcript annotation catalogs. It works by comparing the observed number of reads spanning each annotated splice junction in a genomic region to the predicted number of junction-spanning reads, inferred from the estimated transcript abundances and the genomic coordinates of the corresponding annotated transcripts. We show that although most genes show good agreement between the observed and predicted junction coverages, there is a small set of genes that do not. Genes with poor agreement are found regardless of the method used to estimate transcript abundances, and the corresponding transcript abundances should be treated with care in any downstream analyses.


Differential Transcript Usage Analysis Incorporating Quantification Uncertainty Via Compositional Measurement Error Regression Modeling

10.1101/2020.05.22.111450 ◽

2020 ◽

Cited By ~ 1

Author(s):

Scott Van Buren ◽

Naim Rashid

Keyword(s):

Transcript Level ◽

Transcript Abundance ◽

Rna Seq ◽

Continuous Covariates ◽

Abundance Estimates ◽

Testing Performance ◽

Computational Procedures ◽

Computational Simplicity ◽

Usage Analysis ◽

Positive Results

Differential transcript usage (DTU) occurs when the relative transcript abundance of a gene changes between different conditions. Existing approaches to analyze DTU often rely on computational procedures that can have speed and scalability issues as the number of samples increases. In this paper, we propose a new method, termed CompDTU, that utilizes compositional regression to model the transcript-level relative abundance proportions that are of interest in DTU analyses. Owing to its relative computational simplicity, this procedure does not suffer from speed and scalability issues, making it ideally suited for DTU analysis with large sample sizes. The method also allows for the testing of and controlling for multiple categorical or continuous covariates. Additionally, many existing approaches for DTU ignore the quantification uncertainty present in RNA-Seq data, although prior work has shown that accounting for such uncertainty may improve testing performance. We extend our CompDTU method to incorporate quantification uncertainty using bootstrap replicates of abundance estimates from Salmon and term this method CompDTUme. Through several power analyses, we show that CompDTU improves sensitivity and reduces false positive results relative to existing methods. Additionally, CompDTUme results in further improvements in performance over CompDTU with sufficient sample size for genes with high levels of quantification uncertainty while maintaining favorable speed and scalability.
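As a rough illustration of what "transcript-level relative abundance proportions" and a compositional treatment of them look like, the Python sketch below converts one gene's transcript counts to proportions and applies a centered log-ratio transform. This is an assumption made for illustration: the abstract does not state which log-ratio transform or regression model CompDTU uses, and the function names and the pseudocount value are hypothetical.

```python
import math
from typing import List


def relative_abundance(counts: List[float], pseudocount: float = 0.5) -> List[float]:
    """Convert one gene's transcript counts to proportions that sum to 1.

    The pseudocount (an assumed value) avoids zero proportions, which a
    log-ratio transform cannot handle.
    """
    shifted = [c + pseudocount for c in counts]
    total = sum(shifted)
    return [s / total for s in shifted]


def centered_log_ratio(proportions: List[float]) -> List[float]:
    """Map proportions to unconstrained coordinates (log of each part minus the mean log)."""
    logs = [math.log(p) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]
```

Coordinates of this kind can be used as the response in an ordinary multivariate regression on condition and covariates, which is the general idea behind compositional regression.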


Zea mays RNA-seq estimated transcript abundances are strongly affected by read mapping bias

BMC Genomics ◽

10.1186/s12864-021-07577-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Shuhua Zhan ◽

Cortland Griswold ◽

Lewis Lukens

Keyword(s):

Gene Expression ◽

Zea Mays ◽

Reference Genome ◽

Transcript Abundance ◽

Gene Transcript ◽

Rna Seq ◽

Individual Genome ◽

Abundance Estimates ◽

Mapping Bias ◽

Quantify Gene Expression

Background: Genetic variation for gene expression is a source of phenotypic variation for natural and agricultural species. The common approach to map and to quantify gene expression from genetically distinct individuals is to assign their RNA-seq reads to a single reference genome. However, RNA-seq reads from alleles dissimilar to this reference genome may fail to map correctly, causing transcript levels to be underestimated. Presently, the extent of this mapping problem is not clear, particularly in highly diverse species. We investigated whether mapping bias occurred and whether chromosomal features were associated with mapping bias. Zea mays presents a model species to assess these questions, given that it has genotypically distinct and well-studied genetic lines.

Results: In Zea mays, the inbred B73 genome is the standard reference genome and template for RNA-seq read assignments. In the absence of mapping bias, B73 and a second inbred line, Mo17, would each have an approximately equal number of regulatory alleles that increase gene expression. Remarkably, Mo17 had 2–4 times fewer such positively acting alleles than did B73 when RNA-seq reads were aligned to the B73 reference genome. Reciprocally, over one-half of the B73 alleles that increased gene expression were not detected when reads were aligned to the Mo17 genome template. Genes at dissimilar chromosomal ends were strongly affected by mapping bias, and genes at more similar pericentromeric regions were less affected. Biased transcript estimates were higher in untranslated regions and lower in splice junctions. Bias occurred across software and alignment parameters.

Conclusions: Mapping bias very strongly affects gene transcript abundance estimates in maize, and bias varies across chromosomal features. Individual genome or transcriptome templates are likely necessary for accurate transcript estimation across genetically variable individuals in maize and other species.


DeepShape: estimating isoform-level ribosome abundance and distribution with Ribo-seq data

BMC Bioinformatics ◽

10.1186/s12859-019-3244-0 ◽

2019 ◽

Vol 20 (S24) ◽

Cited By ~ 1

Author(s):

Hongfei Cui ◽

Hailin Hu ◽

Jianyang Zeng ◽

Ting Chen

Keyword(s):

Transcript Level ◽

Transcript Abundance ◽

Ribosome Profiling ◽

Computational Method ◽

Translational Efficiency ◽

Rna Seq ◽

Basic Step ◽

Synonymous Codons ◽

Abundance And Distribution ◽

Different Characteristics

Background: Ribosome profiling brings insight to the process of translation. A basic step in profile construction at the transcript level is to map Ribo-seq data to transcripts and then assign a huge number of multiple-mapped reads to similar isoforms. Existing methods either discard the multiple-mapped reads, allocate them randomly, or assign them proportionally according to transcript abundance estimated from RNA-seq data.

Results: Here we present DeepShape, an RNA-seq-free computational method to estimate the ribosome abundance of isoforms and simultaneously compute their ribosome profiles using a deep learning model. Our simulation results demonstrate that DeepShape can provide more accurate estimates of both ribosome abundance and profiles when compared to state-of-the-art methods. We applied DeepShape to a set of Ribo-seq data from PC3 human prostate cancer cells with and without PP242 treatment. In the four cell invasion/metastasis genes that are translationally regulated by PP242 treatment, different isoforms show very different characteristics of translational efficiency and regulation patterns. Transcript-level ribosome distributions were analyzed with the Codon Residence Index (CRI) proposed in this study, which investigates the relative speed at which a ribosome moves on a codon compared to its synonymous codons. We observe consistent CRI patterns in PC3 cells. We found that the translation of several codons could be regulated by PP242 treatment.

Conclusion: In summary, we demonstrate that DeepShape can serve as a powerful tool for Ribo-seq data analysis.


Sashimi plots: Quantitative visualization of alternative isoform expression from RNA-seq data

10.1101/002576 ◽

2014 ◽

Cited By ~ 3

Author(s):

Yarden Katz ◽

Eric T Wang ◽

Jacob Silterra ◽

Schraga Schwartz ◽

Bang Wong ◽

...

Keyword(s):

Region Of Interest ◽

Splice Junction ◽

Genomic Region ◽

Rna Seq ◽

Mrna Isoforms ◽

Experimental Conditions ◽

Human Genes ◽

Quantitative Visualization ◽

Isoform Expression ◽

Multiple Samples

Analysis of RNA sequencing (RNA-Seq) data revealed that the vast majority of human genes express multiple mRNA isoforms, produced by alternative pre-mRNA splicing and other mechanisms, and that most alternative isoforms vary in expression between human tissues. As RNA-Seq datasets grow in size, it remains challenging to visualize isoform expression across multiple samples. We present Sashimi plots, a quantitative multi-sample visualization of RNA-Seq reads aligned to gene annotations, which enables quantitative comparison of isoform usage across samples or experimental conditions. Given an input annotation and spliced alignments of reads from a sample, a region of interest is visualized in a Sashimi plot as follows: (i) alignments in exons are represented as read densities (optionally normalized by length of genomic region and coverage), and (ii) splice junction reads are drawn as arcs connecting a pair of exons, where arc width is drawn proportional to the number of reads aligning to the junction.
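Step (ii) above, drawing arc width in proportion to junction read count, reduces to a simple scaling. The Python sketch below shows one way to compute it; the function name, the `(start, end)` junction keys, and the `max_width` default are hypothetical choices for illustration and are not taken from the Sashimi plot software.

```python
from typing import Dict, Tuple

# Hypothetical representation: a junction is the (start, end) of the spliced-out region.
Junction = Tuple[int, int]


def arc_widths(junction_reads: Dict[Junction, int], max_width: float = 6.0) -> Dict[Junction, float]:
    """Return a line width per junction, proportional to its read count.

    The best-supported junction gets max_width; all others scale linearly.
    """
    peak = max(junction_reads.values(), default=0)
    if peak == 0:
        # No junction reads in the region: every arc has zero width.
        return {junction: 0.0 for junction in junction_reads}
    return {junction: max_width * reads / peak for junction, reads in junction_reads.items()}
```

The resulting widths could be passed to any plotting library as line widths when drawing the arcs between exon pairs.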


RNA-Seq Alignment to Individualized Genomes Improves Transcript Abundance Estimates in Multiparent Populations

Genetics ◽

10.1534/genetics.114.165886 ◽

2014 ◽

Vol 198 (1) ◽

pp. 59-73 ◽

Cited By ~ 55

Author(s):

Steven C. Munger ◽

Narayanan Raghupathy ◽

Kwangbom Choi ◽

Allen K. Simons ◽

Daniel M. Gatti ◽

...

Keyword(s):

Transcript Abundance ◽

Rna Seq ◽

Abundance Estimates


Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

F1000Research ◽

10.12688/f1000research.7563.2 ◽

2016 ◽

Vol 4 ◽

pp. 1521 ◽

Cited By ~ 268

Author(s):

Charlotte Soneson ◽

Michael I. Love ◽

Mark D. Robinson

Keyword(s):

Statistical Inference ◽

High Throughput Sequencing ◽

Real Data ◽

Transcript Level ◽

R Package ◽

Data Sets ◽

Rna Seq ◽

Abundance Estimates ◽

Gene Level ◽

Genomic Regions

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
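The basic operation behind moving from transcript-level to gene-level abundance estimates is a sum over each gene's transcripts. The Python sketch below shows only that aggregation step. It is not the tximport API (tximport is an R package, and it additionally derives the offsets mentioned in the abstract), and the function and argument names are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def summarize_to_gene(
    transcript_estimates: Dict[str, float],
    transcript_to_gene: Dict[str, str],
) -> Dict[str, float]:
    """Sum transcript-level abundance estimates within each gene.

    Transcripts missing from the transcript-to-gene map are skipped
    (an assumption of this sketch, not documented tximport behavior).
    """
    gene_totals: Dict[str, float] = defaultdict(float)
    for transcript, estimate in transcript_estimates.items():
        gene = transcript_to_gene.get(transcript)
        if gene is None:
            continue
        gene_totals[gene] += estimate
    return dict(gene_totals)
```

For example, estimates of 10 and 30 for two transcripts of the same gene yield a gene-level estimate of 40.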


Relative Abundance of Transcripts (RATs): Identifying differential isoform abundance from RNA-seq

F1000Research ◽

10.12688/f1000research.17916.1 ◽

2019 ◽

Vol 8 ◽

pp. 213 ◽

Cited By ~ 6

Author(s):

Kimon Froussios ◽

Kira Mourão ◽

Gordon Simpson ◽

Geoff Barton ◽

Nicholas Schurch

Keyword(s):

Matthews Correlation Coefficient ◽

Transcript Abundance ◽

R Package ◽

Effect Sizes ◽

Rna Seq ◽

Threshold Values ◽

Qrt Pcr ◽

False Discovery ◽

Abundance Estimates ◽

Higher Sensitivity

The biological importance of changes in RNA expression is reflected by the wide variety of tools available to characterise these changes from RNA-seq data. Several tools exist for detecting differential transcript isoform usage (DTU) from aligned or assembled RNA-seq data, but few exist for DTU detection from alignment-free RNA-seq quantifications. We present RATs, an R package that identifies DTU transcriptome-wide directly from transcript abundance estimates. RATs is unique in applying bootstrapping to estimate the reliability of detected DTU events and shows good performance at all replication levels (median false positive fraction < 0.05). We compare RATs to two existing DTU tools, DRIM-Seq and SUPPA2, using two publicly available simulated RNA-seq datasets and a published human RNA-seq dataset, in which 248 genes have been previously identified as displaying significant DTU. RATs with default threshold values on the simulated human data has a sensitivity of 0.55, a Matthews correlation coefficient of 0.71 and a false discovery rate (FDR) of 0.04, outperforming both other tools. Applying the same thresholds for SUPPA2 results in a higher sensitivity (0.61) but poorer FDR performance (0.33). RATs and DRIM-Seq use different methods for measuring DTU effect sizes, which complicates the comparison of results between these tools. However, for a likelihood-ratio threshold of 30, DRIM-Seq has similar FDR performance to RATs (0.06) but worse sensitivity (0.47). These differences persist for the simulated Drosophila dataset. On the published human RNA-seq dataset, the greatest agreement between the tools tested is 53%, observed between RATs and SUPPA2. The bootstrapping quality filter in RATs is responsible for removing the majority of DTU events called by SUPPA2 that are not reported by RATs. All methods, including the previously published qRT-PCR of three of the 248 detected DTU events, were found to be sensitive to annotation differences between Ensembl v60 and v87.
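The three performance measures quoted in this abstract (sensitivity, false discovery rate, and Matthews correlation coefficient) all derive from the same confusion-matrix counts. The Python sketch below uses the standard textbook definitions; the function name is hypothetical, and the counts in the test are made up rather than taken from the paper.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, FDR, and MCC from confusion-matrix counts.

    Each measure is NaN when its denominator is zero.
    """
    nan = float("nan")
    # Sensitivity: fraction of true DTU genes that were called.
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    # FDR: fraction of called genes that are not true DTU genes.
    fdr = fp / (tp + fp) if (tp + fp) else nan
    # MCC: correlation between calls and truth, ranging from -1 to 1.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else nan
    return {"sensitivity": sensitivity, "fdr": fdr, "mcc": mcc}
```

Because the Matthews correlation coefficient uses all four counts, it can be high even when sensitivity is moderate, provided that false positives are rare relative to the large number of true negatives.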


Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference

10.1101/021592 ◽

2015 ◽

Cited By ~ 80

Author(s):

Rob Patro ◽

Geet Duggal ◽

Michael I Love ◽

Rafael A Irizarry ◽

Carl Kingsford

Keyword(s):

Differential Expression Analysis ◽

Gc Content ◽

Transcript Abundance ◽

Dual Phase ◽

Rna Seq ◽

Read Mapping ◽

Mapping Procedure ◽

Abundance Estimates ◽

Order Of Magnitude ◽

Speed And Accuracy

We introduce Salmon, a new method for quantifying transcript abundance from RNA-seq reads that is highly accurate and very fast. Salmon is the first transcriptome-wide quantifier to model and correct for fragment GC content bias, which we demonstrate substantially improves the accuracy of abundance estimates and the reliability of subsequent differential expression analysis compared to existing methods that do not account for these biases. Salmon achieves its speed and accuracy by combining a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. These innovations yield both exceptional accuracy and order-of-magnitude speed benefits over alignment-based methods.


Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences

F1000Research ◽

10.12688/f1000research.7563.1 ◽

2015 ◽

Vol 4 ◽

pp. 1521 ◽

Cited By ~ 704

Author(s):

Charlotte Soneson ◽

Michael I. Love ◽

Mark D. Robinson

Keyword(s):

Statistical Inference ◽

High Throughput Sequencing ◽

Simulated Data ◽

Real Data ◽

Transcript Level ◽

R Package ◽

Rna Seq ◽

Abundance Estimates ◽

Gene Level ◽

The Difference

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Several different quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that while the presence of differential isoform usage can lead to inflated false discovery rates in differential expression analyses on simple count matrices and transcript-level abundance estimates improve the performance in simulated data, the difference is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
