Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation

Mapping Intimacies ◽

10.1101/025767 ◽

2015 ◽

Cited By ~ 6

Author(s):

Michael I Love ◽

John B Hogenesch ◽

Rafael A Irizarry

Keyword(s):

Computational Methods ◽

GC Content ◽

Science Research ◽

Transcript Abundance ◽

Transcript Expression ◽

RNA-seq ◽

Fold Reduction ◽

Visualization Tools ◽

Positive Results ◽

Sequence Bias

RNA-seq technology is widely used in biomedical and basic science research. These studies rely on complex computational methods that quantify expression levels for observed transcripts. We find that current computational methods can lead to hundreds of false positive results related to alternative isoform usage. This flaw in the current methodology stems from a lack of modeling sample-specific bias that leads to drops in coverage and is related to sequence features like fragment GC content and GC stretches. By incorporating features that explain this bias into transcript expression models, we greatly increase the specificity of transcript expression estimates, with more than a four-fold reduction in the number of false positives for reported changes in expression. We introduce alpine, a method for estimation of bias-corrected transcript abundance. The method is available as a Bioconductor package that includes data visualization tools useful for bias discovery.
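
The two sequence features named in the abstract above, fragment GC content and GC stretches, can be illustrated with a minimal Python sketch. This is not the alpine implementation; the function names and the exact definition of a "GC stretch" (here, the longest run of consecutive G or C bases) are assumptions made for illustration only.

```python
def gc_content(fragment: str) -> float:
    """Fraction of G or C bases in a fragment (0.0 for an empty string)."""
    seq = fragment.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_gc_stretch(fragment: str) -> int:
    """Length of the longest run of consecutive G/C bases (illustrative definition)."""
    longest = current = 0
    for base in fragment.upper():
        # Extend the current run on G/C, reset it on any other base.
        current = current + 1 if base in "GC" else 0
        longest = max(longest, current)
    return longest
```

Features like these could then serve as covariates in a model of fragment coverage; how they enter the model is described in the paper itself, not in this sketch.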

Identifying differential isoform abundance with RATs: a universal tool and a warning

10.1101/132761 ◽

2017 ◽

Cited By ~ 8

Author(s):

Kimon Froussios ◽

Kira Mourão ◽

Gordon G. Simpson ◽

Geoffrey J. Barton ◽

Nick J. Schurch

Keyword(s):

Transcript Abundance ◽

PCR Primers ◽

R Package ◽

Original Study ◽

Transcript Expression ◽

RNA-seq ◽

qRT-PCR ◽

False Discovery ◽

Alignment Free ◽

GO Terms

Abstract Motivation: The biological importance of changes in gene and transcript expression is well recognised and is reflected by the wide variety of tools available to characterise these changes. Regulation via Differential Transcript Usage (DTU) is emerging as an important phenomenon. Several tools exist for the detection of DTU from read alignment or assembly data, but options for detection of DTU from alignment-free quantifications are limited. Results: We present an R package named RATs (Relative Abundance of Transcripts) that identifies DTU transcriptome-wide directly from transcript abundance estimations. RATs is agnostic to quantification methods and exploits bootstrapped quantifications, if available, to inform the significance of detected DTU events. RATs contextualises the DTU results and shows good False Discovery performance (median FDR ≤0.05) at all replication levels. We applied RATs to a human RNA-seq dataset associated with idiopathic pulmonary fibrosis with three DTU events validated by qRT-PCR. RATs found all three genes exhibited statistically significant changes in isoform proportions based on Ensembl v60 annotations, but the DTU for two were not reliably reproduced across bootstrapped quantifications. RATs also identified 500 novel DTU events that are enriched for eleven GO terms related to regulation of the response to stimulus, regulation of immune system processes, and symbiosis/parasitism. Repeating this analysis with the Ensembl v87 annotation showed the isoform abundance profiles of two of the three validated DTU genes changed radically. RATs identified 414 novel DTU events that are enriched for five GO terms, none of which are in common with those previously identified. Only 141 of the DTU events are common between the two analyses, and only 8 are among the 248 reported by the original study. Furthermore, the original qRT-PCR probes no longer match uniquely to their original transcripts, calling into question the interpretation of these data. We suggest parallel full-length isoform sequencing, annotation pre-filtering and sequencing of the transcripts captured by qRT-PCR primers as possible ways to improve the validation of RNA-seq results in future experiments. Availability: The package is available through GitHub at https://github.com/bartongroup/Rats.

Differential Transcript Usage Analysis Incorporating Quantification Uncertainty Via Compositional Measurement Error Regression Modeling

10.1101/2020.05.22.111450 ◽

2020 ◽

Cited By ~ 1

Author(s):

Scott Van Buren ◽

Naim Rashid

Keyword(s):

Transcript Level ◽

Transcript Abundance ◽

RNA-seq ◽

Continuous Covariates ◽

Abundance Estimates ◽

Testing Performance ◽

Computational Procedures ◽

Computational Simplicity ◽

Usage Analysis ◽

Positive Results

Differential transcript usage (DTU) occurs when the relative transcript abundance of a gene changes between different conditions. Existing approaches to analyze DTU often rely on computational procedures that can have speed and scalability issues as the number of samples increases. In this paper, we propose a new method, termed CompDTU, that utilizes compositional regression to model the transcript-level relative abundance proportions that are of interest in DTU analyses. Because of its relative computational simplicity, this procedure does not suffer from speed and scalability issues, making it ideally suited for DTU analysis with large sample sizes. The method also allows for the testing of and controlling for multiple categorical or continuous covariates. Additionally, many existing approaches for DTU ignore the quantification uncertainty present in RNA-Seq data, although prior work has shown that accounting for such uncertainty may improve testing performance. We extend our CompDTU method to incorporate quantification uncertainty using bootstrap replicates of abundance estimates from Salmon and term this method CompDTUme. Through several power analyses, we show that CompDTU improves sensitivity and reduces false positive results relative to existing methods. Additionally, CompDTUme results in further improvements in performance over CompDTU with sufficient sample size for genes with high levels of quantification uncertainty while maintaining favorable speed and scalability.
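
The definition of DTU in the first sentence above can be made concrete with a small Python sketch that converts transcript abundances into within-gene proportions. It is only an illustration of the quantity being compared, not the CompDTU method; the function name, the `tx2gene` mapping, and the abundance values are all hypothetical.

```python
from collections import defaultdict


def isoform_proportions(abundance, tx2gene):
    """Return each transcript's share of its gene's total abundance."""
    gene_totals = defaultdict(float)
    for tx, value in abundance.items():
        gene_totals[tx2gene[tx]] += value
    proportions = {}
    for tx, value in abundance.items():
        total = gene_totals[tx2gene[tx]]
        # A gene with zero total abundance has undefined usage; report 0.0 here.
        proportions[tx] = value / total if total > 0 else 0.0
    return proportions


# Hypothetical abundances for one gene with two transcripts.
tx2gene = {"tx1": "geneA", "tx2": "geneA"}
condition_a = isoform_proportions({"tx1": 30.0, "tx2": 10.0}, tx2gene)
condition_b = isoform_proportions({"tx1": 10.0, "tx2": 30.0}, tx2gene)
```

In this made-up example the gene total is 40 in both conditions, so gene-level expression is unchanged, while the share of `tx1` moves from 0.75 to 0.25. That shift in proportions, without a change in the total, is what a DTU test is designed to detect.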

Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation

Nature Biotechnology ◽

10.1038/nbt.3682 ◽

2016 ◽

Vol 34 (12) ◽

pp. 1287-1291 ◽

Cited By ~ 72

Author(s):

Michael I Love ◽

John B Hogenesch ◽

Rafael A Irizarry

Keyword(s):

Transcript Abundance ◽

Systematic Errors ◽

Abundance Estimation ◽

RNA-seq ◽

Sequence Bias

Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference

10.1101/021592 ◽

2015 ◽

Cited By ~ 80

Author(s):

Rob Patro ◽

Geet Duggal ◽

Michael I Love ◽

Rafael A Irizarry ◽

Carl Kingsford

Keyword(s):

Differential Expression Analysis ◽

GC Content ◽

Transcript Abundance ◽

Dual Phase ◽

RNA-seq ◽

Read Mapping ◽

Mapping Procedure ◽

Abundance Estimates ◽

Order Of Magnitude ◽

Speed And Accuracy

We introduce Salmon, a new method for quantifying transcript abundance from RNA-seq reads that is highly accurate and very fast. Salmon is the first transcriptome-wide quantifier to model and correct for fragment GC content bias, which we demonstrate substantially improves the accuracy of abundance estimates and the reliability of subsequent differential expression analysis compared to existing methods that do not account for these biases. Salmon achieves its speed and accuracy by combining a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. These innovations yield both exceptional accuracy and order-of-magnitude speed benefits over alignment-based methods.

Faculty Opinions recommendation of Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.14267340.15779565 ◽

2012 ◽

Author(s):

Marylyn Ritchie ◽

Stephen Turner

Keyword(s):

Expression Analysis ◽

Transcript Expression ◽

RNA-seq ◽

Differential Gene

Zea mays RNA-seq estimated transcript abundances are strongly affected by read mapping bias

BMC Genomics ◽

10.1186/s12864-021-07577-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Shuhua Zhan ◽

Cortland Griswold ◽

Lewis Lukens

Keyword(s):

Gene Expression ◽

Zea Mays ◽

Reference Genome ◽

Transcript Abundance ◽

Gene Transcript ◽

RNA-seq ◽

Individual Genome ◽

Abundance Estimates ◽

Mapping Bias ◽

Quantify Gene Expression

Abstract Background: Genetic variation for gene expression is a source of phenotypic variation for natural and agricultural species. The common approach to map and to quantify gene expression from genetically distinct individuals is to assign their RNA-seq reads to a single reference genome. However, RNA-seq reads from alleles dissimilar to this reference genome may fail to map correctly, causing transcript levels to be underestimated. Presently, the extent of this mapping problem is not clear, particularly in highly diverse species. We investigated whether mapping bias occurred and whether chromosomal features were associated with mapping bias. Zea mays presents a model species to assess these questions, given it has genotypically distinct and well-studied genetic lines. Results: In Zea mays, the inbred B73 genome is the standard reference genome and template for RNA-seq read assignments. In the absence of mapping bias, B73 and a second inbred line, Mo17, would each have an approximately equal number of regulatory alleles that increase gene expression. Remarkably, Mo17 had 2–4 times fewer such positively acting alleles than did B73 when RNA-seq reads were aligned to the B73 reference genome. Reciprocally, over one-half of the B73 alleles that increased gene expression were not detected when reads were aligned to the Mo17 genome template. Genes at dissimilar chromosomal ends were strongly affected by mapping bias, and genes at more similar pericentromeric regions were less affected. Biased transcript estimates were higher in untranslated regions and lower in splice junctions. Bias occurred across software and alignment parameters. Conclusions: Mapping bias very strongly affects gene transcript abundance estimates in maize, and bias varies across chromosomal features. Individual genome or transcriptome templates are likely necessary for accurate transcript estimation across genetically variable individuals in maize and other species.

FOXO1 Mitigation of FOXL2C143W/SMAD3 Transcriptomic Landscape in a Model of Granulosa Cell Tumor

Journal of the Endocrine Society ◽

10.1210/jendso/bvab048.2084 ◽

2021 ◽

Vol 5 (Supplement_1) ◽

pp. A1018-A1019

Author(s):

Christian Secchi ◽

Paola Benaglio ◽

Francesca Mulas ◽

Martina Belli ◽

Dwayne Stupack ◽

...

Keyword(s):

Granulosa Cell ◽

Granulosa Cell Tumor ◽

Direct Consequence ◽

Transcript Abundance ◽

Cell Tumor ◽

Gene Promoters ◽

RNA-seq ◽

Illumina HiSeq ◽

Rare Type ◽

Exact Test

Abstract Background: Adult granulosa cell tumor (aGCT) is a rare type of malignant stromal cell cancer of the ovary. Postmenopausal genital bleeding is the main clinical sign of aGCT and is attributed to estrogen excess driven by CYP19 upregulation. Typically, aGCTs that are diagnosed at an initial stage can be treated with surgery. However, recurrences are mostly fatal (1). Current studies are focused on finding new molecular markers and targets for treating aGCT recurrence. Between 95% and 97% of aGCTs harbor a somatic mutation in the FOXL2 gene, Cys134Trp (c.402C>G) (2). A TGF-β pathway protein, SMAD3, was identified as an essential partner in FOXL2C134W transcriptional activity driving CYP19 upregulation (3). Recently, the antitumoral FOXO1 gene has been recognized as a potential target for suppressing the FOXL2C134W pathogenic action (4). Aim: The objective of this study was to examine whether FOXO1 upregulation affects the FOXL2C134W/SMAD3 transcriptomic landscape. Methods: RNA-seq analysis was performed comparing the effect of FOXL2WT/SMAD3 and FOXL2C134W/SMAD3 overexpression in the presence of FOXO1 by transfection of an established human GC line (HGrC1). RNA-seq libraries were prepared using the Illumina TruSeq kit and sequenced on an Illumina HiSeq 4000 platform. To quantify transcript abundance for each sample we used Salmon (1.1.0) with default parameters, using indexes from hg38. Data were subsequently imported into R using the tximport package and processed with the DESeq2 package. Results: RNA-seq data show that FOXL2C134W/SMAD3 significantly drives 717 genes compared with the WT and enabled us to identify targets (TGFB2, SMARCA4, HSPG2, MKI67, NFKBIA) and neoplastic pathways directly associated with the mutant. To provide evidence that the differences in gene expression were a direct consequence of FOXL2 binding, we annotated gene promoters with previously published FOXL2 ChIP-seq analysis. The majority (73-40%) of the differentially expressed genes (DEGs) between FOXL2C134W and FOXL2WT had a FOXL2 binding site at their promoters, which was a significantly higher proportion than in non-DEGs (Fisher’s exact test, murine: p = 7.9 × 10^-157; human: p = 9.9 × 10^-39). Surprisingly, the number of DEGs between FOXL2C134W + FOXO1 and FOXL2WT was much lower (230) than the number of DEGs between FOXL2C134W and FOXL2WT (717, of which 130 were in common; linear regression slope β = 0.58), suggesting that the effect of FOXL2C134W compared with FOXL2WT is moderated by the addition of FOXO1. Conclusions: Our transcriptomic study provides the first evidence that FOXO1 can efficiently mitigate 40% of the altered genome-wide effect specifically related to FOXL2C134W in a model of human aGCT. References: 1. Farkkila, A. et al. Ann Med (2017). 2. Jamieson, S. & Fuller, P. J. Endocr Rev (2012). 3. Belli, M. et al. Endocrinology (2018). 4. Belli, M. et al. J Endocr Soc (2019).
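
The enrichment comparison reported above (binding sites among DEGs versus non-DEGs) rests on Fisher's exact test for a 2×2 table. The abstract does not give the underlying counts or say whether the test was one- or two-sided, so the following Python sketch is only a generic one-sided version built from the hypergeometric distribution; the function name and the example table are hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher p-value for the table [[a, b], [c, d]].

    Returns P(X >= a), where X is the top-left cell under the
    hypergeometric null with all row and column totals held fixed.
    """
    row1 = a + b          # e.g. DEGs
    col1 = a + c          # e.g. genes with a binding site
    n = a + b + c + d     # all genes
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, col1)


# Hypothetical small table: 3 of 4 "DEGs" and 1 of 4 "non-DEGs" have a site.
p_value = fisher_exact_greater(3, 1, 1, 3)
```

For this made-up table the p-value is 17/70, about 0.243. Tables with the gene counts of a real transcriptome would give far smaller values, as in the figures quoted in the abstract.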

Alignment and mapping methodology influence transcript abundance estimation

10.1101/657874 ◽

2019 ◽

Cited By ~ 6

Author(s):

Avi Srivastava ◽

Laraib Malik ◽

Hirak Sarkar ◽

Mohsen Zakeri ◽

Fatemeh Almodaresi ◽

...

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Computational Cost ◽

Simulated Data ◽

Transcript Abundance ◽

Mapping Method ◽

RNA-seq ◽

Transcript Quantification ◽

Quantification Model

Abstract Background: The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results: We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally-acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion: We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.

YeATS - a tool suite for analyzing RNA-seq derived transcriptome identifies a highly transcribed putative extensin in heartwood/sapwood transition zone in black walnut

F1000Research ◽

10.12688/f1000research.6617.2 ◽

2015 ◽

Vol 4 ◽

pp. 155 ◽

Cited By ~ 17

Author(s):

Sandeep Chakraborty ◽

Monica Britton ◽

Jill Wegrzyn ◽

Timothy Butterfield ◽

Pedro José Martínez-García ◽

...

Keyword(s):

Transition Zone ◽

Transcript Abundance ◽

Black Walnut ◽

Open Reading Frames ◽

RNA-seq ◽

Reading Frame ◽

Homologous Genes ◽

Associated Proteins ◽

Molecular Components ◽

Reading Frames

The transcriptome provides a functional footprint of the genome by enumerating the molecular components of cells and tissues. The field of transcript discovery has been revolutionized through high-throughput mRNA sequencing (RNA-seq). Here, we present a methodology that replicates and improves existing methodologies, and implements a workflow for error estimation and correction followed by genome annotation and transcript abundance estimation for RNA-seq derived transcriptome sequences (YeATS - Yet Another Tool Suite for analyzing RNA-seq derived transcriptome). A unique feature of YeATS is the upfront determination of the errors in the sequencing or transcript assembly process by analyzing open reading frames of transcripts. YeATS identifies transcripts that have not been merged, result in broken open reading frames or contain long repeats as erroneous transcripts. We present the YeATS workflow using a representative sample of the transcriptome from the tissue at the heartwood/sapwood transition zone in black walnut. A novel feature of the transcriptome that emerged from our analysis was the identification of a highly abundant transcript that had no known homologous genes (GenBank accession: KT023102). The amino acid composition of the longest open reading frame of this gene classifies this as a putative extensin. Also, we corroborated the transcriptional abundance of proline-rich proteins, dehydrins, senescence-associated proteins, and the DNAJ family of chaperone proteins. Thus, YeATS presents a workflow for analyzing RNA-seq data with several innovative features that differentiate it from existing software.
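
The abstract above leans on the analysis of open reading frames, including the "longest open reading frame" of a transcript. As a rough illustration of that idea, and not of the YeATS workflow itself, here is a minimal Python sketch that scans the three forward frames for the longest ATG-to-stop ORF. The function name, the forward-strand-only scan, and the requirement of an in-frame stop codon are simplifying assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> str:
    """Longest ATG...stop ORF in the three forward frames ('' if none)."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]         # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None                   # look for the next ORF
    return best
```

A transcript whose longest ORF is unexpectedly short, or that has an ATG with no in-frame stop, would be a candidate for closer inspection in this simplified view; the actual error criteria used by YeATS are those stated in the paper.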

Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench

10.1101/2020.05.22.111211 ◽

2020 ◽

Author(s):

Ruben Chazarra-Gil ◽

Stijn van Dongen ◽

Vladimir Yu Kiselev ◽

Martin Hemberg

Keyword(s):

Single Cell ◽

Computational Methods ◽

RNA-seq ◽

Batch Effects ◽

Systematic Comparison ◽

Batch Correction ◽

Link Type ◽

Biological Signals ◽

The Cost

Abstract: As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although there are several computational methods available that can remove batch effects, evaluating which method performs best is not straightforward. Here we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlighting their methodological differences and assessing their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.
