Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation

2015 ◽  
Author(s):  
Michael I Love ◽  
John B Hogenesch ◽  
Rafael A Irizarry

RNA-seq technology is widely used in biomedical and basic science research. These studies rely on complex computational methods that quantify expression levels for observed transcripts. We find that current computational methods can lead to hundreds of false positive results related to alternative isoform usage. This flaw in the current methodology stems from a lack of modeling sample-specific bias that leads to drops in coverage and is related to sequence features like fragment GC content and GC stretches. By incorporating features that explain this bias into transcript expression models, we greatly increase the specificity of transcript expression estimates, with more than a four-fold reduction in the number of false positives for reported changes in expression. We introduce alpine, a method for estimation of bias-corrected transcript abundance. The method is available as a Bioconductor package that includes data visualization tools useful for bias discovery.
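To make the sequence features named in this abstract concrete, the following is a minimal Python sketch that computes a fragment's GC content and its longest run of consecutive G/C bases. The function names are hypothetical, and this is not alpine's implementation (alpine is distributed as a Bioconductor package); it only illustrates the kind of per-fragment covariates a bias model could use.

```python
def gc_content(fragment: str) -> float:
    """Return the fraction of G or C bases in a fragment (0.0 if empty)."""
    seq = fragment.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_gc_stretch(fragment: str) -> int:
    """Return the length of the longest consecutive run of G/C bases."""
    longest = current = 0
    for base in fragment.upper():
        # Extend the current run on G/C, otherwise reset it.
        current = current + 1 if base in "GC" else 0
        longest = max(longest, current)
    return longest
```

In a bias-aware model, features like these would be computed for each potential fragment of a transcript and related to observed coverage on a per-sample basis.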

2017 ◽  
Author(s):  
Kimon Froussios ◽  
Kira Mourão ◽  
Gordon G. Simpson ◽  
Geoffrey J. Barton ◽  
Nick J. Schurch

Abstract Motivation: The biological importance of changes in gene and transcript expression is well recognised and is reflected by the wide variety of tools available to characterise these changes. Regulation via Differential Transcript Usage (DTU) is emerging as an important phenomenon. Several tools exist for the detection of DTU from read alignment or assembly data, but options for detection of DTU from alignment-free quantifications are limited. Results: We present an R package named RATs (Relative Abundance of Transcripts) that identifies DTU transcriptome-wide directly from transcript abundance estimations. RATs is agnostic to quantification methods and exploits bootstrapped quantifications, if available, to inform the significance of detected DTU events. RATs contextualises the DTU results and shows good false discovery performance (median FDR ≤0.05) at all replication levels. We applied RATs to a human RNA-seq dataset associated with idiopathic pulmonary fibrosis with three DTU events validated by qRT-PCR. RATs found that all three genes exhibited statistically significant changes in isoform proportions based on Ensembl v60 annotations, but the DTU for two was not reliably reproduced across bootstrapped quantifications. RATs also identified 500 novel DTU events that are enriched for eleven GO terms related to regulation of the response to stimulus, regulation of immune system processes, and symbiosis/parasitism. Repeating this analysis with the Ensembl v87 annotation showed that the isoform abundance profiles of two of the three validated DTU genes changed radically. RATs identified 414 novel DTU events that are enriched for five GO terms, none of which are in common with those previously identified. Only 141 of the DTU events are common between the two analyses, and only 8 are among the 248 reported by the original study.
Furthermore, the original qRT-PCR probes no longer match uniquely to their original transcripts, calling into question the interpretation of these data. We suggest parallel full-length isoform sequencing, annotation pre-filtering and sequencing of the transcripts captured by qRT-PCR primers as possible ways to improve the validation of RNA-seq results in future experiments. Availability: The package is available through GitHub at https://github.com/bartongroup/Rats.


Author(s):  
Scott Van Buren ◽  
Naim Rashid

Differential transcript usage (DTU) occurs when the relative transcript abundance of a gene changes between different conditions. Existing approaches to analyze DTU often rely on computational procedures that can have speed and scalability issues as the number of samples increases. In this paper, we propose a new method, termed CompDTU, that utilizes compositional regression to model transcript-level relative abundance proportions that are of interest in DTU analyses. This procedure does not suffer from speed and scalability issues due to the relative computational simplicity, making it ideally suited for DTU analysis with large sample sizes. The method also allows for the testing of and controlling for multiple categorical or continuous covariates. Additionally, many existing approaches for DTU ignore quantification uncertainty present in RNA-Seq data, where prior work has shown that accounting for such uncertainty may improve testing performance. We extend our CompDTU method to incorporate quantification uncertainty using bootstrap replicates of abundance estimates from Salmon and term this method CompDTUme. Through several power analyses, we show that CompDTU improves sensitivity and reduces false positive results relative to existing methods. Additionally, CompDTUme results in further improvements in performance over CompDTU with sufficient sample size for genes with high levels of quantification uncertainty while maintaining favorable speed and scalability.
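The quantity modeled in a DTU analysis can be illustrated with a short Python sketch: within-gene transcript proportions, followed by an additive log-ratio transform, which is one standard way to map compositions into an unconstrained space before regression. The abstract does not say which transform CompDTU uses, so the transform, the pseudocount, and the function names here are assumptions for illustration only.

```python
import math


def transcript_proportions(counts, pseudocount=0.5):
    """Convert per-transcript abundances for one gene into proportions.

    The pseudocount (an assumed value) keeps every proportion strictly
    positive so that log-ratios are defined.
    """
    adjusted = [c + pseudocount for c in counts]
    total = sum(adjusted)
    if total == 0:
        raise ValueError("gene has zero total abundance")
    return [a / total for a in adjusted]


def additive_log_ratio(proportions):
    """Log-ratio of each proportion against the last transcript (the reference)."""
    reference = proportions[-1]
    return [math.log(p / reference) for p in proportions[:-1]]
```

For a gene with abundances `[10, 30, 60]` and `pseudocount=0`, the proportions are approximately 0.1, 0.3 and 0.6; a shift in these proportions between conditions, rather than in the gene's total, is what DTU refers to.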


2016 ◽  
Vol 34 (12) ◽  
pp. 1287-1291 ◽  
Author(s):  
Michael I Love ◽  
John B Hogenesch ◽  
Rafael A Irizarry

2015 ◽  
Author(s):  
Rob Patro ◽  
Geet Duggal ◽  
Michael I Love ◽  
Rafael A Irizarry ◽  
Carl Kingsford

We introduce Salmon, a new method for quantifying transcript abundance from RNA-seq reads that is highly accurate and very fast. Salmon is the first transcriptome-wide quantifier to model and correct for fragment GC content bias, which we demonstrate substantially improves the accuracy of abundance estimates and the reliability of subsequent differential expression analysis compared to existing methods that do not account for these biases. Salmon achieves its speed and accuracy by combining a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. These innovations yield both exceptional accuracy and order-of-magnitude speed benefits over alignment-based methods.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Shuhua Zhan ◽  
Cortland Griswold ◽  
Lewis Lukens

Abstract Background: Genetic variation for gene expression is a source of phenotypic variation for natural and agricultural species. The common approach to map and to quantify gene expression from genetically distinct individuals is to assign their RNA-seq reads to a single reference genome. However, RNA-seq reads from alleles dissimilar to this reference genome may fail to map correctly, causing transcript levels to be underestimated. Presently, the extent of this mapping problem is not clear, particularly in highly diverse species. We investigated whether mapping bias occurred and whether chromosomal features were associated with mapping bias. Zea mays presents a model species to assess these questions, given it has genotypically distinct and well-studied genetic lines. Results: In Zea mays, the inbred B73 genome is the standard reference genome and template for RNA-seq read assignments. In the absence of mapping bias, B73 and a second inbred line, Mo17, would each have an approximately equal number of regulatory alleles that increase gene expression. Remarkably, Mo17 had 2–4 times fewer such positively acting alleles than did B73 when RNA-seq reads were aligned to the B73 reference genome. Reciprocally, over one-half of the B73 alleles that increased gene expression were not detected when reads were aligned to the Mo17 genome template. Genes at dissimilar chromosomal ends were strongly affected by mapping bias, and genes at more similar pericentromeric regions were less affected. Biased transcript estimates were higher in untranslated regions and lower in splice junctions. Bias occurred across software and alignment parameters. Conclusions: Mapping bias very strongly affects gene transcript abundance estimates in maize, and bias varies across chromosomal features. Individual genome or transcriptome templates are likely necessary for accurate transcript estimation across genetically variable individuals in maize and other species.


2021 ◽  
Vol 5 (Supplement_1) ◽  
pp. A1018-A1019
Author(s):  
Christian Secchi ◽  
Paola Benaglio ◽  
Francesca Mulas ◽  
Martina Belli ◽  
Dwayne Stupack ◽  
...  

Abstract Background: Adult granulosa cell tumor (aGCT) is a rare type of stromal cell malignant cancer of the ovary. Postmenopausal genital bleeding is the main aGCT clinical sign, which is attributed to estrogen excess driven by CYP19 upregulation. Typically, aGCTs that are diagnosed at an initial stage can be treated with surgery. However, recurrences are mostly fatal [1]. Current studies are focused on finding new molecular markers and targets for treating aGCT recurrence. Between 95-97% of aGCTs harbor a somatic mutation in the FOXL2 gene, Cys134Trp (c.402C>G) [2]. A TGF-β pathway protein, SMAD3, was identified as an essential partner in FOXL2C134W transcriptional activity driving CYP19 upregulation [3]. Recently, the antitumoral FOXO1 gene has been recognized as a potential target for suppressing the FOXL2C134W pathogenic action [4]. Aim: The objective of this study was to examine whether FOXO1 upregulation affects the FOXL2C134W/SMAD3 transcriptomic landscape. Methods: RNA-seq analysis was performed comparing the effect of FOXL2WT/SMAD3 and FOXL2C134W/SMAD3 overexpression in the presence of FOXO1 by transfection of an established human GC line (HGrC1). RNA-seq libraries were prepared using the Illumina TruSeq kit and sequenced using an Illumina HiSeq 4000 platform. To quantify transcript abundance for each sample we used salmon (1.1.0) with default parameters, using indexes from hg38. Data were subsequently imported in R using the tximport package and processed with the DESeq2 package. Results: RNA-seq data show that FOXL2C134W/SMAD3 significantly drives 717 genes compared with the WT and enabled us to identify targets (TGFB2, SMARCA4, HSPG2, MKI67, NFKBIA) and neoplastic pathways directly associated with the mutant. To provide evidence that the differences in gene expression were a direct consequence of FOXL2 binding, we annotated gene promoters with previously published FOXL2 ChIP-seq analysis.
The majority (73-40%) of the differentially expressed genes (DEGs) between FOXL2C134W and FOXL2WT had a FOXL2 binding site at their promoters, which was a significantly higher proportion than in non-DEGs (Fisher’s exact test, murine: p = 7.9 × 10⁻¹⁵⁷; human: p = 9.9 × 10⁻³⁹). Surprisingly, the number of DEGs between FOXL2C134W + FOXO1 and FOXL2WT was much lower (230) than the number of DEGs between FOXL2C134W and FOXL2WT (717, of which 130 were in common; linear regression slope β = 0.58), suggesting that the effect of FOXL2C134W compared with FOXL2WT is moderated by the addition of FOXO1. Conclusions: Our transcriptomic study provides the first evidence that FOXO1 can efficiently mitigate 40% of the altered genome-wide effect specifically related to FOXL2C134W in a model of human aGCT. References: [1] Farkkila, A. et al. Ann Med (2017). [2] Jamieson, S. & Fuller, P. J. Endocr Rev (2012). [3] Belli, M. et al. Endocrinology (2018). [4] Belli, M. et al. J Endocr Soc (2019).


2019 ◽  
Author(s):  
Avi Srivastava ◽  
Laraib Malik ◽  
Hirak Sarkar ◽  
Mohsen Zakeri ◽  
Fatemeh Almodaresi ◽  
...  

Abstract Background: The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results: We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally-acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion: We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.


F1000Research ◽  
2015 ◽  
Vol 4 ◽  
pp. 155 ◽  
Author(s):  
Sandeep Chakraborty ◽  
Monica Britton ◽  
Jill Wegrzyn ◽  
Timothy Butterfield ◽  
Pedro José Martínez-García ◽  
...  

The transcriptome provides a functional footprint of the genome by enumerating the molecular components of cells and tissues. The field of transcript discovery has been revolutionized through high-throughput mRNA sequencing (RNA-seq). Here, we present a methodology that replicates and improves existing methodologies, and implements a workflow for error estimation and correction followed by genome annotation and transcript abundance estimation for RNA-seq derived transcriptome sequences (YeATS - Yet Another Tool Suite for analyzing RNA-seq derived transcriptome). A unique feature of YeATS is the upfront determination of the errors in the sequencing or transcript assembly process by analyzing open reading frames of transcripts. YeATS identifies transcripts that have not been merged, result in broken open reading frames or contain long repeats as erroneous transcripts. We present the YeATS workflow using a representative sample of the transcriptome from the tissue at the heartwood/sapwood transition zone in black walnut. A novel feature of the transcriptome that emerged from our analysis was the identification of a highly abundant transcript that had no known homologous genes (GenBank accession: KT023102). The amino acid composition of the longest open reading frame of this gene classifies this as a putative extensin. Also, we corroborated the transcriptional abundance of proline-rich proteins, dehydrins, senescence-associated proteins, and the DNAJ family of chaperone proteins. Thus, YeATS presents a workflow for analyzing RNA-seq data with several innovative features that differentiate it from existing software.


2020 ◽  
Author(s):  
Ruben Chazarra-Gil ◽  
Stijn van Dongen ◽  
Vladimir Yu Kiselev ◽  
Martin Hemberg

Abstract As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although there are several computational methods available that can remove batch effects, evaluating which method performs best is not straightforward. Here we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlighting their methodological differences, and assess their performance and computational requirements through a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.

