Uncertainty in RNA-seq gene expression data

Combining DGE and RNA-sequencing data to identify new polyA+ non-coding transcripts in the human genome

Nucleic Acids Research ◽

10.1093/nar/gkt1300 ◽

2013 ◽

Vol 42 (5) ◽

pp. 2820-2832 ◽

Cited By ~ 14

Author(s):

Nicolas Philippe ◽

Elias Bou Samra ◽

Anthony Boureux ◽

Alban Mancheron ◽

Florence Rufflé ◽

...

Keyword(s):

Human Genome ◽

Rna Sequencing ◽

Dynamic Range ◽

Tiling Array ◽

Expression Data ◽

Rna Seq ◽

Sequencing Data ◽

Data Set ◽

Protein Coding ◽

Protein Coding Genes

Abstract Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as ‘TranscriRef’). We then annotated 750 000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34 000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.

Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi

10.1101/219287 ◽

2017 ◽

Author(s):

Jens Keilwagen ◽

Frank Hartung ◽

Michael Paulini ◽

Sven O. Twardziok ◽

Jan Grau

Keyword(s):

Ab Initio ◽

Gene Prediction ◽

Nematode Species ◽

Intron Position ◽

Rna Seq ◽

Sequencing Data ◽

Protein Coding ◽

Protein Coding Genes ◽

Multiple Reference ◽

Transcript Identification

Motivation: Genome annotation is of key importance in many research questions. The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. Recently, it was demonstrated that intron position conservation improves homology-based gene prediction, and that experimental data improves ab-initio gene prediction. Results: Here, we present an extension of the gene prediction tool GeMoMa that utilizes amino acid sequence conservation, intron position conservation and optionally RNA-seq data for homology-based gene prediction. We show on published benchmark data for plants, animals and fungi that GeMoMa performs better than the gene prediction programs BRAKER1, MAKER2, and CodingQuarry, and purely RNA-seq-based pipelines for transcript identification. In addition, we demonstrate that using multiple reference organisms may help to further improve the performance of GeMoMa. Finally, we apply GeMoMa to four nematode species and to the recently published barley reference genome, indicating that current annotations of protein-coding genes may be refined using GeMoMa predictions. Availability: GeMoMa has been published under GNU GPL3 and is freely available at http://www.jstacs.de/index.php/[email protected]

Integrated modeling of protein-coding genes in the Manduca sexta genome using RNA-seq data from the biochemical model insect

10.1603/ice.2016.110841 ◽

2016 ◽

Cited By ~ 1

Author(s):

Xiaolong Cao

Keyword(s):

Integrated Modeling ◽

Rna Seq ◽

Protein Coding ◽

Protein Coding Genes ◽

Biochemical Model

Side-by-side analysis of alternative approaches on multi-level RNA-seq data

10.1101/131862 ◽

2017 ◽

Author(s):

Irina Mohorianu

Keyword(s):

Differential Expression ◽

Rna Seq ◽

Sequencing Data ◽

Essential Components ◽

Sequencing Quality ◽

Quality Checks ◽

Multi Level ◽

Using Data ◽

Key Steps ◽

Robust Prediction

Abstract Background: RNA sequencing (RNA-seq) is widely used for RNA quantification across environmental, biological and medical sciences; it enables the description of genome-wide patterns of expression and the deduction of regulatory interactions and networks. The aim of computational analyses is to achieve an accurate output, i.e. rigorous quantification of genes/transcripts to allow a reliable prediction of differential expression (DE), despite the variable levels of noise and biases present in sequencing data. The evaluation of sequencing quality and normalization are essential components of this process. Results: We investigate the discriminative power of existing approaches for the quality checking of mRNA-seq data and also propose additional, quantitative, quality checks. To accommodate the analysis of a nested, multi-level design using data on D. melanogaster, we incorporated the sample layout into the analysis. We describe a “subsampling without replacement”-based normalization and identification of DE that accounts for the experimental design, i.e. the hierarchy and amplitude of effect sizes within samples. We also evaluate the differential expression call in comparison to existing approaches. To assess the broader applicability of these methods, we applied this series of steps to a published set of H. sapiens mRNA-seq samples. Conclusions: The dataset-tailored methods improved sample comparability and delivered a robust prediction of subtle gene expression changes. Overall, the proposed approach offers the potential to improve key steps in the analysis of RNA-seq data by incorporating the structure and characteristics of biological experiments into the data analysis.
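
The idea of normalizing by subsampling without replacement can be illustrated with a minimal sketch. This is one plausible reading of the phrase, not the authors' implementation: each library is drawn down, without replacement, to a common read total so that samples of different depth become comparable. The function name `subsample_counts`, the choice of the smallest library as the target depth, and the toy counts are all assumptions made for illustration.

```python
import numpy as np

def subsample_counts(counts, target_total, rng):
    """Draw target_total reads without replacement from per-gene read counts."""
    counts = np.asarray(counts, dtype=np.int64)
    if target_total > counts.sum():
        raise ValueError("target_total exceeds the library size")
    # Sampling reads without replacement from a pool grouped by gene is a
    # multivariate hypergeometric draw.
    return rng.multivariate_hypergeometric(counts, target_total)

rng = np.random.default_rng(0)

# Hypothetical per-gene counts for two replicates of different depth.
samples = {"rep1": [500, 300, 200], "rep2": [2000, 900, 1100]}

# Assumption: normalize every library to the size of the smallest one.
target = min(sum(c) for c in samples.values())
normalized = {name: subsample_counts(c, target, rng) for name, c in samples.items()}

for name, vec in normalized.items():
    print(name, vec.tolist(), int(vec.sum()))
```

Because the draw is without replacement, no gene can end up with more reads than it had originally, and every normalized library has exactly the same total. The abstract's method additionally accounts for the nested experimental design, which this sketch does not attempt.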

Identifying inaccuracies in gene expression estimates from unstranded RNA-seq data

Scientific Reports ◽

10.1038/s41598-019-52584-w ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Mikhail Pomaznoy ◽

Ashu Sethi ◽

Jason Greenbaum ◽

Bjoern Peters

Keyword(s):

Gene Expression ◽

Differential Expression Analysis ◽

Cell Types ◽

Library Preparation ◽

Rna Seq ◽

Protein Coding ◽

Protein Coding Genes ◽

Machine Learning Model ◽

Specific Manner ◽

Library Preparation Protocol

Abstract RNA-seq methods are widely utilized for transcriptomic profiling of biological samples. However, there are known caveats of this technology which can skew the gene expression estimates. Specifically, if the library preparation protocol does not retain RNA strand information, then some genes can be erroneously quantitated. Although strand-specific protocols have been established, a significant portion of RNA-seq data is generated in a non-strand-specific manner. We used a comprehensive stranded RNA-seq dataset of 15 blood cell types to identify genes for which expression would be erroneously estimated if strand information was not available. We found that about 10% of all genes and 2.5% of protein coding genes have a two-fold or higher difference in estimated expression when strand information of the reads was ignored. We used parameters of read alignments of these genes to construct a machine learning model that can identify which genes in an unstranded dataset might have incorrect expression estimates and which ones do not. We also show that differential expression analysis of genes with biased expression estimates in unstranded read data can be recovered by limiting the reads considered to those which span exonic boundaries. The resulting approach is implemented as a package available at https://github.com/mikpom/uslcount.
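
The two-fold criterion mentioned in this abstract can be sketched as a simple comparison of per-gene estimates obtained with and without strand information. The snippet below is only an illustration under stated assumptions: the function name `strand_bias_flags`, the pseudocount of 1, and the gene names and counts are hypothetical, and it does not reproduce the machine learning model or the uslcount package.

```python
def strand_bias_flags(stranded, unstranded, threshold=2.0, pseudocount=1.0):
    """Return genes whose estimate changes at least threshold-fold
    when strand information is ignored, with the observed fold change."""
    flagged = {}
    for gene, s in stranded.items():
        u = unstranded[gene]
        # Assumption: a pseudocount avoids division by zero for silent genes.
        ratio = (u + pseudocount) / (s + pseudocount)
        # Treat over- and under-estimation symmetrically.
        fold = max(ratio, 1.0 / ratio)
        if fold >= threshold:
            flagged[gene] = fold
    return flagged

# Hypothetical counts: the same reads assigned with and without strand awareness.
stranded = {"geneA": 99, "geneB": 49, "geneC": 9}
unstranded = {"geneA": 109, "geneB": 149, "geneC": 39}

flags = strand_bias_flags(stranded, unstranded)
print(sorted(flags))
```

In this toy input, geneA changes only 1.1-fold and is not flagged, while geneB and geneC change 3-fold and 4-fold. A check like this needs a stranded dataset as ground truth, which is why the abstract goes on to train a model that works on unstranded data alone.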

RNA sequencing analysis for profiling activation of cancer-associated molecular pathways.

Journal of Clinical Oncology ◽

10.1200/jco.2019.37.15_suppl.e13032 ◽

2019 ◽

Vol 37 (15_suppl) ◽

pp. e13032-e13032 ◽

Cited By ~ 2

Author(s):

Anton Buzdin ◽

Andrew Garazha ◽

Maxim Sorokin ◽

Alex Glusker ◽

Alexey Aleshin ◽

...

Keyword(s):

Gene Expression ◽

Original Data ◽

Tissue Expression ◽

Molecular Pathways ◽

Sequencing Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Healthy Human ◽

Tissue Samples ◽

Normal Tissues

e13032 Background: Intracellular molecular pathways (IMPs) control all major events in the living cell. They are considered hotspots in contemporary oncology because knowledge of IMP activation is essential for understanding mechanisms of molecular pathogenesis in oncology. Profiling IMPs requires RNA-seq data for tumors and for a collection of reference normal tissues. However, there is currently a shortage of such profiles for normal tissues from healthy human donors, uniformly profiled in a single series of experiments. The largest dataset of normal profiles, GTEx, is only partly accessible through dbGaP. In the TCGA database, normal samples are taken adjacent to surgically removed tumors and may be affected by tumor-linked growth factors, inflammation and altered vascularization. ENCODE datasets come from autopsies of normal tissues, but they cannot form statistically significant reference groups. Methods: Tissue samples representing 20 organs were taken post mortem from healthy human donors killed in road accidents, no later than 36 hours after death; blood samples were taken from healthy volunteers. Gene expression was profiled in RNA-seq experiments using the same reagents, equipment and protocols. Bioinformatic algorithms for IMP analysis were developed and validated using experimental and public gene expression datasets. Results: From original sequencing data we constructed the biggest fully open reference expression database of normal human tissues, comprising 465 profiles and termed the Oncobox Atlas of Normal Tissue Expression (ANTE; original data: GSE120795). We next developed a method termed Oncobox for interrogating activation of IMPs in human cancers. It includes modules of expression data harmonization and comparison and an algorithm for automatic annotation of molecular pathways. The Oncobox system enables accurate scoring of thousands of molecular pathways using RNA-seq data.
Oncobox pathway analysis is also applicable to quantitative proteomics and microRNA data in oncology. Conclusions: The Oncobox system can be used for a plethora of applications in cancer research, including finding differentially regulated genes and IMPs, and for discovery of new pathway-related diagnostic and prognostic biomarkers.

TSEBRA: transcript selector for BRAKER

BMC Bioinformatics ◽

10.1186/s12859-021-04482-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lars Gabriel ◽

Katharina J. Hoff ◽

Tomáš Brůna ◽

Mark Borodovsky ◽

Mario Stanke

Keyword(s):

Statistical Models ◽

Gene Prediction ◽

Software Tool ◽

Genome Project ◽

Rna Seq ◽

Protein Coding ◽

Homologous Protein ◽

Protein Coding Genes ◽

Overlapping Transcripts ◽

Eukaryotic Genomes

Abstract Background: BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and to pick the one output that is likely better. Therefore, one or the other type of extrinsic evidence would remain unexploited. Results: We present TSEBRA, software that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. Conclusion: TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.

Computational studies on RNA processing in higher eukaryotes

10.21248/gups.61384 ◽

2021 ◽

Author(s):

Mirko Brüggemann

Keyword(s):

Beta Cell ◽

Rna Processing ◽

Binding Sites ◽

Pancreatic Beta Cells ◽

Splicing Factor ◽

Sequencing Data ◽

Diabetes Susceptibility ◽

Protein Coding ◽

Protein Coding Genes ◽

Phd Thesis

Most cellular processes are regulated by RNA-binding proteins (RBPs). These RBPs usually use defined binding sites to recognize and directly interact with their target RNA molecule. Individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) experiments are an important tool to describe such interactions in cell cultures in vivo. This experimental protocol yields millions of individual sequencing reads from which the binding spectrum of the RBP under study can be deduced. In this PhD thesis I studied how RNA processing is driven by RBP binding by analyzing iCLIP-derived sequencing datasets. First, I described a complete data analysis pipeline to detect RBP binding sites from iCLIP sequencing reads. This workflow covers all essential processing steps, from the first quality control to the final annotation of binding sites. I described the accurate integration of biological iCLIP replicates to boost the initial peak calling step while ensuring high specificity through replicate reproducibility analysis. Further, I proposed a routine to level binding site width to streamline downstream analysis processes. This was exemplified in the re-analysis of the binding spectrum of the U2 small nuclear RNA auxiliary factor 2 (U2AF2, U2AF65). I recaptured the known dominance of U2AF65 binding to intronic sequences of protein-coding genes, where it likely recognizes the polypyrimidine tract as part of the core spliceosome machinery. In the second part of my thesis, I analyzed the binding spectrum of the serine and arginine rich splicing factor 6 (SRSF6) in the context of diabetes. In pancreatic beta-cells, the expression of SRSF6 is regulated by the transcription factor GLIS3, which is encoded by a diabetes susceptibility gene. It is known that SRSF6 promotes beta-cell death through the splicing dysregulation of genes essential to beta-cell function and survival.
However, the exact mechanism of how these RNAs are targeted by SRSF6 remains poorly understood. Here, I applied the defined iCLIP processing pipeline to describe the binding landscape of the splicing factor SRSF6 in the human pancreatic beta-cell line EndoC-βH1. The initial binding site definition revealed predominant binding to coding sequences (CDS) of protein-coding genes. This was followed up by extensive motif analysis, which revealed a purine-rich binding motif so far unknown in human. SRSF6 seemed to specifically recognize repetitions of the triplet GAA. I also showed that the number of contiguous triplets correlated with increasing binding site strength. I further integrated RNA-sequencing data from the same cell type, under SRSF6 knockdown (KD) and basal conditions, to analyze SRSF6-related splicing changes. I showed that the exact positioning of SRSF6 on alternatively spliced exons regulates the produced transcript isoforms. This mechanism seemed to control exons in several known susceptibility genes for diabetes. In summary, in my PhD thesis, I presented a comprehensive workflow for the processing of iCLIP-derived sequencing data. I applied this pipeline to a dataset from pancreatic beta-cells to unveil the impact of SRSF6-mediated splicing changes. Thus, my analysis provides novel insights into the regulation of diabetes susceptibility genes.
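
The observation about contiguous GAA triplets suggests a simple per-site feature: the length of the longest uninterrupted GAA repeat in a binding site sequence. The following is a minimal sketch of how such a feature could be computed, assuming binding sites are available as plain DNA strings. The function name `longest_gaa_run` and the example sequences are hypothetical, and this is not the motif analysis used in the thesis.

```python
import re

def longest_gaa_run(seq):
    """Return the largest number of back-to-back GAA triplets found in seq."""
    # (?:GAA)+ matches one maximal stretch of consecutive GAA triplets.
    runs = re.findall(r"(?:GAA)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)

# Hypothetical binding site sequences.
sites = {
    "site1": "CCGAAGAAGAATTGAACC",  # a run of three, then a single triplet
    "site2": "ttgaagaacc",          # lower case input, a run of two
    "site3": "ACGTACGT",            # no GAA at all
}

run_lengths = {name: longest_gaa_run(seq) for name, seq in sites.items()}
print(run_lengths)
```

With per-site run lengths in hand, one could rank them against a binding strength score (for example with a rank correlation) to examine the relationship described in the text.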

NASA GeneLab RNA-Seq Consensus Pipeline: Standardized Processing of Short-Read RNA-Seq Data

10.1101/2020.11.06.371724 ◽

2020 ◽

Author(s):

Eliah G. Overbey ◽

Amanda M. Saravia-Butler ◽

Zhe Zhang ◽

Komal S. Rathi ◽

Homer Fogle ◽

...

Keyword(s):

Expression Profiles ◽

Gene Expression Profiles ◽

Rna Seq ◽

Sequencing Data ◽

Analysis Pipeline ◽

Short Read ◽

Gene Quantification ◽

Working Groups ◽

Using Data ◽

Data Analysis Pipeline

Summary: With the development of transcriptomic technologies, we are able to quantify precise changes in gene expression profiles from astronauts and other organisms exposed to spaceflight. Members of NASA GeneLab and GeneLab-associated analysis working groups (AWGs) have developed a consensus pipeline for analyzing short-read RNA-sequencing data from spaceflight-associated experiments. The pipeline includes quality control, read trimming, mapping, and gene quantification steps, culminating in the detection of differentially expressed genes. This data analysis pipeline and the results of its execution using data submitted to GeneLab are now all publicly available through the GeneLab database. We present here the full details and rationale for the construction of this pipeline in order to promote transparency, reproducibility and reusability of pipeline data, to provide a template for data processing of future spaceflight-relevant datasets, and to encourage cross-analysis of data from other databases with the data available in GeneLab.

Complete Genome Sequence of Brevundimonas sp. Strain SGAir0440, Isolated from Indoor Air in Singapore

Microbiology Resource Announcements ◽

10.1128/mra.00594-19 ◽

2019 ◽

Vol 8 (31) ◽

Author(s):

Rikky W. Purbojati ◽

Daniela I. Drautz-Moses ◽

Akira Uchida ◽

Anthony Wong ◽

Megan E. Clare ◽

...

Keyword(s):

Single Molecule ◽

Indoor Air ◽

Complete Genome Sequence ◽

Complete Genome ◽

Sequencing Data ◽

Circular Chromosome ◽

Protein Coding ◽

Content Type ◽

Air Samples ◽

Protein Coding Genes

Brevundimonas sp. strain SGAir0440 was isolated from indoor air samples collected in Singapore. Its genome was assembled using single-molecule real-time sequencing data, resulting in one circular chromosome with a length of 3.1 Mbp. The genome consists of 3,033 protein-coding genes, 48 tRNAs, and 6 rRNA operons.
