VERSE: a versatile and efficient RNA-Seq read counting tool

Mapping Intimacies ◽

10.1101/053306 ◽

2016 ◽

Cited By ~ 10

Author(s):

Qin Zhu ◽

Stephen A Fisher ◽

Jamie Shallcross ◽

Junhyong Kim

Keyword(s):

Reference Genome ◽

Digital Gene Expression ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Gene Level ◽

Different Types ◽

Intergenic Regions ◽

Supplementary Material ◽

Assignment Scheme

Abstract Motivation: RNA-Seq is a powerful technology that delivers digital gene expression data. To measure expression strength at the gene level, one popular approach is direct read counting after aligning the reads to a reference genome/transcriptome. HTSeq is one of the most popular ways of counting reads, yet its slow running speed poses a bottleneck to many RNA-Seq pipelines. Gene level counting programs also lack a robust scheme for quantifying reads that map to non-exonic genomic features, such as intronic and intergenic regions, even though these reads are prevalent in most RNA-Seq data. Results: In this paper we present VERSE, an RNA-Seq read counting tool which builds upon the speed of featureCounts and implements the counting modes of HTSeq. VERSE is more than 30x faster than HTSeq when computing the same gene counts. VERSE also supports a hierarchical assignment scheme, which allows reads to be assigned uniquely and sequentially to different types of features according to user-defined priorities. Availability: VERSE is implemented in C. It is built on top of featureCounts. VERSE is open source and can be downloaded freely from GitHub (https://github.com/qinzhu/VERSE). Supplementary information: Tables and figures illustrating the counting modes implemented in VERSE and the differences between hierarchical and independent assignment.
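
The difference between hierarchical and independent assignment mentioned in the abstract can be illustrated with a small sketch. This is not VERSE's code or its command-line interface: the interval representation, the function names and the "unassigned" label are all assumptions made for illustration. In the hierarchical variant each read is counted once, for the highest-priority feature type it overlaps; in the independent variant a read is counted for every feature type it overlaps.

```python
from collections import Counter

# Hypothetical toy annotation: feature type -> list of (start, end), inclusive.
annotation = {
    "exon": [(100, 200)],
    "intron": [(201, 400)],
    "intergenic": [(401, 900)],
}

def overlapping_types(read, annotation):
    """Return the set of feature types that the read (start, end) overlaps."""
    start, end = read
    return {
        ftype
        for ftype, intervals in annotation.items()
        if any(start <= f_end and f_start <= end for f_start, f_end in intervals)
    }

def count_hierarchical(reads, annotation, priority):
    """Count each read once, for the first feature type in `priority` it overlaps."""
    counts = Counter()
    for read in reads:
        hits = overlapping_types(read, annotation)
        for ftype in priority:
            if ftype in hits:
                counts[ftype] += 1
                break
        else:
            # No feature type in the priority list overlaps this read.
            counts["unassigned"] += 1
    return counts

def count_independent(reads, annotation):
    """Count each read for every feature type it overlaps."""
    counts = Counter()
    for read in reads:
        counts.update(overlapping_types(read, annotation))
    return counts

reads = [(150, 190), (190, 230), (250, 300), (950, 990)]
hier = count_hierarchical(reads, annotation, ["exon", "intron", "intergenic"])
indep = count_independent(reads, annotation)
print(dict(hier))
print(dict(indep))
```

The read (190, 230) spans the exon/intron boundary: the hierarchical scheme counts it only as exonic because "exon" has the higher priority, while the independent scheme counts it for both types.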

The value of genotype-specific reference for transcriptome analyses

10.1101/2021.09.14.460213 ◽

2021 ◽

Author(s):

Wenbin Guo ◽

Max Coulter ◽

Robbie Waugh ◽

Runxuan Zhang

Keyword(s):

Alternative Splicing ◽

Reference Genome ◽

Transcriptome Assembly ◽

Specific Reference ◽

Rna Seq ◽

High Quality ◽

Common Reference ◽

Transcript Quantification ◽

Gene Level ◽

Reference Transcript

High quality transcriptome assembly using short reads from RNA-seq data still heavily relies upon reference-based approaches, of which the primary step is to align RNA-seq reads to a single reference genome of haploid sequence. However, it is increasingly apparent that while different genotypes within a species share core genes, they also contain variable numbers of specific genes that are only present in a subset of individuals. Using a common reference may thus lead to a loss of genotype-specific information in the assembled transcript dataset and the generation of erroneous, incomplete or misleading transcriptomics analysis results. With the recent development of pan-genome information in many species, it is important that we understand the limitations of single genotype references for transcriptomics analysis. In this study, we quantitatively evaluated the advantages of using genotype-specific reference genomes for transcriptome assembly and analysis using cultivated barley as a model. We mapped barley cultivar Barke RNA-seq reads to the Barke genome and to the cultivar Morex genome (common barley genome reference) to construct a genotype-specific Reference Transcript Dataset (sRTD) and a common Reference Transcript Dataset (cRTD), respectively. We compared the two RTDs according to their transcript diversity, transcript sequence and structure similarity and the accuracy they provided for transcript quantification and differential expression analysis. Our evaluation shows that the sRTD has a significantly higher diversity of transcripts and alternative splicing events. Despite using a high-quality reference genome for assembly of the cRTD, we miss ca. 40% of the transcripts present in the sRTD, and the cRTD has only ca. 70% true assemblies. We found that the sRTD is more accurate for transcript quantification as well as differential expression and differential alternative splicing analysis. However, gene level quantification and comparative expression analysis are less affected by the source RTD, which indicates that analysing transcriptomic data at the gene level may be a reasonable compromise when a high-quality genotype-specific reference is not available.

Differential transcript usage analysis of bulk and single-cell RNA-seq data with DTUrtle

Bioinformatics ◽

10.1093/bioinformatics/btab629 ◽

2021 ◽

Author(s):

Tobias Tekath ◽

Martin Dugas

Keyword(s):

Single Cell ◽

Transcript Level ◽

R Package ◽

Supplementary Information ◽

Data Sets ◽

Rna Seq ◽

Cell Type ◽

Gene Level ◽

Analysis Workflow ◽

Usage Analysis

Abstract Motivation Each year, the number of published bulk and single-cell RNA-seq data sets is growing exponentially. Studies analyzing such data are commonly looking at gene-level differences, while the collected RNA-seq data inherently represents reads of transcript isoform sequences. Utilizing transcriptomic quantifiers, RNA-seq reads can be attributed to specific isoforms, allowing for analysis of transcript-level differences. A differential transcript usage (DTU) analysis is testing for proportional differences in a gene’s transcript composition, and has been of rising interest for many research questions, such as analysis of differential splicing or cell type identification. Results We present the R package DTUrtle, the first DTU analysis workflow for both bulk and single-cell RNA-seq data sets, and the first package to conduct a ‘classical’ DTU analysis in a single-cell context. DTUrtle extends established statistical frameworks, offers various result aggregation and visualization options and a novel detection probability score for tagged-end data. It has been successfully applied to bulk and single-cell RNA-seq data of human and mouse, confirming and extending key results. Additionally, we present novel potential DTU applications like the identification of cell type specific transcript isoforms as biomarkers. Availability The R package DTUrtle is available at https://github.com/TobiTekath/DTUrtle with extensive vignettes and documentation at https://tobitekath.github.io/DTUrtle/. Supplementary information Supplementary data are available at Bioinformatics online.
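
The quantity that a DTU analysis looks at, the proportional composition of a gene's transcripts, can be sketched as follows. This is a minimal illustration and not the DTUrtle API (DTUrtle is an R package); the function name, the transcript and gene identifiers and the counts are hypothetical, and the sketch computes proportions only, without the statistical testing that an actual DTU analysis performs.

```python
from collections import defaultdict

def transcript_proportions(counts, tx2gene):
    """Return each transcript's share of the total count of its gene."""
    gene_totals = defaultdict(float)
    for tx, count in counts.items():
        gene_totals[tx2gene[tx]] += count
    return {
        tx: (count / gene_totals[tx2gene[tx]] if gene_totals[tx2gene[tx]] > 0 else 0.0)
        for tx, count in counts.items()
    }

# Hypothetical toy data: one gene with two isoforms in two conditions.
tx2gene = {"txA": "gene1", "txB": "gene1"}
control = {"txA": 80, "txB": 20}
treated = {"txA": 100, "txB": 100}

prop_control = transcript_proportions(control, tx2gene)
prop_treated = transcript_proportions(treated, tx2gene)

# Change in transcript usage between conditions (treated minus control).
shift = {tx: prop_treated[tx] - prop_control[tx] for tx in tx2gene}
print(shift)
```

In this toy example the gene-level count doubles (100 to 200), yet the more informative change for a DTU analysis is that txA drops from 80% to 50% of the gene's output, a shift that a gene-level comparison alone would not reveal.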

Siberian sturgeon multi-tissue reference transcriptome database

Database ◽

10.1093/database/baaa082 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Christophe Klopp ◽

Cédric Cabau ◽

Gonzalo Greif ◽

André Lasalle ◽

Santiago Di Landro ◽

...

Keyword(s):

Reference Genome ◽

Transcriptome Assembly ◽

Fish Farming ◽

Siberian Sturgeon ◽

Supplementary Information ◽

Rna Seq ◽

High Quality ◽

Reference Transcriptome ◽

Functional Studies ◽

Transcriptome Database

Abstract Motivation: Siberian sturgeon is a long-lived and late-maturing fish farmed for caviar production in 50 countries. Functional genomics makes it possible to find genes of interest for fish farming. In the absence of a reference genome, a reference transcriptome is very useful for sequencing-based functional studies. Results: We present here a high-quality transcriptome assembly database built using RNA-seq reads coming from brain, pituitary, gonadal, liver, stomach, kidney, anterior kidney, heart, embryonic and pre-larval tissues. It will facilitate crucial research on topics such as puberty, reproduction, growth, food intake and immunology. This database represents a major contribution to the publicly available sturgeon transcriptome reference datasets. Availability: The database is publicly available at http://siberiansturgeontissuedb.sigenae.org. Supplementary information: Supplementary data are available at Database online.

SIMLR: a tool for large-scale single-cell analysis by multi-kernel learning

10.1101/118901 ◽

2017 ◽

Cited By ~ 9

Author(s):

Bo Wang ◽

Daniele Ramazzotti ◽

Luca De Sano ◽

Junjie Zhu ◽

Emma Pierson ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Single Cell Analysis ◽

R Package ◽

Supplementary Information ◽

Cell Analysis ◽

Rna Seq ◽

A Cell ◽

Supplementary Material ◽

Public Datasets

Abstract Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better visualization. Availability and Implementation: SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package. Supplementary Information: Supplementary data are available at Bioinformatics online.

StereoGene: Rapid Estimation of Genomewide Correlation of Continuous or Interval Feature Data

10.1101/059584 ◽

2016 ◽

Author(s):

Elena D. Stavrovskaya ◽

Tejasvi Niranjan ◽

Elana J. Fertig ◽

Sarah J. Wheelan ◽

Alexander Favorov ◽

...

Keyword(s):

Partial Correlation ◽

Reference Genome ◽

Developmental Trajectories ◽

Supplementary Information ◽

Regulation Of Transcription ◽

Transcription Start ◽

Genomic Features ◽

Transcription Start Sites ◽

Supplementary Material ◽

Kernel Correlation

Abstract Motivation: Genomic features with similar genomewide distributions are generally hypothesized to be functionally related; for example, co-localization of histones and transcription start sites indicates chromatin regulation of transcription factor activity. Therefore, statistical algorithms to perform spatial, genomewide correlation among genomic features are required. Results: Here, we propose a method, StereoGene, that rapidly estimates genomewide correlation among pairs of genomic features. These features may represent high-throughput data mapped to a reference genome or sets of genomic annotations in that reference genome. StereoGene enables correlation of continuous data directly, avoiding data binarization and the subsequent data loss. Correlations are computed among neighboring genomic positions using kernel correlation. Representing the correlation as a function of the genome position, StereoGene outputs the local correlation track as part of the analysis. StereoGene also accounts for confounders such as input DNA by partial correlation. We apply our method to numerous comparisons of ChIP-Seq datasets from the Human Epigenome Atlas and FANTOM CAGE to demonstrate its wide applicability. We observe changes in the correlation between epigenomic features across developmental trajectories of several tissue types that are consistent with known biology, and find a novel spatial correlation of CAGE clusters with donor splice sites and with poly(A) sites. These analyses provide examples of the broad applicability of StereoGene for regulatory genomics. Availability: The StereoGene C++ source code, program documentation, Galaxy integration scripts and examples are available from the project homepage http://stereogene.bioinf.fbb.msu.ru/ Supplementary information: Supplementary data are available online.

MOSim: Multi-Omics Simulation in R

10.1101/421834 ◽

2018 ◽

Cited By ~ 5

Author(s):

Carlos Martínez-Mira ◽

Ana Conesa ◽

Sonia Tarazona

Keyword(s):

Time Series Data ◽

Simulated Data ◽

R Package ◽

Experimental Designs ◽

Supplementary Information ◽

Series Data ◽

Data Sets ◽

Expression Data ◽

Supplementary Material ◽

Omic Data

Abstract Motivation: As new integrative methodologies are being developed to analyse multi-omic experiments, validation strategies are required for benchmarking. In silico approaches such as simulated data are popular as they are fast and cheap. However, few tools are available for creating synthetic multi-omic data sets. Results: MOSim is a new R package for easily simulating multi-omic experiments consisting of gene expression data, other regulatory omics and the regulatory relationships between them. MOSim supports different experimental designs including time series data. Availability: The package is freely available under the GPL-3 license from the Bitbucket repository (https://bitbucket.org/ConesaLab/mosim/). Supplementary information: Supplementary material is available at bioRxiv online.

Metric Learning on Expression Data for Gene Function Prediction

10.1101/651042 ◽

2019 ◽

Author(s):

Stavros Makrodimitris ◽

Marcel J.T. Reinders ◽

Roeland C.H.J. van Ham

Keyword(s):

Pearson Correlation ◽

Metric Learning ◽

Specific Weight ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Experimental Conditions ◽

Expression Of Genes ◽

Guilt By Association ◽

Python Package

Abstract Motivation: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, using RNA-Seq datasets with many experimental conditions from diverse sources introduces batch effects and other artefacts that might obscure the real co-expression signal. Moreover, only a subset of experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that, when the purpose is to find similarly functioning genes, the co-expression of genes should be determined not on all samples but only on those samples informative for the GO term of interest. Results: To address both types of effects, we developed MLC (Metric Learning for Co-expression), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression, and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Availability: MLC is available as a Python package at www.github.com/stamakro/ Supplementary information: Supplementary data are available online.
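
A sample-weighted co-expression measure of the kind described above can be sketched with a weighted Pearson correlation. This is not the MLC package's API, and it does not learn anything: the weights, which MLC would learn per GO term, are simply given here, and the function name and toy expression values are hypothetical.

```python
import math

def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative per-sample weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for zero weighted variance")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression of two genes over four samples; the last sample
# behaves like an artefact that breaks an otherwise perfect co-expression.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [1.0, 2.0, 3.0, -10.0]

r_unweighted = weighted_pearson(gene_a, gene_b, [1.0, 1.0, 1.0, 1.0])
r_weighted = weighted_pearson(gene_a, gene_b, [1.0, 1.0, 1.0, 0.0])
print(r_unweighted, r_weighted)
```

With uniform weights the function reduces to the standard Pearson correlation, which the outlying sample drives negative; giving that sample zero weight recovers the co-expression seen in the remaining samples.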

QUBIC2: a novel and robust biclustering algorithm for analyses and interpretation of large-scale RNA-Seq data

Bioinformatics ◽

10.1093/bioinformatics/btz692 ◽

2019 ◽

Vol 36 (4) ◽

pp. 1143-1149 ◽

Cited By ~ 9

Author(s):

Juan Xie ◽

Anjun Ma ◽

Yu Zhang ◽

Bingqiang Liu ◽

Sha Cao ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Gaussian Model ◽

Functional Gene ◽

Superior Performance ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Gene Modules

Abstract Motivation The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not adequately address a comprehensive detection of all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive zero and low expression values are observed. Results We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to five other widely used algorithms on various benchmark datasets from E. coli, human and simulated data. QUBIC2 also showcased robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq. Availability and implementation The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2. Supplementary information Supplementary data are available at Bioinformatics online.

BYASE: a Python library for estimating gene and isoform level allele-specific expression

Bioinformatics ◽

10.1093/bioinformatics/btaa636 ◽

2020 ◽

Vol 36 (19) ◽

pp. 4955-4956

Author(s):

Lili Dong ◽

Jianan Wang ◽

Guohua Wang

Keyword(s):

Graphical User Interface ◽

Supplementary Information ◽

Rna Seq ◽

Biological Mechanisms ◽

Specific Expression ◽

Allele Specific Expression ◽

Source Codes ◽

Gene Level ◽

Allele Specific ◽

Python Package

Abstract Summary Allele-specific expression (ASE) is involved in many important biological mechanisms. We present a Python package BYASE and its graphical user interface (GUI) tool BYASE-GUI for the identification of ASE from single-end and paired-end RNA-seq data based on Bayesian inference, which can simultaneously report differences in gene-level and isoform-level expression. BYASE uses both phased SNPs and non-phased SNPs, and supports polyploid organisms. Availability and implementation The source code of BYASE and BYASE-GUI is freely available at https://github.com/ncjllld/byase and https://github.com/ncjllld/byase_gui. Supplementary information Supplementary data are available at Bioinformatics online.

FuSe: a tool to move RNA-Seq analyses from chromosomal/gene loci to functional grouping of mRNA transcripts

Bioinformatics ◽

10.1093/bioinformatics/btaa735 ◽

2020 ◽

Author(s):

Rajinder Gupta ◽

Yannick Schrooders ◽

Marcha Verheijen ◽

Adrian Roth ◽

Jos Kleinjans ◽

...

Keyword(s):

Transcript Level ◽

Supplementary Information ◽

Rna Seq ◽

Functional Changes ◽

Gene Level ◽

Secondary Structure Of Proteins ◽

The Impact ◽

Structure Of Proteins ◽

Functional Grouping

Abstract Summary Typical RNA sequencing (RNA-Seq) analyses are performed either at the gene level by summing all reads from the same locus, assuming that all transcripts from a gene make a protein, or at the transcript level, assuming that each transcript displays a unique function. However, these assumptions are flawed, as a gene can code for different types of transcripts and different transcripts are capable of synthesizing similar, different or no protein. As a consequence, functional changes are not well illustrated by either gene or transcript analyses. We propose to improve RNA-Seq analyses by grouping the transcripts based on their similar functions. We developed FuSe to predict functional similarities using the primary and secondary structure of proteins. To estimate the likelihood of proteins with similar functions, FuSe computes two confidence scores: knowledge (KS) and discovery (DS) for protein pairs. Overlapping protein pairs exhibiting high confidence are grouped to form ‘similar function protein groups’ and expression is calculated for each functional group. The impact of using FuSe is demonstrated on in vitro cells exposed to paracetamol, highlighting genes responsible for cell adhesion and glycogen regulation that were earlier shown not to be differentially expressed with traditional analysis methods. Availability and implementation The source code is available at https://github.com/rajinder4489/FuSe. Data for APAP exposure are available in the BioStudies database (http://www.ebi.ac.uk/biostudies) under accession numbers S-HECA143, S-HECA(158) and S-HECA139. Supplementary information Supplementary data are available at Bioinformatics online.
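
The grouping step described above can be sketched under stated assumptions. This is not FuSe's code: it assumes that "overlapping" pairs are pairs that share a protein, so that groups are the connected components of the high-confidence pair graph, and that group expression is the sum of member expression. The pairs are taken as already filtered for high confidence (how KS and DS are thresholded is not specified in the abstract), and all identifiers and values are hypothetical.

```python
def group_proteins(pairs):
    """Merge pairs that share a protein into groups (connected components)."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for protein in list(parent):
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in groups.values())

def group_expression(groups, expression):
    """Sum the expression of all members of each group (an assumed aggregation)."""
    return [sum(expression.get(p, 0.0) for p in members) for members in groups]

# Hypothetical high-confidence pairs and per-protein expression values.
high_confidence_pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
expression = {"P1": 10.0, "P2": 5.0, "P3": 1.0, "P4": 7.0, "P5": 3.0}

groups = group_proteins(high_confidence_pairs)
totals = group_expression(groups, expression)
print(groups, totals)
```

Because the pairs (P1, P2) and (P2, P3) share P2, the three proteins end up in one group even though P1 and P3 were never paired directly; whether FuSe merges groups transitively in this way is an assumption of the sketch.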
