OUTRIDER: A statistical method for detecting aberrantly expressed genes in RNA sequencing data

10.1101/322149 ◽

2018 ◽

Cited By ~ 2

Author(s):

Felix Brechtmann ◽

Agnė Matusevičiūtė ◽

Christian Mertes ◽

Vicente A Yépez ◽

Žiga Avsec ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Negative Binomial ◽

Statistical Significance ◽

P Value ◽

Rna Seq ◽

Sequencing Data ◽

Data Set ◽

Aberrant Gene Expression ◽

Aberrant Gene

Abstract: RNA sequencing (RNA-seq) is gaining popularity as a complementary assay to genome sequencing for precisely identifying the molecular causes of rare disorders. A powerful approach is to identify aberrant gene expression levels as potential pathogenic events. However, existing methods for detecting aberrant read counts in RNA-seq data either lack assessments of statistical significance, so that establishing cutoffs is arbitrary, or rely on subjective manual corrections for confounders. Here, we describe OUTRIDER (OUTlier in RNA-seq fInDER), an algorithm developed to address these issues. The algorithm uses an autoencoder to model read count expectations according to the co-variation among genes resulting from technical, environmental, or common genetic variations. Given these expectations, the RNA-seq read counts are assumed to follow a negative binomial distribution with a gene-specific dispersion. Outliers are then identified as read counts that significantly deviate from this distribution. The model is automatically fitted to achieve the best correction of artificially corrupted data. Precision–recall analyses using simulated outlier read counts demonstrated the importance of combining correction for co-variation and significance-based thresholds. OUTRIDER is open source and includes functions for filtering out genes not expressed in a data set, for identifying outlier samples with too many aberrantly expressed genes, and for the P-value-based detection of aberrant gene expression, with false discovery rate adjustment. Overall, OUTRIDER provides a computationally fast and scalable end-to-end solution for identifying aberrantly expressed genes, suitable for use by rare disease diagnostic platforms.
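
The significance step described in this abstract can be illustrated with a minimal Python sketch. It is not the OUTRIDER implementation. It assumes the expected count `mu` is already available (in OUTRIDER it comes from the autoencoder, which is not reproduced here), uses one common two-sided P-value convention for discrete distributions, and applies the Benjamini-Hochberg procedure as a stand-in, since the abstract does not say which false discovery rate adjustment is used. All function names are hypothetical.

```python
import math


def nb_pmf(k, mu, theta):
    """Negative binomial probability of count k, given mean mu and dispersion theta."""
    if mu <= 0 or theta <= 0:
        raise ValueError("mu and theta must be positive")
    log_p = (
        math.lgamma(k + theta) - math.lgamma(k + 1) - math.lgamma(theta)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def nb_cdf(k, mu, theta):
    """P(X <= k), by direct summation (adequate for moderate counts)."""
    if k < 0:
        return 0.0
    return min(1.0, sum(nb_pmf(i, mu, theta) for i in range(k + 1)))


def two_sided_pvalue(k, mu, theta):
    """Two-sided P-value: twice the smaller tail, capped at 1 (assumed convention)."""
    lower = nb_cdf(k, mu, theta)            # P(X <= k)
    upper = 1.0 - nb_cdf(k - 1, mu, theta)  # P(X >= k); may round to 0 far in the tail
    return 2.0 * min(0.5, lower, upper)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical observed counts for one gene across four samples.
    observed = [95, 110, 0, 102]
    pvals = [two_sided_pvalue(k, mu=100.0, theta=10.0) for k in observed]
    print(benjamini_hochberg(pvals))
```

In this toy input the third sample, with zero reads against an expectation of 100, is the only one that would be flagged. A gene-specific `theta` would be estimated from the data in practice; here it is fixed by hand.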

Detection of aberrant gene expression events in RNA sequencing data

Nature Protocols ◽

10.1038/s41596-020-00462-5 ◽

2021 ◽

Vol 16 (2) ◽

pp. 1276-1296

Author(s):

Vicente A. Yépez ◽

Christian Mertes ◽

Michaela F. Müller ◽

Daniela Klaproth-Andrade ◽

Leonhard Wachutka ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Sequencing Data ◽

Aberrant Gene Expression ◽

Aberrant Gene

SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis

10.1101/021915 ◽

2015 ◽

Author(s):

Benjamin K Johnson ◽

Matthew B Scholz ◽

Tracy K Teal ◽

Robert B Abramovitch

Keyword(s):

Gene Expression ◽

Differential Gene Expression ◽

Quality Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Data Set ◽

Bacterial Rna ◽

Analysis Workflow ◽

Differential Gene ◽

Reference Counting

Summary: SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots. The workflow is implemented in Python for file management and sequential execution of each analysis step and is available for Mac OS X, Microsoft Windows, and Linux. To promote the use of SPARTA as a teaching platform, a web-based tutorial is available explaining how RNA-seq data are processed and analyzed by the software. Availability and Implementation: Tutorial and workflow can be found at sparta.readthedocs.org. Teaching materials are located at sparta-teaching.readthedocs.org. Source code can be downloaded at www.github.com/abramovitchMSU/, implemented in Python and supported on Mac OS X, Linux, and MS Windows. Contact: Robert B. Abramovitch. Supplemental Information: Supplementary data are available online.

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽

2018 ◽

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of RNA-seq datasets as well as the performance of the myriad methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

Transcriptome diversity is a systematic source of bias in RNA-sequencing data

10.1101/2021.04.27.441712 ◽

2021 ◽

Author(s):

Pablo E. García-Nieto ◽

Ban Wang ◽

Hunter B. Fraser

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Systematic Bias ◽

Simple Explanation ◽

Rna Seq ◽

Sequencing Data ◽

Biological Variables ◽

Systematic Effects ◽

Standard Practices ◽

Transcriptome Diversity

Abstract Background: RNA sequencing has been widely used as an essential tool to probe gene expression. While standard practices have been established to analyze RNA-seq data, it is still challenging to detect and remove artifactual signals. Several factors such as sex, age, and sequencing technology have been found to bias these estimates. Probabilistic estimation of expression residuals (PEER) has been used to account for some systematic effects, but it has remained challenging to interpret these PEER factors. Results: Here we show that transcriptome diversity – a simple metric based on Shannon entropy – explains a large portion of variability in gene expression, and is a major factor detected by PEER. We then show that transcriptome diversity has significant associations with multiple technical and biological variables across diverse organisms and datasets. This prevalent confounding factor provides a simple explanation for a major source of systematic biases in gene expression estimates. Conclusions: Our results show that transcriptome diversity is a metric that captures a systematic bias in RNA-seq and is the strongest known factor encoded in PEER covariates.
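
As a rough illustration of a Shannon-entropy-based diversity metric of the kind this abstract describes, the Python sketch below computes the entropy of one sample's expression profile. The abstract does not give the exact definition or normalization used by the authors, so the choice of natural logarithms and the function name `shannon_entropy` are assumptions, and the counts are made up.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in nats) of a sample's per-gene expression counts.

    Each gene's share of the total expression is treated as a probability.
    Genes with zero counts contribute nothing to the sum.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p)
    return entropy


if __name__ == "__main__":
    # Hypothetical samples: expression spread evenly vs. dominated by one gene.
    even_sample = [10, 10, 10, 10]
    skewed_sample = [97, 1, 1, 1]
    print(shannon_entropy(even_sample))    # highest possible for four genes
    print(shannon_entropy(skewed_sample))  # much lower
```

A sample whose reads are spread evenly across genes scores high, while a sample dominated by a few highly expressed genes scores low, which is the sense in which the metric summarizes a profile in one number.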

SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

10.1101/677740 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alemu Takele Assefa ◽

Jo Vandesompele ◽

Olivier Thas

Keyword(s):

Gene Expression ◽

Single Cell ◽

Rna Sequencing ◽

Empirical Distribution ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Actual Distribution ◽

Wide Range ◽

Single Cell Rna Sequencing

Summary: SPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the data characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts. Availability and implementation: The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information: Supplementary data are available at bioRxiv online.

Machine Learning-Assisted Identification of Factors Contributing to the Technical Variability Between Bulk and Single-Cell RNA-Seq Experiments

10.21203/rs.3.rs-1247889/v1 ◽

2022 ◽

Author(s):

Sofya Lipnitskaya ◽

Yang Shen ◽

Stefan Legewie ◽

Holger Klein ◽

Kolja Becker

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Single Cell ◽

Rna Sequencing ◽

Quantitative Difference ◽

Rna Seq ◽

Sequencing Data ◽

Factors Affecting ◽

Expression Variability ◽

Technical Variability

Abstract Background: Recent studies in the area of transcriptomics performed on single-cell and population levels reveal noticeable variability in gene expression measurements provided by different RNA sequencing technologies. Due to increased noise and complexity of single-cell RNA-Seq (scRNA-Seq) data over the bulk experiment, there is a substantial number of variably-expressed genes and so-called dropouts, challenging the subsequent computational analysis and potentially leading to false positive discoveries. In order to investigate factors affecting technical variability between RNA sequencing experiments of different technologies, we performed a systematic assessment of single-cell and bulk RNA-Seq data, which have undergone the same pre-processing and sample preparation procedures. Results: Our analysis indicates that variability between gene expression measurements as well as dropout events are not exclusively caused by biological variability, low expression levels, or random variation. Furthermore, we propose FAVSeq, a machine learning-assisted pipeline for detection of factors contributing to gene expression variability in matched RNA-Seq data provided by two technologies. Based on the analysis of the matched bulk and single-cell dataset, we found the 3'-UTR and transcript lengths as the most relevant effectors of the observed variation between RNA-Seq experiments, while the same factors together with cellular compartments were shown to be associated with dropouts. Conclusions: Here, we investigated the sources of variation in RNA-Seq profiles of matched single-cell and bulk experiments. In addition, we proposed the FAVSeq pipeline for analyzing multimodal RNA sequencing data, which allowed us to identify factors affecting the quantitative difference in gene expression measurements as well as the presence of dropouts. The derived knowledge can be employed further to improve the interpretation of RNA-Seq data and identify genes that can be affected by assay-based deviations. Source code is available under the MIT license at https://github.com/slipnitskaya/FAVSeq.

Exploiting aberrant mRNA expression in autism for gene discovery and diagnosis

10.1101/029488 ◽

2015 ◽

Author(s):

Jinting Guan ◽

Ence Yang ◽

Jizhou Yang ◽

Yong Zeng ◽

Guoli Ji ◽

...

Keyword(s):

Gene Expression ◽

Expression Analysis ◽

Genetic Heterogeneity ◽

Gene Expression Analysis ◽

Autism Spectrum ◽

Data Set ◽

Expression Variability ◽

Aberrant Gene Expression ◽

Gene Sets ◽

Aberrant Gene

Abstract: Autism spectrum disorder (ASD) is characterized by substantial phenotypic and genetic heterogeneity, which greatly complicates the identification of genetic factors that contribute to the disease. Study designs have mainly focused on group differences between cases and controls. The problem is that, by their nature, group difference-based methods (e.g., differential expression analysis) blur or collapse the heterogeneity within groups. By ignoring genes with variable within-group expression, an important axis of genetic heterogeneity contributing to expression variability among affected individuals has been overlooked. To address this, we develop a new gene expression analysis method, aberrant gene expression analysis, based on the multivariate distance commonly used for outlier detection. Our method detects the discrepancies in gene expression dispersion between groups and identifies genes with significantly different expression variability. Using this new method, we re-visited RNA sequencing data generated from post-mortem brain tissues of 47 ASD and 57 control samples. We identified 54 functional gene sets whose expression dispersion in ASD samples is more pronounced than that in controls, as well as 76 co-expression modules present in controls but absent in ASD samples due to ASD-specific aberrant gene expression. We also exploited aberrantly expressed genes as biomarkers for ASD diagnosis. With a whole blood expression data set, we identified three aberrantly expressed gene sets whose expression levels serve as discriminating variables achieving >70% classification accuracy. In summary, our method represents a novel discovery and diagnostic strategy for ASD. Our findings may help open an expression variability-centered research avenue for other genetically heterogeneous disorders.

TEffectR: an R package for studying the potential effects of transposable elements on gene expression with linear regression model

PeerJ ◽

10.7717/peerj.8192 ◽

2019 ◽

Vol 7 ◽

pp. e8192 ◽

Cited By ~ 4

Author(s):

Gökhan Karakülah ◽

Nazmiye Arslan ◽

Cihangir Yandım ◽

Aslı Suner

Keyword(s):

Gene Expression ◽

Linear Regression ◽

Transposable Elements ◽

Regression Model ◽

Rna Sequencing ◽

Linear Regression Model ◽

R Package ◽

Breast Cancer Patients ◽

Sequencing Data ◽

Data Set

Introduction: Recent studies highlight the crucial regulatory roles of transposable elements (TEs) on proximal gene expression in distinct biological contexts such as disease and development. However, computational tools extracting potential TE–proximal gene expression associations from RNA-sequencing data are still missing. Implementation: Herein, we developed a novel R package, using a linear regression model, for studying the potential influence of TE species on proximal gene expression from a given RNA-sequencing data set. Our R package, namely TEffectR, makes use of publicly available RepeatMasker TE and Ensembl gene annotations as well as several functions of other R-packages. It calculates total read counts of TEs from sorted and indexed genome aligned BAM files provided by the user, and determines statistically significant relations between TE expression and the transcription of nearby genes under diverse biological conditions. Availability: TEffectR is freely available at https://github.com/karakulahg/TEffectR along with a handy tutorial as exemplified by the analysis of RNA-sequencing data including normal and tumour tissue specimens obtained from breast cancer patients.
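
The idea of relating TE expression to the expression of a nearby gene with a linear regression model can be sketched in a few lines of Python. This is not TEffectR's API (the package is written in R and its exact model is not given in the abstract). The sketch assumes a single predictor fitted by ordinary least squares; the function name `fit_simple_ols` and the example values are hypothetical.

```python
def fit_simple_ols(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Hypothetical normalized expression values across five samples.
    te_expression = [1.0, 2.0, 3.0, 4.0, 5.0]
    gene_expression = [2.1, 3.9, 6.2, 7.8, 10.1]
    print(fit_simple_ols(te_expression, gene_expression))
```

A real analysis would also test whether the slope differs significantly from zero and would typically include covariates such as tissue type, which this sketch leaves out.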

Transcriptome RNA Sequencing Data Set of Gene Expression in Moraxella catarrhalis On- and Off-Phase Variants of the Type III DNA Methyltransferase ModM3

Microbiology Resource Announcements ◽

10.1128/mra.01559-19 ◽

2020 ◽

Vol 9 (14) ◽

Author(s):

Luke V. Blakeway ◽

Aimee Tan ◽

Ian R. Peak ◽

John M. Atack ◽

Kate L. Seib

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Moraxella Catarrhalis ◽

Dna Methyltransferase ◽

Global Gene Expression ◽

Chronic Obstructive ◽

Sequencing Data ◽

Data Set ◽

Obstructive Pulmonary Disease ◽

Content Type

Moraxella catarrhalis is a leading bacterial cause of otitis media and exacerbations of chronic obstructive pulmonary disease. Here, we announce a transcriptome RNA sequencing data set detailing global gene expression in two M. catarrhalis CCRI-195ME variants with expression of the DNA methyltransferase ModM3 phase varied either on or off.

Combining DGE and RNA-sequencing data to identify new polyA+ non-coding transcripts in the human genome

Nucleic Acids Research ◽

10.1093/nar/gkt1300 ◽

2013 ◽

Vol 42 (5) ◽

pp. 2820-2832 ◽

Cited By ~ 14

Author(s):

Nicolas Philippe ◽

Elias Bou Samra ◽

Anthony Boureux ◽

Alban Mancheron ◽

Florence Rufflé ◽

...

Keyword(s):

Human Genome ◽

Rna Sequencing ◽

Dynamic Range ◽

Tiling Array ◽

Expression Data ◽

Rna Seq ◽

Sequencing Data ◽

Data Set ◽

Protein Coding ◽

Protein Coding Genes

Abstract Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as ‘TranscriRef’). We then annotated 750 000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34 000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.
