Gene Set Analysis Using Spatial Statistics

Angela L. Riffo-Campos; Guillermo Ayala; Francisco Montes

doi:10.3390/math9050521

Mathematics ◽

2021 ◽

Vol 9 (5) ◽

pp. 521

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Differential Expression Analysis ◽

Gene Set Analysis ◽

Point Pattern ◽

Rna Seq ◽

P Values ◽

Gene Set ◽

Gene Differential Expression ◽

Per Gene

Gene differential expression is the study of the possible association between gene expression, measured with technologies such as DNA microarrays or RNA-Seq, and the phenotype. This can be performed marginally for each gene (differential gene expression) or using a gene set collection (gene set analysis). A prior (marginal) per-gene analysis of differential expression is usually performed to obtain a set of significant genes or marginal p-values that are later used to study the association between phenotype and gene expression. This paper proposes the use of spatial statistics methods for testing gene set differential expression using paired samples of RNA-Seq counts. This approach is not based on a prior per-gene differential expression analysis. Instead, we compare the paired counts within each sample/control pair using a binomial test. Each pair produces one p-value per gene, so the gene expression profile is transformed into a vector of p-values, which is treated as an event belonging to a point pattern. This is the first component of a bivariate point pattern. The second component is generated by applying two different randomization distributions to the correspondence between samples and treatment. The self-contained null hypothesis considered in gene set analysis can be formulated in terms of the associated point pattern as a random labeling of the bivariate point pattern. The gene sets were defined by Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The proposed methodology was tested on four RNA-Seq datasets of colorectal cancer (CRC) patients, and the results were contrasted with those obtained using the edgeR-GOseq pipeline. The proposed methodology has proved to be consistent at the biological and statistical level, in particular using the Cuzick and Edwards test with one realization of the second component and the between-pair distribution.
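The per-pair binomial comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes that the two samples of a pair have comparable sequencing depth, so that under the null hypothesis each read of a gene falls in the case sample with probability 0.5. It is intended for small counts only, since it sums exact binomial coefficients.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for H0: success probability = 0.5.

    k is the count observed in one sample of the pair, n the total
    count over both samples (assumption: equal sequencing depth).
    """
    if n == 0:
        return 1.0  # no reads in either sample: no evidence against H0
    lower = sum(comb(n, i) for i in range(0, k + 1))  # weight of X <= k
    upper = sum(comb(n, i) for i in range(k, n + 1))  # weight of X >= k
    # The null distribution is symmetric, so double the smaller tail.
    return min(1.0, 2 * min(lower, upper) / 2 ** n)


def pair_p_values(case_counts, control_counts):
    """One p-value per case/control pair for a single gene."""
    return [
        binomial_two_sided_p(case, case + control)
        for case, control in zip(case_counts, control_counts)
    ]


# Hypothetical counts for one gene in three case/control pairs.
p_vector = pair_p_values([0, 5, 12], [10, 5, 3])
print(p_vector)
```

The resulting list is the vector of p-values that the abstract treats as one event of the point pattern; the later spatial-statistics steps are not covered by this sketch.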

Reply to “A discriminative learning approach to differential expression analysis for single-cell RNA-seq”

10.1101/648733 ◽

2019 ◽

Author(s):

Etienne Becht ◽

Edward Zhao ◽

Robert Amezquita ◽

Raphael Gottardo

Keyword(s):

Single Cell ◽

Differential Expression ◽

Differential Expression Analysis ◽

Discriminative Learning ◽

Learning Approach ◽

Rna Seq ◽

Multivariate Logistic Regression ◽

P Values ◽

Gene Differential Expression ◽

Better Than

Multivariate logistic regression (mLR) was recently proposed by Ntranos et al. for gene differential expression analysis of single-cell RNA-sequencing (scRNAseq) data. Herein we reproduce and extend some of their findings. Notably, we show that while mLR performs better on simulated datasets, these simulations do not recapitulate important features of experimental datasets. Indeed, our results suggest that MAST followed by Sidak aggregation of the p-values performs better than mLR on experimental datasets. Overall, we highlight that most of the new results obtained by Ntranos et al. are likely due to the quantification of scRNAseq data at the level of transcripts or transcript compatibility classes, rather than to the use of mLR.
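As an illustration of the aggregation step mentioned here, the following Python sketch applies the standard Šidák correction to the smallest of a gene's transcript-level p-values. The function name is hypothetical, and the abstract does not state the exact variant the authors used, so this should be read as the textbook formula rather than their code.

```python
def sidak_aggregate(p_values):
    """Combine m p-values into one via the Sidak-corrected minimum.

    Returns 1 - (1 - min(p))^m, the probability that the smallest of
    m independent uniform p-values is at most min(p).
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    m = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** m


# Hypothetical transcript-level p-values for one gene.
gene_p = sidak_aggregate([0.01, 0.20, 0.75])
print(gene_p)
```

The correction assumes independent tests; with correlated transcripts it is only approximate.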

Easy and efficient ensemble gene set testing with EGSEA

F1000Research ◽

10.12688/f1000research.12544.1 ◽

2017 ◽

Vol 6 ◽

pp. 2010 ◽

Cited By ~ 17

Author(s):

Monther Alhamdoosh ◽

Charity W. Law ◽

Luyi Tian ◽

Julie M. Sheridan ◽

Milica Ng ◽

...

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Enrichment Analysis ◽

Gene Set Enrichment Analysis ◽

Gene Set ◽

P Gene ◽

Wide Range ◽

Gene Set Testing

Gene set enrichment analysis is a popular approach for prioritising the biological processes perturbed in genomic datasets. The Bioconductor project hosts over 80 software packages capable of gene set analysis. Most of these packages search for enriched signatures amongst differentially regulated genes to reveal higher level biological themes that may be missed when focusing only on evidence from individual genes. With so many different methods on offer, choosing the best algorithm and visualization approach can be challenging. The EGSEA package solves this problem by combining results from up to 12 prominent gene set testing algorithms to obtain a consensus ranking of biologically relevant results. This workflow demonstrates how EGSEA can extend limma-based differential expression analyses for RNA-seq and microarray data using experiments that profile 3 distinct cell populations important for studying the origins of breast cancer. Following data normalization and set-up of an appropriate linear model for differential expression analysis, EGSEA builds gene signature specific indexes that link a wide range of mouse or human gene set collections obtained from MSigDB, GeneSetDB and KEGG to the gene expression data being investigated. EGSEA is then configured and the ensemble enrichment analysis run, returning an object that can be queried using several S4 methods for ranking gene sets and visualizing results via heatmaps, KEGG pathway views, GO graphs, scatter plots and bar plots. Finally, an HTML report that combines these displays can fast-track the sharing of results with collaborators, and thus expedite downstream biological validation. EGSEA is simple to use and can be easily integrated with existing gene expression analysis pipelines for both human and mouse data.

A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data

PLoS ONE ◽

10.1371/journal.pone.0176185 ◽

2017 ◽

Vol 12 (5) ◽

pp. e0176185 ◽

Cited By ~ 32

Author(s):

Xiaohong Li ◽

Guy N. Brock ◽

Eric C. Rouchka ◽

Nigel G. F. Cooper ◽

Dongfeng Wu ◽

...

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Rna Seq ◽

Gene Normalization ◽

Normalization Methods ◽

Global Scaling ◽

Per Gene

Identifying stably expressed genes from multiple RNA-Seq data sets

PeerJ ◽

10.7717/peerj.2791 ◽

2016 ◽

Vol 4 ◽

pp. e2791 ◽

Cited By ~ 6

Author(s):

Bin Zhuo ◽

Sarah Emerson ◽

Jeff H. Chang ◽

Yanming Di

Keyword(s):

Differential Expression ◽

Reference Gene ◽

Biological Samples ◽

Differential Expression Analysis ◽

Variance Component Analysis ◽

Linear Mixed Effect Model ◽

Rna Seq ◽

Mixed Effect ◽

Gene Set ◽

Common Reference

We examined RNA-Seq data on 211 biological samples from 24 different Arabidopsis experiments carried out by different labs. We grouped the samples according to tissue types, and in each of the groups, we identified genes that are stably expressed across biological samples, treatment conditions, and experiments. We fit a Poisson log-linear mixed-effect model to the read counts for each gene and decomposed the total variance into between-sample, between-treatment and between-experiment variance components. Identifying stably expressed genes is useful for count normalization and differential expression analysis. The variance component analysis that we explore here is a first step towards understanding the sources and nature of the RNA-Seq count variation. When using a numerical measure to identify stably expressed genes, the outcome depends on multiple factors: the background sample set and the reference gene set used for count normalization, the technology used for measuring gene expression, and the specific numerical stability measure used. Since differential expression (DE) is measured by relative frequencies, we argue that DE is a relative concept. We advocate using an explicit reference gene set for count normalization to improve interpretability of DE results, and recommend using a common reference gene set when analyzing multiple RNA-Seq experiments to avoid potential inconsistent conclusions.
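The idea of normalizing against an explicit reference gene set can be sketched in a few lines of Python. This is only an illustration of the general concept, not the Poisson log-linear mixed-effect model the authors fit; the function name and the gene identifiers are hypothetical.

```python
def normalize_to_reference(sample_counts, reference_genes):
    """Express each gene's count relative to the total count of the
    reference gene set within the same sample."""
    ref_total = sum(sample_counts.get(gene, 0) for gene in reference_genes)
    if ref_total == 0:
        raise ValueError("reference genes have no reads in this sample")
    return {gene: count / ref_total for gene, count in sample_counts.items()}


# Hypothetical read counts for one sample; R1 and R2 form the reference set.
sample = {"A": 10, "B": 30, "R1": 50, "R2": 50}
relative = normalize_to_reference(sample, {"R1", "R2"})
print(relative)
```

Because every value is a ratio to the same reference total, two experiments normalized against a common reference set yield directly comparable relative expression values, which is the interpretability argument made in the abstract.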

Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA–sequencing data

10.1101/220129 ◽

2017 ◽

Cited By ~ 2

Author(s):

Alemu Takele Assefa ◽

Katrijn De Paepe ◽

Celine Everaert ◽

Pieter Mestdagh ◽

Olivier Thas ◽

...

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Expression Analysis ◽

Web Application ◽

Empirical Bayes ◽

Performance Metrics ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Normalization Methods

Background: Protein-coding RNAs (mRNA) have been the primary target of most transcriptome studies in the past, but in recent years, attention has expanded to include long non-coding RNAs (lncRNA). lncRNAs are typically expressed at low levels and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data, along with their normalization methods, is comprehensively evaluated, with a particular focus on lncRNAs and low-abundance mRNAs.

Results: Thirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures were used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNA and lncRNA separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and analyses of benchmark RNA-seq datasets. No single tool uniformly outperformed the others.

Conclusion: Overall, linear modeling with empirical Bayes moderation (limma) and the nonparametric approach (SAMSeq) showed the best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, to achieve a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in realistic clinical settings such as cancer research. About half of the methods showed a severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/

Assembly-free rapid differential gene expression analysis in non-model organisms using DNA-protein alignment

10.1101/2021.04.23.441097 ◽

2021 ◽

Author(s):

Anish M.S. Shrestha ◽

Joyce Emlyn B. Guiao ◽

Kyle Christian R. Santiago

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Expression Analysis ◽

De Novo ◽

Transcriptome Assembly ◽

Differential Expression Analysis ◽

Homology Search ◽

Model Organisms ◽

Rna Seq ◽

Protein Database

RNA-seq is being increasingly adopted for gene expression studies in a panoply of non-model organisms, with applications spanning the fields of agriculture, aquaculture, ecology, and environment. Conventional differential expression analysis for organisms without reference sequences requires performing computationally expensive and error-prone de-novo transcriptome assembly, followed by homology search against a high-confidence protein database for functional annotation. We propose a shortcut, where we obtain counts for differential expression analysis by directly aligning RNA-seq reads to the protein database. Through experiments on simulated and real data, we show drastic reductions in run-time and memory usage, with no loss in accuracy. A Snakemake implementation of our workflow is available at: https://bitbucket.org/project_samar/samar

Meta-analysis of RNA-seq studies reveals genes responsible for life stage-dominant functions in Schistosoma mansoni

10.1101/308189 ◽

2018 ◽

Cited By ~ 3

Author(s):

Zhigang Lu ◽

Matthew Berriman

Keyword(s):

Gene Expression ◽

Schistosoma Mansoni ◽

Differential Expression ◽

Meta Analysis ◽

Life Stage ◽

Expression Patterns ◽

Differential Expression Analysis ◽

Housekeeping Genes ◽

Life Stages ◽

Rna Seq

Background: Since the genome of the parasitic flatworm Schistosoma mansoni was sequenced in 2009, various RNA-seq studies have been conducted to investigate differential gene expression between certain life stages. Based on these studies, an overview of gene expression in all life stages can improve our understanding of S. mansoni genome biology.

Methods: Publicly available RNA-seq data covering all life stages and gonads were mapped to the latest S. mansoni genome. Read counts were normalised across all samples, and differential expression analysis was performed using the generalized linear model (GLM) approach.

Results: We revealed for the first time the dissimilarities among all life stages. Genes that are abundantly expressed in all life stages, as well as those preferentially expressed in certain stage(s), were determined. The latter reveal genes responsible for stage-dominant functions of the parasite, which can guide the investigation and annotation of gene functions. In addition, distinct differential expression patterns were observed between adjacent life stages, which not only correlate well with the original individual studies but also provide additional information on changes in gene expression during parasite transitions. Furthermore, thirteen novel housekeeping genes across all life stages were identified, which is valuable for quantitative studies (e.g., qPCR).

Conclusions: The meta-analysis provides valuable information on the expression and potential functions of S. mansoni genes across all life stages, and can facilitate basic as well as applied research for the community.

Identification of differentially distributed gene expression and distinct sets of cancer-related genes identified by changes in expression mean and variability

10.1101/2021.02.15.431343 ◽

2021 ◽

Author(s):

Aedan G. K. Roberts ◽

Daniel R. Catchpoole ◽

Paul J. Kennedy

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

The Cancer Genome Atlas ◽

Distribution Analysis ◽

Cancer Genes ◽

Rna Seq ◽

Differential Distribution ◽

Expression Variability

Background: Differential expression analysis of RNA-seq data has advanced rapidly since the introduction of the technology, and methods such as edgeR and DESeq2 have become standard parts of analysis pipelines. However, there is a growing body of research showing that differences in variability of gene expression or overall differences in the distribution of expression values (differential distribution) are also important both in normal biology and in diseases including cancer. Genes whose expression differs in distribution without a difference in mean expression level are ignored by differential expression methods.

Results: We have developed a Bayesian hierarchical model which improves on existing methods for identifying differential dispersion in RNA-seq data, and provides an overall test for differential distribution. We have applied these methods to investigate differential dispersion and distribution in cancer using RNA-seq datasets from The Cancer Genome Atlas. Our results show that differential dispersion and distribution are able to identify cancer-related genes. Further, we find that differential dispersion identifies cancer-related genes that are missed by differential expression analysis, and that differential expression and differential dispersion identify functionally distinct sets of genes.

Conclusion: This work highlights the importance of considering changes beyond differences in mean in the analysis of gene expression data, and suggests that analysis of expression variability may provide insights into genetic aspects of cancer that would not be revealed by differential expression analysis alone. For identification of cancer-related genes, differential distribution analysis allows the identification of genes whose expression is disrupted in terms of either mean or variability.

Individual Level Differential Expression Analysis for Single Cell RNA-seq data

10.1101/2021.05.10.443350 ◽

2021 ◽

Author(s):

Mengqi Zhang ◽

Si Liu ◽

Zhen Miao ◽

Fang Han ◽

Raphael Gottardo ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Bulk Solution ◽

Rna Seq ◽

Cell Level ◽

Individual Level ◽

Level Data

Bulk RNA-seq data quantify the expression of a gene in an individual by one number (e.g., fragment count). In contrast, single cell RNA-seq (scRNA-seq) data provide much richer information: the distribution of gene expression across many cells. To assess differential expression across individuals using scRNA-seq data, a straightforward solution is to create "pseudo" bulk RNA-seq data by adding up the fragment counts of a gene across cells for each individual, and then apply methods designed for differential expression using bulk RNA-seq data. This pseudo-bulk solution reduces the distribution of gene expression across cells to a single number and thus loses a good amount of information. We propose to assess differential expression using the gene expression distribution measured by cell level data. We find denoising cell level data can substantially improve the power of this approach. We apply our method, named IDEAS (Individual level Differential Expression Analysis for scRNA-seq), to study the gene expression difference between autism subjects and controls. We find neurogranin-expressing neurons harbor a high proportion of differentially expressed genes, and ERBB signals in microglia are associated with autism.
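The pseudo-bulk construction that this abstract contrasts with IDEAS amounts to summing each gene's fragment counts over the cells of an individual. A minimal Python sketch follows; the function name, data layout, and identifiers are hypothetical, and it shows only the baseline aggregation, not the IDEAS method itself.

```python
from collections import defaultdict


def pseudo_bulk(cell_counts, cell_to_individual):
    """Sum per-cell gene counts into one count per gene and individual.

    cell_counts maps cell id -> {gene: count};
    cell_to_individual maps cell id -> individual id.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cell, counts in cell_counts.items():
        individual = cell_to_individual[cell]
        for gene, count in counts.items():
            totals[individual][gene] += count
    # Convert nested defaultdicts to plain dicts for the caller.
    return {ind: dict(genes) for ind, genes in totals.items()}


# Hypothetical data: three cells from two individuals.
cells = {
    "c1": {"geneA": 2, "geneB": 0},
    "c2": {"geneA": 5, "geneB": 1},
    "c3": {"geneA": 1, "geneB": 4},
}
owners = {"c1": "ind1", "c2": "ind1", "c3": "ind2"}
bulk = pseudo_bulk(cells, owners)
print(bulk)
```

In this toy example, individual ind1 ends up with the single number 7 for geneA, and the information that its two cells had counts 2 and 5 is lost; that loss is what motivates working with the cell-level distribution.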

Best practices on the differential expression analysis of multi-species RNA-seq

Genome Biology ◽

10.1186/s13059-021-02337-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Matthew Chung ◽

Vincent M. Bruno ◽

David A. Rasko ◽

Christina A. Cuomo ◽

José F. Muñoz ◽

...

Keyword(s):

Best Practices ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Single Species ◽

Rna Seq ◽

Species Analysis ◽

Differential Gene ◽

Multiple Species ◽

Downstream Analysis

Advances in transcriptome sequencing allow for simultaneous interrogation of differentially expressed genes from multiple species originating from a single RNA sample, termed dual or multi-species transcriptomics. Compared to single-species differential expression analysis, the design of multi-species differential expression experiments must account for the relative abundances of each organism of interest within the sample, often requiring enrichment methods and yielding differences in total read counts across samples. The analysis of multi-species transcriptomics datasets requires modifications to the alignment, quantification, and downstream analysis steps compared to the single-species analysis pipelines. We describe best practices for multi-species transcriptomics and differential gene expression.
