A Biological Evaluation of Six Gene Set Analysis Methods for Identification of Differentially Expressed Pathways in Microarray Data

Transcriptional survey of alveolar macrophages in a murine model of chronic granulomatous inflammation reveals common themes with human sarcoidosis

AJP Lung Cellular and Molecular Physiology ◽

10.1152/ajplung.00289.2017 ◽

2018 ◽

Vol 314 (4) ◽

pp. L617-L625 ◽

Cited By ~ 8

Author(s):

Arjun Mohan ◽

Anagha Malur ◽

Matthew McPeek ◽

Barbara P. Barna ◽

Lynn M. Schnapp ◽

...

Keyword(s):

Differentially Expressed Genes ◽

Murine Model ◽

Alveolar Macrophages ◽

Gene Set Enrichment Analysis ◽

Differentially Expressed ◽

Multiwall Carbon Nanotube ◽

Gene Set Enrichment ◽

Integrated Network ◽

Gene Set ◽

Gene Sets

To advance our understanding of the pathobiology of sarcoidosis, we developed a multiwall carbon nanotube (MWCNT)-based murine model that shows marked histological and inflammatory signal similarities to this disease. In this study, we compared the alveolar macrophage transcriptional signatures of our animal model with human sarcoidosis to identify overlapping molecular programs. Whole genome microarrays were used to assess gene expression of alveolar macrophages in six MWCNT-exposed and six control animals. The results were compared with the transcriptional profiles of alveolar immune cells in 15 sarcoidosis patients and 12 healthy humans. Rigorous statistical methods were used to identify differentially expressed genes. To better elucidate activated pathways, integrated network and gene set enrichment analyses (GSEA) were performed. We identified over 1,000 differentially expressed genes between control and MWCNT mice. Gene ontology functional analysis showed overrepresentation of processes primarily involved in immunity and inflammation in MWCNT mice. Applying GSEA to both mouse and human samples revealed upregulation of 92 gene sets in MWCNT mice and 142 gene sets in sarcoidosis patients. Commonly activated pathways in both MWCNT mice and sarcoidosis included adaptive immunity, T-cell signaling, IL-12/IL-17 signaling, and oxidative phosphorylation. Differences in gene set enrichment between MWCNT mice and sarcoidosis patients were also observed. We applied network analysis to differentially expressed genes common between the MWCNT model and sarcoidosis to identify key drivers of disease. In conclusion, an integrated network and transcriptomics approach revealed substantial functional similarities between a murine model and human sarcoidosis, particularly with respect to activation of immune-specific pathways.


Measuring consistency among gene set analysis methods: A systematic study

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720019400109 ◽

2019 ◽

Vol 17 (05) ◽

pp. 1940010 ◽

Cited By ~ 1

Author(s):

Farhad Maleki ◽

Katie L. Ovens ◽

Daniel J. Hogan ◽

Elham Rezaei ◽

Alan M. Rosenberg ◽

...

Keyword(s):

Gene Set Analysis ◽

Rna Seq ◽

Systematic Analysis ◽

Gene Set ◽

Large Gene ◽

Analysis Methods ◽

Gene Sets ◽

Significant Gene ◽

Biological Insight ◽

Relevant Gene

Gene set analysis is a quantitative approach for generating biological insight from gene expression datasets. The abundance of gene set analysis methods speaks to their popularity, but raises the question of the extent to which results are affected by the choice of method. Our systematic analysis of 13 popular methods using 6 different datasets, from both DNA microarray and RNA-Seq origin, shows that this choice matters a great deal. We observed that the overall number of gene sets reported by each method differed by up to 2 orders of magnitude, and there was a bias toward reporting large gene sets with some methods. Furthermore, there was substantial disagreement between the 20 most statistically significant gene sets reported by the methods. This was also observed when expanding to the 100 most statistically significant reported gene sets. For different datasets of the same phenotype/condition, the top 20 and top 100 most significant results also showed little to no agreement even when using the same method. GAGE, PAGE, and ORA were the only methods able to achieve relatively high reproducibility when comparing the 20 and 100 most statistically significant gene sets. Biological validation on a juvenile idiopathic arthritis (JIA) dataset showed wide variation in terms of the relevance of the top 20 and top 100 most significant gene sets to known biology of the disease, where GAGE predicted the most relevant gene sets, followed by GSEA, ORA, and PAGE.
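The agreement between the top-ranked gene sets of two methods can be illustrated with a simple overlap measure. The sketch below uses the Jaccard index on the k most significant gene sets; this is an assumption chosen for illustration, since the abstract does not state which overlap measure the authors used, and the function name and gene set labels are hypothetical.

```python
def top_k_jaccard(ranked_a, ranked_b, k=20):
    """Jaccard index between the k top-ranked gene sets of two methods.

    ranked_a, ranked_b: gene set names ordered from most to least significant.
    Returns a value in [0, 1]; 1 means identical top-k lists (ignoring order).
    """
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    union = top_a | top_b
    # Two empty rankings share nothing, so report zero agreement.
    return len(top_a & top_b) / len(union) if union else 0.0


# Hypothetical rankings from two methods (labels are placeholders).
method_1 = ["set_a", "set_b", "set_c"]
method_2 = ["set_b", "set_c", "set_d"]
agreement = top_k_jaccard(method_1, method_2, k=3)
```

Here two of the four distinct gene sets are shared, so the agreement is 0.5. The same function can be applied to results from one method on two datasets of the same condition to gauge reproducibility.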


ADAGE signature analysis: differential expression analysis with data-defined gene sets

10.1101/156620 ◽

2017 ◽

Author(s):

Jie Tan ◽

Matthew Huyck ◽

Dongbo Hu ◽

René A. Zelaya ◽

Deborah A. Hogan ◽

...

Keyword(s):

Differential Expression ◽

Web Server ◽

R Package ◽

Functional Gene ◽

Gene Set Enrichment Analysis ◽

Signature Analysis ◽

Biological Processes ◽

Gene Set ◽

Gene Sets ◽

Public Data

Abstract Background. Gene set enrichment analysis and overrepresentation analyses are commonly used methods to determine the biological processes affected by a differential expression experiment. This approach requires biologically relevant gene sets, which are currently curated manually, limiting their availability and accuracy in many organisms without extensively curated resources. New feature learning approaches can now be paired with existing data collections to directly extract functional gene sets from big data. Results. Here we introduce a method to identify perturbed processes. In contrast with methods that use curated gene sets, this approach uses signatures extracted from public expression data. We first extract expression signatures from public data using ADAGE, a neural network-based feature extraction approach. We next identify signatures that are differentially active under a given treatment. Our results demonstrate that these signatures represent biological processes that are perturbed by the experiment. Because these signatures are directly learned from data without supervision, they can identify uncurated or novel biological processes. We implemented ADAGE signature analysis for the bacterial pathogen Pseudomonas aeruginosa. For the convenience of different user groups, we implemented both an R package (ADAGEpath) and a web server (http://adage.greenelab.com) to run these analyses. Both are open-source to allow easy expansion to other organisms or signature generation methods. We applied ADAGE signature analysis to an example dataset in which wild-type and Δanr mutant cells were grown as biofilms on the Cystic Fibrosis genotype bronchial epithelial cells. We mapped active signatures in the dataset to KEGG pathways and compared with pathways identified using GSEA.
The two approaches generally return consistent results; however, ADAGE signature analysis also identified a signature that revealed the molecularly supported link between the MexT regulon and Anr. Conclusions. We designed ADAGE signature analysis to perform gene set analysis using data-defined functional gene signatures. This approach addresses an important gap for biologists studying non-traditional model organisms and those without extensive curated resources available. We built both an R package and web server to provide ADAGE signature analysis to the community.
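For readers unfamiliar with the overrepresentation analysis mentioned in the background, the following is a minimal sketch of the standard one-sided hypergeometric test that such analyses commonly use. It is not part of ADAGE or the ADAGEpath package; the function and parameter names are hypothetical, and the example numbers are made up.

```python
from math import comb


def ora_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value for overrepresentation.

    n_universe: number of genes measured in the experiment
    n_in_set:   number of those genes that belong to the gene set
    n_selected: number of differentially expressed genes
    n_overlap:  number of differentially expressed genes in the gene set
    Returns P(overlap >= n_overlap) when n_selected genes are drawn at random.
    """
    denom = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, i) * comb(n_universe - n_in_set, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / denom


# Made-up example: 10 genes measured, 5 in the set, 5 selected, all 5 overlap.
p = ora_pvalue(10, 5, 5, 5)
```

In this toy case only one of the 252 possible selections contains all five set members, so the p-value is 1/252. Real analyses would also correct for testing many gene sets.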


Alterations in the host transcriptome in vitro and in vivo following severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection

10.21203/rs.3.rs-37567/v1 ◽

2020 ◽

Author(s):

Xiaomei Lei ◽

Zhijun Feng ◽

Xiaojun Wang ◽

Xiaodong He

Keyword(s):

Gene Expression ◽

Cell Cycle ◽

Microarray Data ◽

Molecular Mechanisms ◽

Enrichment Analysis ◽

Gene Set Enrichment Analysis ◽

Gene Set ◽

Gene Sets ◽

Core Genes

Abstract Background. Exploring alterations in the host transcriptome following SARS-CoV-2 infection is highly warranted, not only to help us understand the molecular mechanisms of the disease but also to provide new perspectives for screening effective antiviral drugs, finding new therapeutic targets, and evaluating the risk of systemic inflammatory response syndrome (SIRS) early. Methods. We downloaded three gene expression matrix files from the Gene Expression Omnibus (GEO) database, extracted the gene expression data for SARS-CoV-2-infected and non-infected human samples and cell line samples, and then performed gene set enrichment analysis (GSEA) on each dataset. Thereafter, we integrated the GSEA results and obtained the co-enriched gene sets and co-core genes across the three microarray datasets. Finally, we constructed a protein-protein interaction (PPI) network and molecular modules for the co-core genes and performed Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis on the module genes to clarify their possible biological processes and underlying signaling pathways. Results. A total of 11 co-enriched gene sets were identified across the three microarray datasets. Among them, 10 gene sets were activated and involved in immune response and inflammatory reaction; 1 gene set was suppressed and involved in the cell cycle. The analysis of molecular modules showed that 2 modules might play a vital role in the pathogenic process of SARS-CoV-2 infection. The KEGG enrichment analysis showed that genes from module one were enriched in signaling pathways related to inflammation, whereas genes from module two were enriched in cell cycle and DNA replication signaling. Notably, necroptosis signaling, a newly identified type of programmed cell death distinct from apoptosis, was also identified in our findings.
Additionally, in patients with SARS-CoV-2 infection, genes from module one showed relatively high expression while genes from module two showed low expression. Conclusions. We identified two molecular modules that may be used to assess severity and predict the prognosis of patients with SARS-CoV-2 infection. In addition, these results provide a unique opportunity to explore more molecular pathways as potential new therapeutic targets in COVID-19.


Effect of the absolute statistic on gene-sampling gene-set analysis methods

Statistical Methods in Medical Research ◽

10.1177/0962280215574014 ◽

2015 ◽

Vol 26 (3) ◽

pp. 1248-1260 ◽

Cited By ~ 2

Author(s):

Dougu Nam

Keyword(s):

False Positive ◽

Genome Wide Association Study ◽

False Positive Rate ◽

Gene Set Enrichment Analysis ◽

Gene Set Analysis ◽

Receiver Operating Curve ◽

Gene Set ◽

Analysis Methods ◽

Positive Rate ◽

The Absolute

Gene-set enrichment analysis and its modified versions have commonly been used for identifying altered functions or pathways in disease from microarray data. In particular, the simple gene-sampling gene-set analysis methods have been heavily used for datasets with only a few sample replicates. The biggest problem with this approach is the highly inflated false-positive rate. In this paper, the effect of absolute gene statistic on gene-sampling gene-set analysis methods is systematically investigated. Thus far, the absolute gene statistic has merely been regarded as a supplementary method for capturing the bidirectional changes in each gene set. Here, it is shown that incorporating the absolute gene statistic in gene-sampling gene-set analysis substantially reduces the false-positive rate and improves the overall discriminatory ability. Its effect was investigated by power, false-positive rate, and receiver operating curve for a number of simulated and real datasets. The performances of gene-set analysis methods in one-tailed (genome-wide association study) and two-tailed (gene expression data) tests were also compared and discussed.
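The idea of a gene-sampling test with an absolute gene statistic can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's procedure: the set score is taken to be the mean gene statistic, the null distribution comes from randomly drawn gene sets of the same size, and all names and data are hypothetical.

```python
import random


def gene_sampling_pvalue(gene_stats, gene_set, use_absolute=True,
                         n_perm=2000, seed=0):
    """One-sided gene-sampling p-value for a gene set.

    gene_stats: dict mapping gene name -> gene-level statistic (e.g. a t-score)
    gene_set:   iterable of gene names
    The set score is the mean (absolute, if requested) gene statistic, compared
    with the scores of random gene sets of the same size.
    """
    rng = random.Random(seed)
    transform = abs if use_absolute else (lambda x: x)
    scores = {g: transform(s) for g, s in gene_stats.items()}
    members = [g for g in gene_set if g in scores]
    if not members:
        raise ValueError("gene set has no measured genes")
    observed = sum(scores[g] for g in members) / len(members)
    background = list(scores.values())
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(background, len(members))
        if sum(sample) / len(members) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: the first 10 genes change strongly but in opposite directions.
toy_stats = {f"g{i}": (5.0 if i % 2 == 0 else -5.0) if i < 10
             else 0.1 * ((i % 7) - 3) for i in range(100)}
toy_set = [f"g{i}" for i in range(10)]
p_absolute = gene_sampling_pvalue(toy_stats, toy_set, use_absolute=True)
p_signed = gene_sampling_pvalue(toy_stats, toy_set, use_absolute=False)
```

In this toy example the signed mean of the set cancels to zero, so the signed test sees nothing unusual, while the absolute statistic captures the bidirectional change. The false-positive behaviour discussed in the abstract is a separate property that this sketch does not demonstrate.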


Statistical and Biological Evaluation of Different Gene Set Analysis Methods

Procedia Environmental Sciences ◽

10.1016/j.proenv.2011.10.106 ◽

2011 ◽

Vol 8 ◽

pp. 693-699 ◽

Cited By ~ 3

Author(s):

Wenjun Cao ◽

Yunming Li ◽

Danhong Liu ◽

Changsheng Chen ◽

Yongyong Xu

Keyword(s):

Biological Evaluation ◽

Gene Set Analysis ◽

Gene Set ◽

Analysis Methods


Host transcriptome alterations in vitro and in vivo following severe acute respiratory syndrome coronavirus 2 infection

10.21203/rs.3.rs-37567/v2 ◽

2021 ◽

Author(s):

Yannian Luo ◽

Juan Xu ◽

Mingzhen Zhou ◽

Xiaomei Lei ◽

Wen Cao ◽

...

Keyword(s):

Gene Expression ◽

Cell Cycle ◽

Microarray Data ◽

Molecular Mechanisms ◽

Enrichment Analysis ◽

Gene Set Enrichment Analysis ◽

Gene Set ◽

Gene Sets ◽

Core Genes

Abstract Background. Exploring alterations in the host transcriptome following SARS-CoV-2 infection is highly warranted, not only to help us understand the molecular mechanisms of the disease but also to provide new perspectives for screening effective antiviral drugs, finding new therapeutic targets, and evaluating the risk of systemic inflammatory response syndrome (SIRS) early. Methods. We downloaded three gene expression matrix files from the Gene Expression Omnibus (GEO) database, extracted the gene expression data for SARS-CoV-2-infected and non-infected human samples and cell line samples, and then performed gene set enrichment analysis (GSEA) on each dataset. Thereafter, we integrated the GSEA results and obtained the co-enriched gene sets and co-core genes across the three microarray datasets. Finally, we constructed a protein-protein interaction (PPI) network and molecular modules for the co-core genes and performed Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis on the module genes to clarify their possible biological processes and underlying signaling pathways. Results. A total of 11 co-enriched gene sets were identified across the three microarray datasets. Among them, 10 gene sets were activated and involved in immune response and inflammatory reaction; 1 gene set was suppressed and involved in the cell cycle. The analysis of molecular modules showed that 2 modules might play a vital role in the pathogenic process of SARS-CoV-2 infection. The KEGG enrichment analysis showed that genes from module one were enriched in signaling pathways related to inflammation, whereas genes from module two were enriched in cell cycle and DNA replication signaling. Notably, necroptosis signaling, a newly identified type of programmed cell death distinct from apoptosis, was also identified in our findings.
Additionally, in patients with SARS-CoV-2 infection, genes from module one showed relatively high expression while genes from module two showed low expression. Conclusions. We identified two molecular modules that may be used to assess severity and predict the prognosis of patients with SARS-CoV-2 infection. In addition, these results provide a unique opportunity to explore more molecular pathways as potential new therapeutic targets in COVID-19.


Silver: Forging almost Gold Standard Datasets

Genes ◽

10.3390/genes12101523 ◽

2021 ◽

Vol 12 (10) ◽

pp. 1523

Author(s):

Farhad Maleki ◽

Katie Ovens ◽

Ian McQuillan ◽

Anthony J. Kusalik

Keyword(s):

Gold Standard ◽

Best Practice ◽

Evaluation Studies ◽

A Priori ◽

Real Data ◽

Gene Set Analysis ◽

Gene Set ◽

Analysis Methods ◽

Gene Sets ◽

New Gene

Gene set analysis has been widely used to gain insight from high-throughput expression studies. Although various tools and methods have been developed for gene set analysis, there is no consensus among researchers regarding best practice(s). Most often, evaluation studies have reported contradictory recommendations of which methods are superior. Therefore, an unbiased quantitative framework for evaluations of gene set analysis methods will be valuable. Such a framework requires gene expression datasets where enrichment status of gene sets is known a priori. In the absence of such gold standard datasets, artificial datasets are commonly used for evaluations of gene set analysis methods; however, they often rely on oversimplifying assumptions that make them biased in favor of or against a given method. In this paper, we propose a quantitative framework for evaluation of gene set analysis methods by synthesizing expression datasets using real data, without relying on oversimplifying or unrealistic assumptions, while preserving complex gene–gene correlations and retaining the distribution of expression values. The utility of the quantitative approach is shown by evaluating ten widely used gene set analysis methods. An implementation of the proposed method is publicly available. We suggest using Silver to evaluate existing and new gene set analysis methods. Evaluation using Silver provides a better understanding of current methods and can aid in the development of gene set analysis methods to achieve higher specificity without sacrificing sensitivity.


Application of biclustering of gene expression data and gene set enrichment analysis methods to identify potentially disease causing nanomaterials

Beilstein Journal of Nanotechnology ◽

10.3762/bjnano.6.252 ◽

2015 ◽

Vol 6 ◽

pp. 2438-2448 ◽

Cited By ~ 14

Author(s):

Andrew Williams ◽

Sabina Halappanavar

Keyword(s):

Gene Expression ◽

Pulmonary Fibrosis ◽

Expression Profiles ◽

Enrichment Analysis ◽

Gene Set Enrichment Analysis ◽

Data Driven ◽

Gene Set Enrichment ◽

Gene Set ◽

Analysis Methods ◽

Gene Sets

Background: The presence of diverse types of nanomaterials (NMs) in commerce is growing at an exponential pace. As a result, human exposure to these materials in the environment is inevitable, necessitating rapid and reliable toxicity testing methods to accurately assess the potential hazards associated with NMs. In this study, we applied biclustering and gene set enrichment analysis methods to derive essential features of the altered lung transcriptome following exposure to NMs that are associated with lung-specific diseases. Several datasets from public microarray repositories describing pulmonary diseases in mouse models following exposure to a variety of substances were examined, and functionally related biclusters of genes showing similar expression profiles were identified. The identified biclusters were then used to conduct a gene set enrichment analysis on pulmonary gene expression profiles derived from mice exposed to nano-titanium dioxide (nano-TiO2), carbon black (CB) or carbon nanotubes (CNTs) to determine the disease significance of these data-driven gene sets. Results: Biclusters representing inflammation (chemokine activity), DNA binding, cell cycle, apoptosis, reactive oxygen species (ROS) and fibrosis processes were identified. All of the NM studies were significant with respect to the bicluster related to chemokine activity (DAVID; FDR p-value = 0.032). The bicluster related to pulmonary fibrosis was enriched in the studies investigating toxicity induced by CNTs and CB, suggesting the potential for these materials to induce lung fibrosis. The pro-fibrogenic potential of CNTs is well established. Although CB has not been shown to induce fibrosis, it induces stronger inflammatory, oxidative stress and DNA damage responses than nano-TiO2 particles. Conclusion: The results of the analysis correctly identified all NMs to be inflammogenic and only CB and CNTs as potentially fibrogenic.
In addition to identifying several previously defined, functionally relevant gene sets, the present study also identified two novel gene sets: a gene set associated with pulmonary fibrosis and a gene set associated with ROS, underlining the advantage of using a data-driven approach to identify novel, functionally related gene sets. The results can be used in future gene set enrichment analysis studies involving NMs or as features for clustering and classifying NMs of diverse properties.


A comparative study on gene-set analysis methods for assessing differential expression associated with the survival phenotype

BMC Bioinformatics ◽

10.1186/1471-2105-12-377 ◽

2011 ◽

Vol 12 (1) ◽

Cited By ~ 9

Author(s):

Seungyeoun Lee ◽

Jinheum Kim ◽

Sunho Lee

Keyword(s):

Comparative Study ◽

Differential Expression ◽

Gene Set Analysis ◽

Gene Set ◽

Analysis Methods
