MOGSA: integrative single sample gene-set analysis of multiple omics data

2016 ◽  
Author(s):  
Chen Meng ◽  
Azfar Basunia ◽  
Bjoern Peters ◽  
Amin Moghaddas Gholami ◽  
Bernhard Kuster ◽  
...  

Abstract Gene set analysis (GSA) summarizes individual molecular measurements to more interpretable pathways or gene sets and has become an indispensable step in the interpretation of large scale omics data. However, GSA methods are limited to the analysis of single omics data. Here, we introduce a new computational method termed multi-omics gene set analysis (MOGSA), a multivariate single sample gene-set analysis method that integrates multiple experimental and molecular data types measured over the same set of samples. The method learns a low dimensional representation of the most variant correlated features (genes, proteins, etc.) across multiple omics data sets, transforms the features onto the same scale and calculates an integrated gene set score from the most informative features in each data type. MOGSA does not require filtering data to the intersection of features (gene IDs); therefore, all molecular features, including those that lack annotation, may be included in the analysis. We demonstrate that integrating multiple diverse sources of molecular data increases the power to discover subtle changes in gene-sets and may reduce the impact of unreliable information in any single data type. Using simulated data, we show that integrative analysis with MOGSA outperforms other single sample GSA methods. We applied MOGSA to three studies with experimental data. First, we used NCI60 transcriptome and proteome data to demonstrate the benefit of removing a source of noise in the omics data. Second, we discovered similarities and differences in mRNA, protein and phosphorylation profiles of induced pluripotent and embryonic stem cell lines. We demonstrate how to assess the influence of each data type or feature on a MOGSA gene set score. Finally, we report that three molecular subtypes are robustly discovered when copy number variation and mRNA profiling data of 308 bladder cancers from The Cancer Genome Atlas are integrated using MOGSA. 
MOGSA is available in the Bioconductor R package “mogsa”.

2019 ◽  
Vol 18 (8 suppl 1) ◽  
pp. S153-S168 ◽  
Author(s):  
Chen Meng ◽  
Azfar Basunia ◽  
Bjoern Peters ◽  
Amin Moghaddas Gholami ◽  
Bernhard Kuster ◽  
...  

2020 ◽  
Vol 40 (7) ◽  
Author(s):  
Pashupati P. Mishra ◽  
Ismo Hänninen ◽  
Emma Raitoharju ◽  
Saara Marttila ◽  
Binisha H. Mishra ◽  
...  

Abstract Smoking, a major risk factor for morbidity, affects numerous regulatory systems of the human body, including DNA methylation. Most previous studies with genome-wide methylation data are based on conventional association analysis and the earliest threshold-based gene set analysis, which lacks the sensitivity to reveal all the relevant effects of smoking. The aim of the present study was to investigate the impact of active smoking on DNA methylation at three biological levels: 5′-C-phosphate-G-3′ (CpG) sites, genes and functionally related genes (gene sets). Gene set analysis was done with mGSZ, a modern threshold-free method previously developed by us that utilizes all the genes in the experiment and their differential methylation scores. Applying such a method in a DNA methylation study is novel. Epigenome-wide methylation levels were profiled from Young Finns Study (YFS) participants’ whole blood from the 2011 follow-up using Illumina Infinium HumanMethylation450 BeadChips. We identified three novel smoking-related CpG sites and replicated 57 of the previously identified ones. We found that smoking is associated with hypomethylation in shore (genomic regions 0–2 kilobases from CpG island). We identified smoking-related methylation changes in 13 gene sets with false discovery rate (FDR) ≤ 0.05, among which is olfactory receptor activity, the flagship novel finding of the present study. Overall, we extended the current knowledge by identifying: (i) three novel smoking-related CpG sites, (ii) similar effects as aging on average methylation in shore, and (iii) a novel finding that the olfactory receptor activity pathway responds to tobacco smoke and toxin exposure through epigenetic mechanisms.


F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 129 ◽  
Author(s):  
Michael Prummer

Differential gene expression (DGE) studies often suffer from poor interpretability of their primary results, i.e., thousands of differentially expressed genes. This has led to the introduction of gene set analysis (GSA) methods that aim at identifying interpretable global effects by grouping genes into sets of common context, such as molecular pathways, biological function or tissue localization. In practice, GSA often results in hundreds of differentially regulated gene sets. Similar to the genes they contain, gene sets are often regulated in a correlative fashion because they share many of their genes or they describe related processes. Using this kind of neighborhood information to construct networks of gene sets makes it possible to identify highly connected sub-networks as well as poorly connected islands or singletons. We show here how topological information and other network features can be used to filter and prioritize gene sets in routine DGE studies. Community detection in combination with automatic labeling and the network representation of gene set clusters further constitute an appealing and intuitive visualization of GSA results. The RICHNET workflow described here does not require human intervention and can thus be conveniently incorporated in automated analysis pipelines.


2019 ◽  
Vol 17 (05) ◽  
pp. 1940010 ◽  
Author(s):  
Farhad Maleki ◽  
Katie L. Ovens ◽  
Daniel J. Hogan ◽  
Elham Rezaei ◽  
Alan M. Rosenberg ◽  
...  

Gene set analysis is a quantitative approach for generating biological insight from gene expression datasets. The abundance of gene set analysis methods speaks to their popularity, but raises the question of the extent to which results are affected by the choice of method. Our systematic analysis of 13 popular methods using 6 different datasets, from both DNA microarray and RNA-Seq origin, shows that this choice matters a great deal. We observed that the overall number of gene sets reported by each method differed by up to 2 orders of magnitude, and there was a bias toward reporting large gene sets with some methods. Furthermore, there was substantial disagreement between the 20 most statistically significant gene sets reported by the methods. This was also observed when expanding to the 100 most statistically significant reported gene sets. For different datasets of the same phenotype/condition, the top 20 and top 100 most significant results also showed little to no agreement even when using the same method. GAGE, PAGE, and ORA were the only methods able to achieve relatively high reproducibility when comparing the 20 and 100 most statistically significant gene sets. Biological validation on a juvenile idiopathic arthritis (JIA) dataset showed wide variation in terms of the relevance of the top 20 and top 100 most significant gene sets to known biology of the disease, where GAGE predicted the most relevant gene sets, followed by GSEA, ORA, and PAGE.
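ORA, one of the methods named above, is commonly formulated as a one-sided hypergeometric test on the overlap between a gene set and a list of selected genes. The following is a minimal sketch of that formulation, not the implementation evaluated in the study; the function name, its parameters, and the example counts are hypothetical.

```python
from math import comb


def ora_pvalue(n_universe, n_set, n_selected, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    n_universe: number of genes measured in the experiment
    n_set:      number of those genes that belong to the gene set
    n_selected: number of genes called significant
    n_overlap:  number of significant genes that belong to the gene set
    Returns P(X >= n_overlap) when n_selected genes are drawn at random.
    """
    largest_overlap = min(n_set, n_selected)
    all_draws = comb(n_universe, n_selected)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
        for k in range(n_overlap, largest_overlap + 1)
    )
    return tail / all_draws


# Hypothetical counts: 20 of 200 selected genes fall in a 100-gene set,
# out of 10,000 measured genes.
p_value = ora_pvalue(10_000, 100, 200, 20)
```

Because the test depends on a significance cutoff for the selected gene list and on the size of the gene set, different cutoffs and gene-set collections can change which sets are reported.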


2021 ◽  
Vol 12 ◽  
Author(s):  
Michal Marczyk ◽  
Agnieszka Macioszek ◽  
Joanna Tobiasz ◽  
Joanna Polanska ◽  
Joanna Zyla

A typical genome-wide association study (GWAS) analyzes millions of single-nucleotide polymorphisms (SNPs), several of which are in a region of the same gene. To conduct gene set analysis (GSA), information from SNPs needs to be unified at the gene level. A widely used practice is to use only the most relevant SNP per gene; however, there are other methods of integration that could be applied here. Also, the problem of nonrandom association of alleles at two or more loci is often neglected. Here, we tested the impact of incorporation of different integrations and linkage disequilibrium (LD) correction on the performance of several GSA methods. Matched normal and breast cancer samples from The Cancer Genome Atlas database were used to evaluate the performance of six GSA algorithms: Coincident Extreme Ranks in Numerical Observations (CERNO), Gene Set Enrichment Analysis (GSEA), GSEA-SNP, improved GSEA for GWAS (i-GSEA4GWAS), Meta-Analysis Gene-set Enrichment of variaNT Associations (MAGENTA), and Over-Representation Analysis (ORA). Association of SNPs to phenotype was calculated using a modified McNemar’s test. Results for SNPs mapped to the same gene were integrated using the Fisher and Stouffer methods and compared with the minimum p-value method. Four common measures were used to quantify the performance of all combinations of methods. Results of GSA on GWAS were compared to those obtained on gene expression data. Comparing all evaluation metrics across different GSA algorithms, integrations, and LD correction, we highlighted CERNO and MAGENTA with Stouffer as the most efficient. Applying LD correction increased prioritization and specificity of enrichment outcomes for all tested algorithms. When Fisher or Stouffer were used with LD, sensitivity and reproducibility were also better. Using any integration method was beneficial in comparison with a minimum p-value method in specific combinations. 
The correlation between GSA results from genomic and transcriptomic level was the highest when Stouffer integration was combined with LD correction. We thoroughly evaluated different approaches to GSA in GWAS in terms of performance to guide others to select the most effective combinations. We showed that LD correction and Stouffer integration could increase the performance of enrichment analysis and encourage the usage of these techniques.
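The three SNP-to-gene integration rules compared above can be sketched as follows. This is an illustrative sketch of the textbook Fisher, unweighted Stouffer, and minimum p-value rules, not the authors' code. It assumes independent p-values strictly between 0 and 1 and applies no LD correction; the function names and the example p-values are hypothetical.

```python
import math
from statistics import NormalDist


def min_p(pvalues):
    """Minimum p-value rule: keep only the most significant SNP of the gene."""
    return min(pvalues)


def fisher(pvalues):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k d.o.f."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    # Closed-form chi-square survival function, valid for even d.o.f. (2k):
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def stouffer(pvalues):
    """Unweighted Stouffer's method: Z = sum(z_i) / sqrt(k)."""
    norm = NormalDist()
    z = sum(norm.inv_cdf(1.0 - p) for p in pvalues) / math.sqrt(len(pvalues))
    return 1.0 - norm.cdf(z)


# Hypothetical p-values for four SNPs mapped to one gene.
snp_pvalues = [0.04, 0.10, 0.03, 0.20]
gene_level = {
    "min_p": min_p(snp_pvalues),
    "fisher": fisher(snp_pvalues),
    "stouffer": stouffer(snp_pvalues),
}
```

Both Fisher's and Stouffer's rules assume independent tests, which neighboring SNPs in LD violate; that is one reason an LD correction can change the combined result.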


2021 ◽  
Author(s):  
Chi-Hsuan Ho ◽  
Yu-Jyun Huang ◽  
Ying-Ju Lai ◽  
Rajarshi Mukherjee ◽  
Chuhsing Kate Hsiao

Abstract Gene-set analysis (GSA) has been one of the standard procedures for exploring potential biological functions when a group of differentially expressed genes has been derived. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance with a common implicit assumption that the multivariate expression values are normally distributed. The validity of this assumption has been disputed in several studies but no systematic analysis has been carried out to assess the influence of this distributional assumption. Our goal in this study is not to propose a new GSA method but to first examine if the multi-dimensional gene expression data in gene sets follow a multivariate normal distribution (MVN). Six statistical methods in three categories of MVN tests were considered and applied to a total of twenty-two datasets of expression data from studies involving tumor and normal tissues, with ten signaling pathways chosen as the gene sets. Second, we evaluated the influence of non-normality on the performance of current GSA tools, including parametric and non-parametric methods. Specifically, the scenario of mixture distributions representing the case of different tumor subtypes was considered. Our first finding suggests that the MVN assumption should be carefully dealt with. It does not hold true in many applications tested here. The second investigation of the GSA tools demonstrates that non-normality does affect the performance of these GSA methods, especially when subtypes exist. We conclude that the use of the inherent multivariate normality assumption should be assessed with care in evaluating new GSA tools, since this MVN assumption cannot be guaranteed and it strongly affects the performance of GSA methods. 
If a newly proposed GSA method is to be evaluated, we recommend the incorporation of multivariate non-normal distributions or sampling from large databases if available.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Ewoud Ewing ◽  
Nuria Planell-Picola ◽  
Maja Jagodic ◽  
David Gomez-Cabrero

Abstract Background: Gene-set analysis tools, which make use of curated sets of molecules grouped based on their shared functions, aim to identify which gene-sets are over-represented in the set of features that have been associated with a given trait of interest. Such tools are frequently used in gene-centric approaches derived from RNA-sequencing or microarrays such as Ingenuity or GSEA, but they have also been adapted for interval-based analysis derived from DNA methylation or ChIP/ATAC-sequencing. Gene-set analysis tools return, as a result, a list of significant gene-sets. However, while these results are useful for the researcher in the identification of major biological insights, they may be complex to interpret because many gene-sets have largely overlapping gene contents. Additionally, in many cases the result of gene-set analysis consists of a large number of gene-sets making it complicated to identify the major biological insights. Results: We present GeneSetCluster, a novel approach which allows clustering of identified gene-sets, from one or multiple experiments and/or tools, based on shared genes. GeneSetCluster calculates a distance score based on overlapping gene content, which is then used to cluster them together and as a result, GeneSetCluster identifies groups of gene-sets with similar gene-set definitions (i.e. gene content). These groups of gene-sets can aid the researcher to focus on such groups for biological interpretations. Conclusions: GeneSetCluster is a novel approach for grouping together post gene-set analysis results based on overlapping gene content. GeneSetCluster is implemented as a package in R. The package and the vignette can be downloaded at https://github.com/TranslationalBioinformaticsUnit
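The idea of grouping gene-sets by overlapping gene content can be illustrated with a small sketch. It is written in Python rather than R and is not the GeneSetCluster implementation: the Jaccard distance is swapped in here as a stand-in for the package's distance score, and the grouping rule, threshold, function names, and example sets are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0 means identical gene content, 1 means disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def group_gene_sets(gene_sets, max_distance=0.5):
    """Single-linkage grouping: link any two sets within max_distance."""
    parent = {name: name for name in gene_sets}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(gene_sets, 2):
        if jaccard_distance(gene_sets[a], gene_sets[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups = {}
    for name in gene_sets:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical gene-sets; the memberships are illustrative only.
example = {
    "set_A": {"TP53", "MDM2", "CDKN1A"},
    "set_B": {"TP53", "MDM2", "ATM"},
    "set_C": {"OR1A1", "OR2W1"},
}
groups = group_gene_sets(example)
```

In this example set_A and set_B share two of four distinct genes, so they fall into one group, while set_C shares no genes with either and stays on its own.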


Genes ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 1523
Author(s):  
Farhad Maleki ◽  
Katie Ovens ◽  
Ian McQuillan ◽  
Anthony J. Kusalik

Gene set analysis has been widely used to gain insight from high-throughput expression studies. Although various tools and methods have been developed for gene set analysis, there is no consensus among researchers regarding best practice(s). Most often, evaluation studies have reported contradictory recommendations of which methods are superior. Therefore, an unbiased quantitative framework for evaluations of gene set analysis methods will be valuable. Such a framework requires gene expression datasets where enrichment status of gene sets is known a priori. In the absence of such gold standard datasets, artificial datasets are commonly used for evaluations of gene set analysis methods; however, they often rely on oversimplifying assumptions that make them biased in favor of or against a given method. In this paper, we propose a quantitative framework for evaluation of gene set analysis methods by synthesizing expression datasets using real data, without relying on oversimplifying or unrealistic assumptions, while preserving complex gene–gene correlations and retaining the distribution of expression values. The utility of the quantitative approach is shown by evaluating ten widely used gene set analysis methods. An implementation of the proposed method is publicly available. We suggest using Silver to evaluate existing and new gene set analysis methods. Evaluation using Silver provides a better understanding of current methods and can aid in the development of gene set analysis methods to achieve higher specificity without sacrificing sensitivity.

