Epigenome-450K-wide methylation signatures of active cigarette smoking: The Young Finns Study

Pashupati P. Mishra; Ismo Hänninen; Emma Raitoharju; Saara Marttila; Binisha H. Mishra; Nina Mononen; Mika Kähönen; Mikko Hurme; Olli Raitakari; Petri Törönen; Liisa Holm; Terho Lehtimäki

doi:10.1042/bsr20200596

Bioscience Reports ◽

10.1042/bsr20200596 ◽

2020 ◽

Vol 40 (7) ◽

Author(s):

Pashupati P. Mishra ◽

Ismo Hänninen ◽

Emma Raitoharju ◽

Saara Marttila ◽

Binisha H. Mishra ◽

...

Keyword(s):

Dna Methylation ◽

Olfactory Receptor ◽

Cpg Island ◽

Receptor Activity ◽

Gene Set Analysis ◽

Gene Set ◽

Cpg Sites ◽

Gene Sets ◽

The Impact ◽

Young Finns Study

Abstract Smoking, a major risk factor for morbidity, affects numerous regulatory systems of the human body, including DNA methylation. Most previous studies of genome-wide methylation data rely on conventional association analysis and early threshold-based gene set analysis, which lacks the sensitivity to reveal all the relevant effects of smoking. The aim of the present study was to investigate the impact of active smoking on DNA methylation at three biological levels: 5′-C-phosphate-G-3′ (CpG) sites, genes and functionally related genes (gene sets). Gene set analysis was done with mGSZ, a modern threshold-free method previously developed by us that utilizes all the genes in the experiment and their differential methylation scores. Applying such a method in a DNA methylation study is novel. Epigenome-wide methylation levels were profiled from whole blood of Young Finns Study (YFS) participants at the 2011 follow-up using Illumina Infinium HumanMethylation450 BeadChips. We identified three novel smoking-related CpG sites and replicated 57 previously identified ones. We found that smoking is associated with hypomethylation in shores (genomic regions 0–2 kilobases from a CpG island). We identified smoking-related methylation changes in 13 gene sets with false discovery rate (FDR) ≤ 0.05, among which is olfactory receptor activity, the flagship novel finding of the present study. Overall, we extended current knowledge by identifying: (i) three novel smoking-related CpG sites, (ii) effects on average shore methylation similar to those of aging, and (iii) a novel finding that the olfactory receptor activity pathway responds to tobacco smoke and toxin exposure through epigenetic mechanisms.
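
The FDR ≤ 0.05 cut-off on gene sets can be illustrated with a small sketch. The abstract does not say which FDR procedure was used, so the Benjamini-Hochberg step-up procedure is assumed here purely for illustration; the function name and the p-values are hypothetical and are not taken from the study or from mGSZ.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical gene-set p-values (illustration only, not study data).
p = [0.001, 0.008, 0.039, 0.041, 0.6]
q = benjamini_hochberg(p)
# Gene sets that would pass an FDR <= 0.05 threshold.
significant = [i for i, value in enumerate(q) if value <= 0.05]
```

With these made-up inputs only the first two gene sets pass the threshold, even though four raw p-values are below 0.05, which is the point of controlling the FDR across many gene sets.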

MOGSA: integrative single sample gene-set analysis of multiple omics data

10.1101/046904 ◽

2016 ◽

Cited By ~ 3

Author(s):

Chen Meng ◽

Azfar Basunia ◽

Bjoern Peters ◽

Amin Moghaddas Gholami ◽

Bernhard Kuster ◽

...

Keyword(s):

Embryonic Stem ◽

Molecular Data ◽

Data Type ◽

Single Sample ◽

Gene Set Analysis ◽

Computation Method ◽

Omics Data ◽

Gene Set ◽

Gene Sets ◽

The Impact

Abstract Gene set analysis (GSA) summarizes individual molecular measurements to more interpretable pathways or gene sets and has become an indispensable step in the interpretation of large scale omics data. However, GSA methods are limited to the analysis of single omics data. Here, we introduce a new computation method termed multi-omics gene set analysis (MOGSA), a multivariate single sample gene-set analysis method that integrates multiple experimental and molecular data types measured over the same set of samples. The method learns a low dimensional representation of most variant correlated features (genes, proteins, etc.) across multiple omics data sets, transforms the features onto the same scale and calculates an integrated gene set score from the most informative features in each data type. MOGSA does not require filtering data to the intersection of features (gene IDs), therefore, all molecular features, including those that lack annotation may be included in the analysis. We demonstrate that integrating multiple diverse sources of molecular data increases the power to discover subtle changes in gene-sets and may reduce the impact of unreliable information in any single data type. Using simulated data, we show that integrative analysis with MOGSA outperforms other single sample GSA methods. We applied MOGSA to three studies with experimental data. First, we used NCI60 transcriptome and proteome data to demonstrate the benefit of removing a source of noise in the omics data. Second, we discovered similarities and differences in mRNA, protein and phosphorylation profiles of induced pluripotent and embryonic stem cell lines. We demonstrate how to assess the influence of each data type or feature to a MOGSA gene set score. Finally, we report that three molecular subtypes are robustly discovered when copy number variation and mRNA profiling data of 308 bladder cancers from The Cancer Genome Atlas are integrated using MOGSA.
MOGSA is available in the Bioconductor R package “mogsa”.

Enhancing gene set enrichment using networks

F1000Research ◽

10.12688/f1000research.17824.2 ◽

2019 ◽

Vol 8 ◽

pp. 129 ◽

Cited By ~ 1

Author(s):

Michael Prummer

Keyword(s):

Biological Function ◽

Automated Analysis ◽

Gene Set Analysis ◽

Molecular Pathways ◽

Human Intervention ◽

Gene Set Enrichment ◽

Topological Information ◽

Gene Set ◽

Gene Sets ◽

Differential Gene

Differential gene expression (DGE) studies often suffer from poor interpretability of their primary results, i.e., thousands of differentially expressed genes. This has led to the introduction of gene set analysis (GSA) methods that aim at identifying interpretable global effects by grouping genes into sets of common context, such as molecular pathways, biological function or tissue localization. In practice, GSA often results in hundreds of differentially regulated gene sets. Similar to the genes they contain, gene sets are often regulated in a correlative fashion because they share many of their genes or they describe related processes. Using this kind of neighborhood information to construct networks of gene sets makes it possible to identify highly connected sub-networks as well as poorly connected islands or singletons. We show here how topological information and other network features can be used to filter and prioritize gene sets in routine DGE studies. Community detection in combination with automatic labeling and the network representation of gene set clusters further constitute an appealing and intuitive visualization of GSA results. The RICHNET workflow described here does not require human intervention and can thus be conveniently incorporated in automated analysis pipelines.

Stage-differentiated modelling of DNA methylation landscapes uncovers salient biomarkers and prognostic signatures in colorectal cancer progression

10.21203/rs.3.rs-244816/v1 ◽

2021 ◽

Author(s):

Sangeetha Muthamilselvan ◽

Abirami Raghavendran ◽

Ashok Palaniappan

Keyword(s):

Colorectal Cancer ◽

Dna Methylation ◽

Cpg Island ◽

Early Stage ◽

P Value ◽

Consensus Analysis ◽

One Stage ◽

Differentially Methylated Genes ◽

Methylation Patterns ◽

The Impact

Abstract Background: Aberrant DNA methylation acts epigenetically to skew the gene transcription rate up or down, with causative roles in the etiology of cancers. However, research on the role of DNA methylation in driving the progression of cancers is limited. In this study, we have developed a comprehensive computational framework for the stage-differentiated modelling of DNA methylation landscapes in colorectal cancer (CRC), and unravelled significant stagewise signposts of CRC progression. Methods: The methylation β-matrix was derived from the public-domain TCGA data, converted into an M-value matrix, annotated with AJCC stages, and analysed for stage-salient genes using multiple approaches involving stage-differentiated linear modelling of methylation patterns and/or expression patterns. Differentially methylated genes (DMGs) were identified using a contrast against controls (adjusted p-value <0.001 and |log fold-change of M-value| >2). These results were filtered using a series of all possible pairwise stage contrasts (p-value <0.05) to obtain stage-salient DMGs. These were then subjected to a consensus analysis, followed by Kaplan–Meier survival analysis to evaluate the impact of methylation patterns of consensus stage-salient biomarkers on disease prognosis. Results: We found significant genome-wide changes in methylation patterns in cancer cases relative to controls agnostic of stage. Our stage-differentiated analysis yielded the following stage-salient genes: one stage-I gene (FBN1), one stage-II gene (FOXG1), one stage-III gene (HCN1) and four stage-IV genes (NELL1, ZNF135, FAM123A, LAMA1). All the biomarkers were hypermethylated, indicating down-regulation and signifying a CpG Island Methylator Phenotype (CIMP) manifestation. A significant prognostic signature consisting of FBN1 and FOXG1 survived all the steps of our analysis pipeline, and represents a novel early-stage biomarker.
Conclusions: We have designed a workflow for stage-differentiated consensus analysis, and identified stage-salient diagnostic biomarkers and an early-stage prognostic biomarker panel. Our studies further yield a novel CIMP-like signature of potential clinical import underlying CRC progression.
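
The β-to-M-value conversion and the DMG thresholds in the Methods can be sketched as follows. This is a minimal illustration, assuming the commonly used logit transform M = log2(β / (1 − β)) and assuming that the "log fold-change of M-value" is a difference of mean M-values between groups; the clipping constant `eps`, the function names and the example β values are hypothetical and do not come from the paper.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M-value (base-2 logit)."""
    # Clip to avoid division by zero or log of zero; eps is an assumed constant.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def is_dmg(adjusted_p, delta_m):
    """Apply the thresholds quoted in the abstract: adj. p < 0.001 and |delta M| > 2."""
    return adjusted_p < 0.001 and abs(delta_m) > 2


# Hypothetical mean beta values for one gene: tumour 0.8, control 0.2.
delta_m = beta_to_m(0.8) - beta_to_m(0.2)  # roughly 2 - (-2) = 4
called = is_dmg(adjusted_p=1e-5, delta_m=delta_m)
```

M-values are unbounded and closer to homoscedastic than β values, which is the usual reason for fitting linear models on the M scale.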

Measuring consistency among gene set analysis methods: A systematic study

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720019400109 ◽

2019 ◽

Vol 17 (05) ◽

pp. 1940010 ◽

Cited By ~ 1

Author(s):

Farhad Maleki ◽

Katie L. Ovens ◽

Daniel J. Hogan ◽

Elham Rezaei ◽

Alan M. Rosenberg ◽

...

Keyword(s):

Gene Set Analysis ◽

Rna Seq ◽

Systematic Analysis ◽

Gene Set ◽

Large Gene ◽

Analysis Methods ◽

Gene Sets ◽

Significant Gene ◽

Biological Insight ◽

Relevant Gene

Gene set analysis is a quantitative approach for generating biological insight from gene expression datasets. The abundance of gene set analysis methods speaks to their popularity, but raises the question of the extent to which results are affected by the choice of method. Our systematic analysis of 13 popular methods using 6 different datasets, from both DNA microarray and RNA-Seq origin, shows that this choice matters a great deal. We observed that the overall number of gene sets reported by each method differed by up to 2 orders of magnitude, and there was a bias toward reporting large gene sets with some methods. Furthermore, there was substantial disagreement between the 20 most statistically significant gene sets reported by the methods. This was also observed when expanding to the 100 most statistically significant reported gene sets. For different datasets of the same phenotype/condition, the top 20 and top 100 most significant results also showed little to no agreement even when using the same method. GAGE, PAGE, and ORA were the only methods able to achieve relatively high reproducibility when comparing the 20 and 100 most statistically significant gene sets. Biological validation on a juvenile idiopathic arthritis (JIA) dataset showed wide variation in terms of the relevance of the top 20 and top 100 most significant gene sets to known biology of the disease, where GAGE predicted the most relevant gene sets, followed by GSEA, ORA, and PAGE.

SUN-715 IIM May Influence Matured Oocytes’ DNA Methylation of PCOS Patients

Journal of the Endocrine Society ◽

10.1210/jendso/bvaa046.1031 ◽

2020 ◽

Vol 4 (Supplement_1) ◽

Author(s):

Congru Li ◽

Yang Yu

Keyword(s):

Dna Methylation ◽

In Vitro Fertilization ◽

Embryonic Development ◽

Embryo Transfer ◽

Cpg Island ◽

Reproductive Technologies ◽

Childbearing Age ◽

Vitro Fertilization ◽

The Impact

Abstract Polycystic ovary syndrome (PCOS) is the most common endocrine disorder in women of childbearing age and is the main cause of anovulatory infertility. To increase the number of oocytes obtained, controlled ovarian stimulation (COS) has become a routine choice for in vitro fertilization-embryo transfer (IVF-ET), which is one of the common assisted reproductive technologies for PCOS patients. However, for these patients, there is a high risk of ovarian hyperstimulation syndrome (OHSS). In vitro maturation (IVM) of immature oocytes, followed by in vitro fertilization and embryo transfer of the mature oocytes, offers a possible solution to this problem. Because IVM exposes oocytes to in vitro conditions for a longer period of time, theoretically increasing the risk that the oocytes are affected by the culture environment, further research is needed into gene programming, epigenetics, etc. Exploring the impact of IVM on embryonic development is therefore of great significance for clarifying the safety of assisted reproduction and improving IVM conditions. Here we focused on the DNA methylation reprogramming process, which is essential for embryonic development. We tested the DNA methylation of sperm, IVM oocytes and IVM-generated early-stage embryos, including pronucleus, 4-cell, 8-cell, morula, inner cell mass and trophectoderm (TE) stages, as well as six-week embryos, using the NimbleGen Human DNA Methylation 3x729K CpG Island Plus RefSeq Promoter Array, and compared the data with our published genome-wide DNA methylomes of human gametes and early embryos generated from in vivo matured oocytes. We showed that IVM embryos show an abnormal DNA methylation reprogramming pattern. By analyzing the abnormally reprogrammed promoters, we further found that IVM may affect the functions of demethylation-related genes.
Oocytes from IVM manipulation showed higher DNA methylation levels, and their abnormally methylated promoters were mainly enriched in immune and metabolism pathways. Furthermore, we investigated the DNA methylation of TE, which is directly related to the implantation process, and found that the abnormally methylated promoters were also related to metabolism pathways. Our data support that IVM may influence the DNA methylome of oocytes, which in turn affects the methylome of their embryos. However, due to the limited number of samples and the inability of the chip to cover all CpG sites, the results of this study require further research and validation.

RPS19 and JAK2 Are Not Silenced by DNA Methylation in Diamond-Blackfan Anemia.

Blood ◽

10.1182/blood.v108.11.1312.1312 ◽

2006 ◽

Vol 108 (11) ◽

pp. 1312-1312

Author(s):

Jaroslav Jelinek ◽

Jean-Pierre J. Issa ◽

Rong He ◽

Radek Cmejla ◽

Jana Cmejlova ◽

...

Keyword(s):

Dna Methylation ◽

Tyrosine Kinase ◽

Ribosomal Proteins ◽

Cpg Island ◽

Methylation Frequency ◽

Diamond Blackfan Anemia ◽

Cpg Sites ◽

Causal Therapy ◽

Epigenetic Suppression ◽

Control Samples

Abstract Diamond Blackfan Anemia (DBA) is a congenital disorder characterized by decreased red blood cell production accompanied by developmental abnormalities in 30% of patients. Twenty-five percent of DBA patients display heterozygous mutations of ribosomal protein S19 (RPS19) on chromosome 19q13.2. No mutations were found in genes for other ribosomal proteins of the translation initiation complex. Although a second DBA locus has been proposed in the region 8p23.3–8p22, the precise molecular defect is not known in 75% of DBA patients. The exact mechanism of how RPS19 mutations affect erythropoiesis remains unclear. Haploinsufficiency of RPS19 may hamper the translation machinery important for rapid erythroid differentiation. Reduced gene expression of a cluster of ribosomal proteins including RPS19 in DBA patients was recently reported. No causal therapy for DBA is available, with the exception of bone marrow transplantation. Some DBA patients benefit therapeutically from corticosteroids, cyclosporine A, or metoclopramide. Recently, a long-lasting remission was described in a DBA patient treated with valproic acid, a histone deacetylase inhibitor, suggesting that epigenetic suppression of genes critical for erythropoiesis may be involved in the pathogenesis of DBA. DNA methylation of promoter-associated CpG islands is an epigenetic modification resulting in transcriptional silencing functionally equivalent to a loss-of-function mutation. Constitutive activation of JAK2 tyrosine kinase by a somatic V617F mutation leads to excessive erythropoiesis in polycythemia vera, the clinical opposite of DBA. We hypothesized that silencing by DNA methylation of the promoter-associated CpG islands of the RPS19 or JAK2 genes may play a role in DBA. To test the hypothesis, we analyzed DNA methylation of the RPS19 and JAK2 genes in 14 patients from the Czech DBA Registry.
Genomic DNA isolated from blood cells of 3 DBA patients carrying heterozygous RPS19 mutations, 11 DBA patients without RPS19 mutation and 4 control samples was treated with bisulfite to convert all unmethylated cytosines to uracils, while methylated cytosines were spared from the conversion. A region spanning 13 CpG sites positioned 1–160 bases downstream of the transcription start site (TSS) of the RPS19 gene was PCR amplified and cloned in a sequencing vector. Individual bacterial clones were isolated and PCR inserts were sequenced in 8–12 clones per sample. Bisulfite cloning and sequencing revealed that more than 99% of CpG sites were converted to TpG and thus not methylated, either in DBA samples (only 4/1466 CpG sites were methylated, methylation frequency 0.3%) or in control samples (2/555 CpG sites methylated, methylation frequency 0.4%). To explore the possibility of epigenetic suppression of erythropoietin signaling in DBA, we analyzed DNA methylation of the CpG island of the JAK2 tyrosine kinase gene in the same set of samples. Bisulfite-treated DNA was PCR amplified and T/C polymorphisms corresponding to unmethylated or methylated CpG sites were quantified by pyrosequencing. All DBA and control samples showed the absence of DNA methylation at four CpG sites located 12 to 25 bases downstream of the TSS. We conclude that epigenetic silencing by DNA methylation is not involved in the expression of the ribosomal structural protein RPS19, nor does it affect the expression of JAK2 tyrosine kinase, a transducer of erythropoietin signaling.
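
The methylation frequencies quoted above follow directly from the clone counts. The short sketch below recomputes them from the numbers given in the abstract (4/1466 and 2/555); the function and variable names are illustrative.

```python
def methylation_frequency(methylated_sites, total_sites):
    """Percentage of sequenced CpG sites that remained unconverted (methylated)."""
    return 100.0 * methylated_sites / total_sites


# Counts as reported in the abstract.
dba_percent = methylation_frequency(4, 1466)      # DBA samples
control_percent = methylation_frequency(2, 555)   # control samples

# Share of CpG sites converted to TpG, i.e. unmethylated, in DBA samples.
dba_unmethylated_percent = 100.0 - dba_percent
```

Rounded to one decimal place these give 0.3% and 0.4%, consistent with the statement that more than 99% of CpG sites were unmethylated.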

Genome-Wide DNA Methylation Analysis Shows Enrichment of Differential Methylation in “Open Seas” and Enhancers and Reveals Hypomethylation in DNMT3A Mutated Cytogenetically Normal AML (CN-AML)

Blood ◽

10.1182/blood.v120.21.653.653 ◽

2012 ◽

Vol 120 (21) ◽

pp. 653-653 ◽

Cited By ~ 2

Author(s):

Ying Qu ◽

Andreas Lennartsson ◽

Verena I. Gaidzik ◽

Stefan Deneberg ◽

Sofia Bengtzén ◽

...

Keyword(s):

Gene Expression ◽

Dna Methylation ◽

Cpg Island ◽

Cpg Islands ◽

Differential Methylation ◽

Methylation Analysis ◽

Cpg Sites ◽

Genome Wide ◽

Genomic Regions ◽

Methylation Patterns

Abstract DNA methylation is involved in multiple biologic processes including normal cell differentiation and tumorigenesis. In AML, methylation patterns have been shown to differ significantly from normal hematopoietic cells. Most studies of DNA methylation in AML have previously focused on CpG islands within the promoter of genes, representing only a very small proportion of the DNA methylome. In this study, we performed genome-wide methylation analysis of 62 patients with CN-AML and CD34 positive cells from healthy controls by Illumina HumanMethylation450K Array covering 450,000 CpG sites in CpG islands as well as genomic regions far from CpG islands. Differentially methylated CpG sites (DMS) between CN-AML and normal hematopoietic cells were calculated and the most significant enrichment of DMS was found in regions more than 4kb from CpG islands, in the so-called open sea, where hypomethylation was the dominant form of aberrant methylation. In contrast, CpG islands were not enriched for DMS, and DMS in CpG islands were dominated by hypermethylation. DMS successively further away from CpG islands, in CpG island shores (up to 2kb from a CpG island) and shelves (from 2kb to 4kb from an island), showed an increasing degree of hypomethylation in AML cells. Among regions defined by their relation to gene structures, CpG dinucleotides located in theoretic enhancers were found to be the most enriched for DMS (χ² test, p<0.0001), with the majority of DMS showing decreased methylation compared to CD34 normal controls. To address the relation to gene expression, GEP (gene expression profiling) by microarray was carried out on 32 of the CN-AML patients. In total, 339723 CpG sites covering 18879 genes were addressed on both platforms. CpG methylation in CpG islands showed the most pronounced anti-correlation (Spearman ρ=-0.4145) with gene expression level, followed by CpG island shores (mean Spearman ρ for the shores on both sides, ρ=-0.2350).
As transcription factors (TFs) have been shown to be crucial for AML development, we especially studied differential methylation of an unbiased selection of 1638 TFs. The most enriched differential methylation between CN-AML and normal CD34 positive cells was found in TFs known to be involved in hematopoiesis, with Wilms tumor protein-1 (WT1), activator protein 1 (AP-1) and runt-related transcription factor 1 (RUNX1) being the most differentially methylated TFs. The differential methylation in WT1 and RUNX1 was located in intragenic regions, which was confirmed by pyrosequencing. AML cases were characterized with respect to mutations in FLT3, NPM1, IDH1, IDH2 and DNMT3A. Correlation analysis between genome-wide methylation patterns and mutational status showed a statistically significant association between the presence of DNMT3A mutations and hypomethylation of CpG islands (p<0.0001) and, to a lesser extent, CpG island shores (p<0.001). This links DNMT3A mutations for the first time to a hypomethylated phenotype. Further analyses correlating methylation patterns to other clinical data such as clinical outcome are ongoing. In conclusion, our study revealed that non-CpG island regions and in particular enhancers are the most aberrantly methylated genomic regions in AML and that WT1 and RUNX1 are the most differentially methylated TFs. Furthermore, our data suggest a hypomethylated phenotype in DNMT3A mutated AML. Disclosures: No relevant conflicts of interest to declare.
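
The region categories used in this abstract (island, shore, shelf, open sea) can be expressed as a simple distance rule. The sketch below uses the distance cut-offs stated in the abstract (shores up to 2kb, shelves 2kb to 4kb, open sea beyond 4kb); treating a distance of 0 as "inside the island" and the exact handling of the boundary values are assumptions made for illustration, and the function name is hypothetical.

```python
def classify_cpg(distance_to_island_bp):
    """Classify a CpG site by its distance (in bp) to the nearest CpG island.

    A distance of 0 is taken to mean the site lies inside an island
    (an assumption; real array annotation uses genomic coordinates).
    """
    if distance_to_island_bp < 0:
        raise ValueError("distance must be non-negative")
    if distance_to_island_bp == 0:
        return "island"
    if distance_to_island_bp <= 2000:
        return "shore"      # up to 2kb from the island
    if distance_to_island_bp <= 4000:
        return "shelf"      # from 2kb to 4kb from the island
    return "open sea"       # more than 4kb from any island


# Hypothetical distances, for illustration only.
labels = [classify_cpg(d) for d in (0, 150, 2000, 3500, 12000)]
```

In practice the array manifest already carries this annotation per probe, so a rule like this is mainly useful for sites not covered by the manifest.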

Importance of SNP Dependency Correction and Association Integration for Gene Set Analysis in Genome-Wide Association Studies

Frontiers in Genetics ◽

10.3389/fgene.2021.767358 ◽

2021 ◽

Vol 12 ◽

Author(s):

Michal Marczyk ◽

Agnieszka Macioszek ◽

Joanna Tobiasz ◽

Joanna Polanska ◽

Joanna Zyla

Keyword(s):

Association Studies ◽

Enrichment Analysis ◽

Gene Set Enrichment Analysis ◽

Genome Wide Association ◽

Gene Set Analysis ◽

Genome Wide Association Studies ◽

Gene Set Enrichment ◽

Gene Set ◽

Genome Wide ◽

The Impact

A typical genome-wide association study (GWAS) analyzes millions of single-nucleotide polymorphisms (SNPs), several of which lie in the region of the same gene. To conduct gene set analysis (GSA), information from SNPs needs to be unified at the gene level. A widely used practice is to use only the most relevant SNP per gene; however, other integration methods could be applied here. Also, the problem of nonrandom association of alleles at two or more loci is often neglected. Here, we tested the impact of different integration methods and linkage disequilibrium (LD) correction on the performance of several GSA methods. Matched normal and breast cancer samples from The Cancer Genome Atlas database were used to evaluate the performance of six GSA algorithms: Coincident Extreme Ranks in Numerical Observations (CERNO), Gene Set Enrichment Analysis (GSEA), GSEA-SNP, improved GSEA for GWAS (i-GSEA4GWAS), Meta-Analysis Gene-set Enrichment of variaNT Associations (MAGENTA), and Over-Representation Analysis (ORA). Association of SNPs with phenotype was calculated using a modified McNemar’s test. Results for SNPs mapped to the same gene were integrated using Fisher and Stouffer methods and compared with the minimum p-value method. Four common measures were used to quantify the performance of all combinations of methods. GSA results on GWAS data were compared with those obtained on gene expression data. Comparing all evaluation metrics across different GSA algorithms, integrations, and LD correction, we highlighted CERNO and MAGENTA with Stouffer as the most efficient. Applying LD correction increased prioritization and specificity of enrichment outcomes for all tested algorithms. When Fisher or Stouffer were used with LD correction, sensitivity and reproducibility were also better. Using any integration method was beneficial in comparison with a minimum p-value method in specific combinations.
The correlation between GSA results from genomic and transcriptomic level was the highest when Stouffer integration was combined with LD correction. We thoroughly evaluated different approaches to GSA in GWAS in terms of performance to guide others to select the most effective combinations. We showed that LD correction and Stouffer integration could increase the performance of enrichment analysis and encourage the usage of these techniques.
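
The three SNP-to-gene integration strategies named in this abstract (Fisher, Stouffer, minimum p-value) can be sketched with the standard library alone. This is a minimal illustration of the textbook, unweighted versions, which assume independent p-values strictly between 0 and 1; it does not implement the LD correction the study evaluates, and the function names and example p-values are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_combine(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # sf(x; 2k) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


def stouffer_combine(p_values):
    """Stouffer's method: sum of z-scores divided by sqrt(k)."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1.0 - nd.cdf(z)


def min_p(p_values):
    """Minimum p-value method: keep only the most significant SNP."""
    return min(p_values)


# Hypothetical p-values for SNPs mapped to one gene (illustration only).
snp_p = [0.01, 0.04, 0.20]
gene_level = {
    "fisher": fisher_combine(snp_p),
    "stouffer": stouffer_combine(snp_p),
    "min_p": min_p(snp_p),
}
```

Because SNPs in the same gene are often in LD, the independence assumption behind both combination rules is violated in real GWAS data, which is the motivation for the LD correction discussed in the abstract.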

The impact of distributional assumptions in gene-set and pathway analysis: how far can it go wrong?

10.1101/2021.02.01.429279 ◽

2021 ◽

Author(s):

Chi-Hsuan Ho ◽

Yu-Jyun Huang ◽

Ying-Ju Lai ◽

Rajarshi Mukherjee ◽

Chuhsing Kate Hsiao

Keyword(s):

Implicit Assumption ◽

Expression Data ◽

Systematic Analysis ◽

Gene Set ◽

Tumor Subtypes ◽

Normal Tissues ◽

Gene Sets ◽

Standard Procedures ◽

Active Research ◽

The Impact

Abstract Gene-set analysis (GSA) has been one of the standard procedures for exploring potential biological functions when a group of differentially expressed genes has been derived. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance with a common implicit assumption that the multivariate expression values are normally distributed. The validity of this assumption has been disputed in several studies but no systematic analysis has been carried out to assess the influence of this distributional assumption. Our goal in this study is not to propose a new GSA method but to first examine if the multi-dimensional gene expression data in gene sets follow a multivariate normal distribution (MVN). Six statistical methods in three categories of MVN tests were considered and applied to a total of twenty-two datasets of expression data from studies involving tumor and normal tissues, with ten signaling pathways chosen as the gene sets. Second, we evaluated the influence of non-normality on the performance of current GSA tools, including parametric and non-parametric methods. Specifically, the scenario of mixture distributions representing the case of different tumor subtypes was considered. Our first finding suggests that the MVN assumption should be carefully dealt with. It does not hold true in many applications tested here. The second investigation of the GSA tools demonstrates that the non-normality does affect the performance of these GSA methods, especially when subtypes exist. We conclude that the use of the inherent multivariate normality assumption should be assessed with care in evaluating new GSA tools, since this MVN assumption cannot be guaranteed and this assumption strongly affects the performance of GSA methods.
If a newly proposed GSA method is to be evaluated, we recommend the incorporation of multivariate non-normal distributions or sampling from large databases if available.

GeneSetCluster: a tool for summarizing and integrating gene-set analysis results

BMC Bioinformatics ◽

10.1186/s12859-020-03784-z ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Ewoud Ewing ◽

Nuria Planell-Picola ◽

Maja Jagodic ◽

David Gomez-Cabrero

Keyword(s):

Gene Content ◽

Gene Set Analysis ◽

Gene Set ◽

Overlapping Gene ◽

Analysis Tools ◽

Novel Approach ◽

Gene Sets ◽

Distance Score ◽

Significant Gene ◽

Similar Gene

Abstract Background: Gene-set analysis tools, which make use of curated sets of molecules grouped based on their shared functions, aim to identify which gene-sets are over-represented in the set of features that have been associated with a given trait of interest. Such tools are frequently used in gene-centric approaches derived from RNA-sequencing or microarrays such as Ingenuity or GSEA, but they have also been adapted for interval-based analysis derived from DNA methylation or ChIP/ATAC-sequencing. Gene-set analysis tools return, as a result, a list of significant gene-sets. However, while these results are useful for the researcher in the identification of major biological insights, they may be complex to interpret because many gene-sets have largely overlapping gene contents. Additionally, in many cases the result of gene-set analysis consists of a large number of gene-sets making it complicated to identify the major biological insights. Results: We present GeneSetCluster, a novel approach which allows clustering of identified gene-sets, from one or multiple experiments and/or tools, based on shared genes. GeneSetCluster calculates a distance score based on overlapping gene content, which is then used to cluster them together and as a result, GeneSetCluster identifies groups of gene-sets with similar gene-set definitions (i.e. gene content). These groups of gene-sets can aid the researcher to focus on such groups for biological interpretations. Conclusions: GeneSetCluster is a novel approach for grouping together post gene-set analysis results based on overlapping gene content. GeneSetCluster is implemented as a package in R. The package and the vignette can be downloaded at https://github.com/TranslationalBioinformaticsUnit
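
The idea of a distance score based on overlapping gene content can be illustrated with a small sketch. The abstract does not define the score GeneSetCluster uses, so the Jaccard distance is swapped in here as a stand-in; it is not claimed to be the package's metric, and the function names and gene-set contents are hypothetical.

```python
from itertools import combinations


def jaccard_distance(genes_a, genes_b):
    """1 - |A intersect B| / |A union B|; 0 means identical gene content."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(gene_sets):
    """Distance for every unordered pair of named gene sets."""
    return {
        (name_a, name_b): jaccard_distance(gene_sets[name_a], gene_sets[name_b])
        for name_a, name_b in combinations(sorted(gene_sets), 2)
    }


# Made-up gene sets, for illustration only.
example_sets = {
    "set_1": {"TP53", "MDM2", "CDKN1A"},
    "set_2": {"MDM2", "CDKN1A", "ATM"},
    "set_3": {"OR1A1", "OR2W1"},
}
distances = pairwise_distances(example_sets)
```

A matrix of such distances can then be passed to any standard clustering routine, so that gene-sets sharing most of their genes end up in the same group.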
