GEESE: Metabolically driven latent space learning for gene expression data

Mapping Intimacies ◽

10.1101/365643 ◽

2018 ◽

Cited By ~ 1

Author(s):

Marco Barsacchi ◽

Helena Andres Terre ◽

Pietro Lió

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Metabolic Model ◽

Generative Models ◽

Biological Data ◽

Expression Data ◽

Biologically Relevant ◽

Latent Space ◽

Unseen Data ◽

Expression Microarrays

Gene expression microarrays provide a characterisation of the transcriptional activity of a particular biological sample. Their high dimensionality hampers the process of pattern recognition and extraction. Several approaches have been proposed for gleaning information about the hidden structure of the data. Among these approaches, deep generative models provide a powerful way of approximating the manifold on which the data reside. Here we develop GEESE, a deep learning based framework that provides novel insight into manifold learning for gene expression data, employing a metabolic model to constrain the learned representation. We evaluated the proposed framework, showing its ability to capture biologically relevant features and to encode those features in a much simpler latent space. We showed how using a metabolic model to drive the autoencoder learning process helps in achieving better generalisation to unseen data. GEESE provides a novel perspective on the problem of unsupervised learning for biological data. Availability: Source code of GEESE is available at https://bitbucket.org/mbarsacchi/geese/.

Graph Convolutional Network for Drug Response Prediction Using Gene Expression Data

Mathematics ◽

10.3390/math9070772 ◽

2021 ◽

Vol 9 (7) ◽

pp. 772

Author(s):

Seonghun Kim ◽

Seockhun Bae ◽

Yinhua Piao ◽

Kyuri Jo

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Drug Response ◽

Response Prediction ◽

Biological Data ◽

Expression Data ◽

Convolutional Network ◽

Essential Information ◽

Protein Protein Interaction

Genomic profiles of cancer patients, such as gene expression, have become a major source for predicting responses to drugs in the era of personalized medicine. As large-scale drug screening data with cancer cell lines are available, a number of computational methods have been developed for drug response prediction. However, few methods incorporate both gene expression data and the biological network, which can harbor essential information about the underlying process of the drug response. We propose an analysis framework called DrugGCN for the prediction of Drug response using a Graph Convolutional Network (GCN). DrugGCN first generates a gene graph by combining a Protein-Protein Interaction (PPI) network and gene expression data with feature selection of drug-related genes; the GCN model then detects local features, such as subnetworks of genes that contribute to the drug response, by localized filtering. We demonstrate the effectiveness of DrugGCN using biological data, showing that it achieves high prediction accuracy among the competing methods.
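
The "localized filtering" idea in the abstract above can be illustrated with a single graph-propagation step. The sketch below is not DrugGCN's implementation: it assumes the common first-order, symmetrically normalised aggregation with self-loops and omits learnable weights, and the four-gene chain graph and expression values are hypothetical toy data.

```python
import math


def graph_conv(adjacency, features):
    """One propagation step over a gene graph (assumed first-order form).

    out[i] = sum_j (A + I)[i][j] / sqrt(d_i * d_j) * features[j],
    where d_i is the degree of node i after adding a self-loop.
    """
    n = len(adjacency)
    # Add self-loops so each gene keeps part of its own signal.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    deg = [sum(row) for row in a_hat]
    return [
        sum(a_hat[i][j] / math.sqrt(deg[i] * deg[j]) * features[j] for j in range(n))
        for i in range(n)
    ]


# Hypothetical gene graph: a chain 0 - 1 - 2 - 3 (e.g. edges taken from a PPI network).
adjacency = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Hypothetical expression signal: only gene 0 is active.
expression = [1.0, 0.0, 0.0, 0.0]

filtered = graph_conv(adjacency, expression)
print(filtered)
```

After one step the signal from gene 0 reaches only gene 0 and its direct neighbour, gene 1; genes 2 and 3 stay at zero. This locality is what lets stacked layers pick out small subnetworks of genes.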

Analyzing Large Gene Expression Data Sets

Computational Text Analysis ◽

10.1093/oso/9780198567400.003.0014 ◽

2006 ◽

Author(s):

Soumya Raychaudhuri

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Analysis ◽

Gene Expression Analysis ◽

Data Sets ◽

Expression Data ◽

Clustering Methods ◽

Biologically Relevant ◽

Large Gene ◽

Functional Coherence

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since there are a myriad of genes examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. There is an enormous complexity to the data that we are examining: each gene is associated with dozens if not hundreds of expression values as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature-based approaches could help in the analysis of single condition experiments. Here we will apply the strategies introduced in Chapter 6 to assess the coherence of groups of genes to enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence.
We apply functional coherence measures to gene expression analysis; as an example, we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to focus in on the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.

Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations

Genome Biology ◽

10.1186/s13059-020-02021-3 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 6

Author(s):

Gregory P. Way ◽

Michael Zietz ◽

Vincent Rubinetti ◽

Daniel S. Himmelstein ◽

Casey S. Greene

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Data ◽

Latent Space

A Methodology for Biologically Relevant Pattern Discovery from Gene Expression Data

Discovery Science - Lecture Notes in Computer Science ◽

10.1007/978-3-540-30214-8_18 ◽

2004 ◽

pp. 230-241 ◽

Cited By ~ 16

Author(s):

Ruggero G. Pensa ◽

Jérémy Besson ◽

Jean-François Boulicaut

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Pattern Discovery ◽

Expression Data ◽

Biologically Relevant ◽

Relevant Pattern

HisCoM-PAGE: Hierarchical Structural Component Models for Pathway Analysis of Gene Expression Data

Genes ◽

10.3390/genes10110931 ◽

2019 ◽

Vol 10 (11) ◽

pp. 931 ◽

Cited By ~ 4

Author(s):

Mok ◽

Kim ◽

Lee ◽

Choi ◽

Lee ◽

...

Keyword(s):

Gene Expression ◽

Pancreatic Cancer ◽

Gene Expression Data ◽

Pathway Analysis ◽

Structural Component ◽

Biological Data ◽

Gene Set Enrichment Analysis ◽

Expression Data ◽

Global Test ◽

Causal Pathways

Although there have been several analyses for identifying cancer-associated pathways, based on gene expression data, most of these are based on single pathway analyses, and thus do not consider correlations between pathways. In this paper, we propose a hierarchical structural component model for pathway analysis of gene expression data (HisCoM-PAGE), which accounts for the hierarchical structure of genes and pathways, as well as the correlations among pathways. Specifically, HisCoM-PAGE focuses on the survival phenotype and identifies its associated pathways. Moreover, its application to real biological data analysis of pancreatic cancer data demonstrated that HisCoM-PAGE could successfully identify pathways associated with pancreatic cancer prognosis. Simulation studies comparing the performance of HisCoM-PAGE with other competing methods such as Gene Set Enrichment Analysis (GSEA), Global Test, and Wald-type Test showed HisCoM-PAGE to have the highest power to detect causal pathways in most simulation scenarios.

Pathway based factor analysis of gene expression data produces highly heritable phenotypes that associate with age

10.1101/016154 ◽

2015 ◽

Cited By ~ 1

Author(s):

Andrew Anand Brown ◽

Zhihao Ding ◽

Ana Viñuela ◽

Dan Glass ◽

Leopold Parts ◽

...

Keyword(s):

Gene Expression ◽

Factor Analysis ◽

Gene Expression Data ◽

Biological Knowledge ◽

Expression Data ◽

Expression Levels ◽

Biologically Relevant ◽

Kegg Pathways ◽

Analysis Methods ◽

Gene Expression Levels

Statistical factor analysis methods have previously been used to remove noise components from high dimensional data prior to genetic association mapping, and in a guided fashion to summarise biologically relevant sources of variation. Here we show how the derived factors summarising pathway expression can be used to analyse the relationships between expression, heritability and ageing. We used skin gene expression data from 647 twins from the MuTHER Consortium and applied factor analysis to concisely summarise patterns of gene expression, both to remove broad confounding influences and to produce concise pathway-level phenotypes. We derived 930 "pathway phenotypes" which summarised patterns of variation across 186 KEGG pathways (five phenotypes per pathway). We identified 69 significant associations of age with phenotype from 57 distinct KEGG pathways at a stringent Bonferroni threshold (P<5.38E-5). These phenotypes are more heritable (h^2=0.32) than gene expression levels. On average, expression levels of 16% of genes within these pathways are associated with age. Several significant pathways relate to metabolising sugars and fatty acids, others with insulin signalling. We have demonstrated that factor analysis methods combined with biological knowledge can produce more reliable phenotypes with less stochastic noise than the individual gene expression levels, which increases our power to discover biologically relevant associations. These phenotypes could also be applied to discover associations with other environmental factors.
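
The counts and threshold quoted in the abstract above are consistent with a standard Bonferroni correction. The short check below uses the figures given in the abstract; the family-wise significance level of 0.05 is an assumption, since the abstract does not state it.

```python
# Figures quoted in the abstract.
n_pathways = 186
phenotypes_per_pathway = 5

# Assumed family-wise error rate; not stated in the abstract.
alpha = 0.05

# Total number of pathway phenotypes tested against age.
n_tests = n_pathways * phenotypes_per_pathway

# Bonferroni correction: divide alpha by the number of tests.
bonferroni_threshold = alpha / n_tests

print(n_tests)                        # 930
print(f"{bonferroni_threshold:.2E}")  # 5.38E-05
```

Under that assumption, 0.05 / 930 reproduces the reported threshold of P<5.38E-5.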

Clustering Genes Using Heterogeneous Data Sources

International Journal of Knowledge Discovery in Bioinformatics ◽

10.4018/jkdb.2010040102 ◽

2010 ◽

Vol 1 (2) ◽

pp. 12-28 ◽

Cited By ~ 3

Author(s):

Erliang Zeng ◽

Chengyong Yang ◽

Tao Li ◽

Giri Narasimhan

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Incomplete Data ◽

Clustering Algorithm ◽

Biological Data ◽

Exploratory Analysis ◽

Data Sources ◽

Modular Organization ◽

Constrained Clustering ◽

Expression Data

Clustering of gene expression data is a standard exploratory technique used to identify closely related genes. Many other sources of data are also likely to be of great assistance in the analysis of gene expression data. These data provide a means to begin elucidating the large-scale modular organization of the cell. The authors consider the challenging task of developing exploratory analytical techniques to deal with multiple complete and incomplete information sources. The Multi-Source Clustering (MSC) algorithm performs clustering with multiple, but complete, sources of data. To deal with incomplete data sources, the authors adopted the MPCK-means clustering algorithm to perform exploratory analysis on one complete source and other potentially incomplete sources provided in the form of constraints. This paper presents a new clustering algorithm, MSC, to perform exploratory analysis using two or more diverse but complete data sources; studies the effectiveness of constraint sets and the robustness of the constrained clustering algorithm using multiple sources of incomplete biological data; and incorporates such incomplete data into the constrained clustering algorithm in the form of constraint sets.

Sampling from Disentangled Representations of Single-Cell Data Using Generative Adversarial Networks

10.1101/2021.01.15.426872 ◽

2021 ◽

Author(s):

Hengshi Yu ◽

Joshua D. Welch

Keyword(s):

Gene Expression ◽

Single Cell ◽

Gene Expression Data ◽

Generative Models ◽

Generative Adversarial Networks ◽

Expression Data ◽

Gene Expression Response ◽

Adversarial Networks ◽

Cell Gene Expression ◽

Cell Gene

Deep generative models, including variational autoencoders (VAEs) and generative adversarial networks (GANs), have achieved remarkable successes in generating and manipulating high-dimensional images. VAEs excel at learning disentangled image representations, while GANs excel at generating realistic images. Here, we systematically assess disentanglement and generation performance on single-cell gene expression data and find that these strengths and weaknesses of VAEs and GANs apply to single-cell gene expression data in a similar way. We also develop MichiGAN, a novel neural network that combines the strengths of VAEs and GANs to sample from disentangled representations without sacrificing data generation quality. We learn disentangled representations of two large single-cell RNA-seq datasets [13, 68] and use MichiGAN to sample from these representations. MichiGAN allows us to manipulate semantically distinct aspects of cellular identity and predict single-cell gene expression response to drug treatment.

Efficient Mining Frequent Closed Discriminative Biclusters by Sample-Growth

Computational Knowledge Discovery for Bioinformatics Research ◽

10.4018/978-1-4666-1785-8.ch006 ◽

2013 ◽

pp. 84-103

Author(s):

Miao Wang ◽

Xuequn Shang ◽

Shaohua Zhang ◽

Zhanhuai Li

Keyword(s):

Gene Expression ◽

Dna Microarray ◽

Gene Expression Data ◽

Biological Significance ◽

Microarray Dataset ◽

Experimental Results ◽

Expression Data ◽

Biologically Relevant ◽

Almost All ◽

Microarray Datasets

DNA microarray technology has generated a large amount of gene expression data. Biclustering is a methodology that clusters condition sets and gene sets simultaneously. It finds clusters of genes possessing similar characteristics together with the biological conditions that create these similarities. Almost all current biclustering algorithms find biclusters in a single microarray dataset. To reduce the influence of noise and find more biologically meaningful biclusters, the authors propose the FDCluster algorithm, which mines frequent closed discriminative biclusters in multiple microarray datasets. FDCluster uses the Apriori property and several novel pruning techniques to mine biclusters efficiently. To improve space efficiency, FDCluster also uses several techniques to generate frequent closed biclusters without maintaining candidates in memory. The experimental results show that FDCluster is more effective than traditional methods on both single and multiple microarray datasets. This paper tests biological significance using GO to show that the proposed method is able to produce biologically relevant biclusters.

Multimodal probabilistic generative models for time-course gene expression data and Gene Ontology (GO) tags

Mathematical Biosciences ◽

10.1016/j.mbs.2015.08.007 ◽

2015 ◽

Vol 268 ◽

pp. 80-91 ◽

Cited By ~ 1

Author(s):

Prasad Gabbur ◽

James Hoying ◽

Kobus Barnard

Keyword(s):

Gene Expression ◽

Gene Ontology ◽

Gene Expression Data ◽

Time Course ◽

Generative Models ◽

Expression Data
