scholarly journals GEESE: Metabolically driven latent space learning for gene expression data

2018 ◽  
Author(s):  
Marco Barsacchi ◽  
Helena Andres Terre ◽  
Pietro Lió

AbstractGene expression microarrays provide a characterisation of the transcriptional activity of a particular biological sample. Their high dimensionality hampers the process of pattern recognition and extraction. Several approaches have been proposed for gleaning information about the hidden structure of the data. Among these approaches, deep generative models provide a powerful way for approximating the manifold on which the data reside.Here we develop GEESE, a deep learning based framework that provides novel insight into the manifold learning for gene expression data, employing a metabolic model to constrain the learned representation. We evaluated the proposed framework, showing its ability to capture biologically relevant features, and encoding that features in a much simpler latent space. We showed how using a metabolic model to drive the autoencoder learning process helps in achieving better generalisation to unseen data. GEESE provides a novel perspective on the problem of unsupervised learning for biological data.AvailabilitySource code of GEESE is available athttps://bitbucket.org/mbarsacchi/geese/.

Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 772
Author(s):  
Seonghun Kim ◽  
Seockhun Bae ◽  
Yinhua Piao ◽  
Kyuri Jo

Genomic profiles of cancer patients such as gene expression have become a major source to predict responses to drugs in the era of personalized medicine. As large-scale drug screening data with cancer cell lines are available, a number of computational methods have been developed for drug response prediction. However, few methods incorporate both gene expression data and the biological network, which can harbor essential information about the underlying process of the drug response. We proposed an analysis framework called DrugGCN for prediction of Drug response using a Graph Convolutional Network (GCN). DrugGCN first generates a gene graph by combining a Protein-Protein Interaction (PPI) network and gene expression data with feature selection of drug-related genes, and the GCN model detects the local features such as subnetworks of genes that contribute to the drug response by localized filtering. We demonstrated the effectiveness of DrugGCN using biological data showing its high prediction accuracy among the competing methods.


Author(s):  
Soumya Raychaudhuri

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since there are a myriad of genes examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. There is an enormous complexity to the data that we are examining—each gene is associated with dozens if not hundreds of expression values as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature- based approaches could help in the analysis of single condition experiments. Here we will apply the strategies introduced in Chapter 6 to assess the coherence of groups of genes to enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence. We apply functional coherence measures to gene expression analysis; for examples we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to focus in on the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Gregory P. Way ◽  
Michael Zietz ◽  
Vincent Rubinetti ◽  
Daniel S. Himmelstein ◽  
Casey S. Greene

Genes ◽  
2019 ◽  
Vol 10 (11) ◽  
pp. 931 ◽  
Author(s):  
Mok ◽  
Kim ◽  
Lee ◽  
Choi ◽  
Lee ◽  
...  

Although there have been several analyses for identifying cancer-associated pathways, based on gene expression data, most of these are based on single pathway analyses, and thus do not consider correlations between pathways. In this paper, we propose a hierarchical structural component model for pathway analysis of gene expression data (HisCoM-PAGE), which accounts for the hierarchical structure of genes and pathways, as well as the correlations among pathways. Specifically, HisCoM-PAGE focuses on the survival phenotype and identifies its associated pathways. Moreover, its application to real biological data analysis of pancreatic cancer data demonstrated that HisCoM-PAGE could successfully identify pathways associated with pancreatic cancer prognosis. Simulation studies comparing the performance of HisCoM-PAGE with other competing methods such as Gene Set Enrichment Analysis (GSEA), Global Test, and Wald-type Test showed HisCoM-PAGE to have the highest power to detect causal pathways in most simulation scenarios.


2015 ◽  
Author(s):  
Andrew Anand Brown ◽  
Zhihao Ding ◽  
Ana Viñuela ◽  
Dan Glass ◽  
Leopold Parts ◽  
...  

Statistical factor analysis methods have previously been used to remove noise components from high dimensional data prior to genetic association mapping, and in a guided fashion to summarise biologically relevant sources of variation. Here we show how the derived factors summarising pathway expression can be used to analyse the relationships between expression, heritability and ageing. We used skin gene expression data from 647 twins from the MuTHER Consortium and applied factor analysis to concisely summarise patterns of gene expression, both to remove broad confounding influences and to produce concise pathway-level phenotypes. We derived 930 "pathway phenotypes" which summarised patterns of variation across 186 KEGG pathways (five phenotypes per pathway). We identified 69 significant associations of age with phenotype from 57 distinct KEGG pathways at a stringent Bonferroni threshold (P<5.38E-5). These phenotypes are more heritable (h^2=0.32) than gene expression levels. On average, expression levels of 16% of genes within these pathways are associated with age. Several significant pathways relate to metabolising sugars and fatty acids, others with insulin signalling. We have demonstrated that factor analysis methods combined with biological knowledge can produce more reliable phenotypes with less stochastic noise than the individual gene expression levels, which increases our power to discover biologically relevant associations. These phenotypes could also be applied to discover associations with other environmental factors.


Author(s):  
Erliang Zeng ◽  
Chengyong Yang ◽  
Tao Li ◽  
Giri Narasimhan

Clustering of gene expression data is a standard exploratory technique used to identify closely related genes. Many other sources of data are also likely to be of great assistance in the analysis of gene expression data. This data provides a mean to begin elucidating the large-scale modular organization of the cell. The authors consider the challenging task of developing exploratory analytical techniques to deal with multiple complete and incomplete information sources. The Multi-Source Clustering (MSC) algorithm developed performs clustering with multiple, but complete, sources of data. To deal with incomplete data sources, the authors adopted the MPCK-means clustering algorithms to perform exploratory analysis on one complete source and other potentially incomplete sources provided in the form of constraints. This paper presents a new clustering algorithm MSC to perform exploratory analysis using two or more diverse but complete data sources, studies the effectiveness of constraints sets and robustness of the constrained clustering algorithm using multiple sources of incomplete biological data, and incorporates such incomplete data into constrained clustering algorithm in form of constraints sets.


2021 ◽  
Author(s):  
Hengshi Yu ◽  
Joshua D. Welch

AbstractDeep generative models, including variational autoencoders (VAEs) and generative adversarial networks (GANs), have achieved remarkable successes in generating and manipulating highdimensional images. VAEs excel at learning disentangled image representations, while GANs excel at generating realistic images. Here, we systematically assess disentanglement and generation performance on single-cell gene expression data and find that these strengths and weaknesses of VAEs and GANs apply to single-cell gene expression data in a similar way. We also develop MichiGAN1, a novel neural network that combines the strengths of VAEs and GANs to sample from disentangled representations without sacrificing data generation quality. We learn disentangled representations of two large singlecell RNA-seq datasets [13, 68] and use MichiGAN to sample from these representations. MichiGAN allows us to manipulate semantically distinct aspects of cellular identity and predict single-cell gene expression response to drug treatment.


Author(s):  
Miao Wang ◽  
Xuequn Shang ◽  
Shaohua Zhang ◽  
Zhanhuai Li

DNA microarray technology has generated a large number of gene expression data. Biclustering is a methodology allowing for condition set and gene set points clustering simultaneously. It finds clusters of genes possessing similar characteristics together with biological conditions creating these similarities. Almost all the current biclustering algorithms find bicluster in one microarray dataset. In order to reduce the noise influence and find more biological biclusters, the authors propose the FDCluster algorithm in order to mine frequent closed discriminative bicluster in multiple microarray datasets. FDCluster uses Apriori property and several novel techniques for pruning to mine biclusters efficiently. To increase the space usage, FDCluster also utilizes several techniques to generate frequent closed bicluster without candidate maintenance in memory. The experimental results show that FDCluster is more effective than traditional methods in either single micorarray dataset or multiple microarray datasets. This paper tests the biological significance using GO to show the proposed method is able to produce biologically relevant biclusters.


Sign in / Sign up

Export Citation Format

Share Document