Infer related genes from large scale gene expression dataset with embedding

Mapping Intimacies ◽

10.1101/362848 ◽

2018 ◽

Author(s):

Chi Tung Choy ◽

Chi Hang Wong ◽

Stephen Lam Chan

Keyword(s):

Gene Expression ◽

Large Scale ◽

Gene List ◽

Ground Truth ◽

Relevant Information ◽

Molecular Data ◽

Biological Data ◽

Gene Expression Dataset ◽

Biologically Relevant ◽

Unsupervised Data Mining

Abstract: Artificial neural networks (ANNs) have been used for classification and prediction tasks with remarkable accuracy. However, their use for unsupervised data mining of molecular data is under-explored. We adopted an unsupervised ANN method, namely word embedding, to extract biologically relevant information from the TCGA gene expression dataset. Ground-truth relationships, such as the cancer types of the input samples and the semantic meaning of genes, were shown to be retained in the resulting entity matrices. We also demonstrated the interpretability and usage of these matrices in shortlisting candidates from a long gene list. This method is feasible for mining large volumes of biological data and would be a valuable tool for discovering novel knowledge from omics data. The resulting embedding matrices mined from TCGA gene expression data are interactively explorable online (http://bit.ly/tcga-embedding-cancer) and could serve as an informative reference.
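
Illustrative sketch (not taken from the paper): one plausible way to use an embedding matrix for shortlisting is to rank candidate genes by cosine similarity to a query gene. The gene names, the three-dimensional vectors, and the helper names `cosine` and `shortlist` are all hypothetical; the authors' actual procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shortlist(query, candidates, embedding, top_k=2):
    """Return the top_k candidates most similar to the query gene."""
    scored = [
        (cosine(embedding[query], embedding[gene]), gene)
        for gene in candidates
        if gene != query
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gene for _, gene in scored[:top_k]]


# Toy embedding: invented values, for illustration only.
toy_embedding = {
    "GENE_A": [0.9, 0.1, 0.0],
    "GENE_B": [0.8, 0.2, 0.1],
    "GENE_C": [0.0, 0.1, 0.9],
    "GENE_D": [0.1, 0.9, 0.1],
}

ranked = shortlist("GENE_A", list(toy_embedding), toy_embedding, top_k=2)
print(ranked)
```

Cosine similarity ignores vector length, so genes are compared by the direction of their embedding only; whether that is the right notion of relatedness for a given embedding is an assumption to check.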

Graph Convolutional Network for Drug Response Prediction Using Gene Expression Data

Mathematics ◽

10.3390/math9070772 ◽

2021 ◽

Vol 9 (7) ◽

pp. 772

Author(s):

Seonghun Kim ◽

Seockhun Bae ◽

Yinhua Piao ◽

Kyuri Jo

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Large Scale ◽

Drug Response ◽

Response Prediction ◽

Biological Data ◽

Expression Data ◽

Convolutional Network ◽

Essential Information ◽

Protein Protein Interaction

Genomic profiles of cancer patients such as gene expression have become a major source to predict responses to drugs in the era of personalized medicine. As large-scale drug screening data with cancer cell lines are available, a number of computational methods have been developed for drug response prediction. However, few methods incorporate both gene expression data and the biological network, which can harbor essential information about the underlying process of the drug response. We proposed an analysis framework called DrugGCN for prediction of Drug response using a Graph Convolutional Network (GCN). DrugGCN first generates a gene graph by combining a Protein-Protein Interaction (PPI) network and gene expression data with feature selection of drug-related genes, and the GCN model detects the local features such as subnetworks of genes that contribute to the drug response by localized filtering. We demonstrated the effectiveness of DrugGCN using biological data showing its high prediction accuracy among the competing methods.
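
Illustrative sketch (not taken from the paper): the idea of propagating gene expression over a gene graph can be shown with a single graph convolution step. The version below uses the symmetric-normalized adjacency with self-loops, ReLU(D^{-1/2}(A + I)D^{-1/2} X W), which is one common GCN formulation and is not necessarily the localized filter used by DrugGCN. The three-gene path graph, the expression values, and the weight matrix are invented.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_hat = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # Self-loops guarantee every degree is at least 1.
    deg = [sum(row) for row in a_hat]
    return [
        [a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(normalized_adjacency(adj), features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


# Hypothetical gene graph: gene 0 - gene 1 - gene 2, one feature per gene.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
expression = [[1.0], [2.0], [3.0]]
weights = [[1.0]]  # untrained identity weight, for illustration only

out = gcn_layer(adj, expression, weights)
print(out)
```

Each output row mixes a gene's own expression with that of its graph neighbours, which is the sense in which such a layer picks up local subnetwork features.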

Neural system-enriched gene expression: relationship to biological pathways and neurological diseases

Physiological Genomics ◽

10.1152/physiolgenomics.00220.2003 ◽

2004 ◽

Vol 18 (2) ◽

pp. 167-183 ◽

Cited By ~ 12

Author(s):

Jianhua Zhang ◽

Amy Moseley ◽

Anil G. Jegga ◽

Ashima Gupta ◽

David P. Witte ◽

...

Keyword(s):

Gene Expression ◽

Nervous System ◽

Intracellular Signaling ◽

Large Scale ◽

Expression Profiles ◽

Gene List ◽

Structure And Function ◽

System Structure ◽

Psychiatric Disease ◽

And Function

To understand the commitment of the genome to nervous system differentiation and function, we sought to compare nervous system gene expression to that of a wide variety of other tissues by gene expression database construction and mining. Gene expression profiles of 10 different adult nervous tissues were compared with that of 72 other tissues. Using ANOVA, we identified 1,361 genes whose expression was higher in the nervous system than other organs and, separately, 600 genes whose expression was at least threefold higher in one or more regions of the nervous system compared with their median expression across all organs. Of the 600 genes, 381 overlapped with the 1,361-gene list. Limited in situ gene expression analysis confirmed that identified genes did represent nervous system-enriched gene expression, and we therefore sought to evaluate the validity and significance of these top-ranked nervous system genes using known gene literature and gene ontology categorization criteria. Diverse functional categories were present in the 381 genes, including genes involved in intracellular signaling, cytoskeleton structure and function, enzymes, RNA metabolism and transcription, membrane proteins, as well as cell differentiation, death, proliferation, and division. We searched existing public sites and identified 110 known genes related to mental retardation, neurological disease, and neurodegeneration. Twenty-one of the 381 genes were within the 110-gene list, compared with a random expectation of 5. This suggests that the 381 genes provide a candidate set for further analyses in neurological and psychiatric disease studies and that as a field, we are as yet, far from a large-scale understanding of the genes that are critical for nervous system structure and function. Together, our data indicate the power of profiling an individual biologic system in a multisystem context to gain insight into the genomic basis of its structure and function.
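
Illustrative sketch (not taken from the paper): the comparison of 21 observed overlapping genes against a random expectation of 5 can be framed as a hypergeometric calculation. The abstract does not state the background gene count, so `BACKGROUND = 8382` below is a hypothetical value chosen only because it makes 381 x 110 / BACKGROUND equal 5; the function names are likewise illustrative.

```python
from math import comb


def expected_overlap(n_selected, n_annotated, n_background):
    """Mean overlap when n_selected genes are drawn at random."""
    return n_selected * n_annotated / n_background


def hypergeom_tail(k_observed, n_selected, n_annotated, n_background):
    """P(overlap >= k_observed) under random sampling without replacement."""
    total = comb(n_background, n_selected)
    upper = min(n_selected, n_annotated)
    tail = 0.0
    for k in range(k_observed, upper + 1):
        ways = comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        tail += ways / total  # exact integers, converted to float on division
    return tail


# ASSUMPTION: background size is not given in the abstract; this value is
# back-computed so that the expectation comes out at 5.
BACKGROUND = 8382

mean = expected_overlap(381, 110, BACKGROUND)
p_value = hypergeom_tail(21, 381, 110, BACKGROUND)
print(mean, p_value)
```

With a different background size both the expectation and the tail probability change, so the numbers printed here illustrate the calculation rather than reproduce the paper's statistics.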

Co-expression networks for plant biology: why and how

Acta Biochimica et Biophysica Sinica ◽

10.1093/abbs/gmz080 ◽

2019 ◽

Vol 51 (10) ◽

pp. 981-988 ◽

Cited By ~ 6

Author(s):

Xiaolan Rao ◽

Richard A Dixon

Keyword(s):

Gene Expression ◽

Network Analysis ◽

Expression Analysis ◽

Expression Profiling ◽

Recent Literature ◽

Genomic Data ◽

Relevant Information ◽

Plant Biology ◽

Biologically Relevant

Abstract: Co-expression network analysis is one of the most powerful approaches for interpretation of large transcriptomic datasets. It enables characterization of modules of co-expressed genes that may share biological functional linkages. Such networks provide an initial way to explore functional associations from gene expression profiling and can be applied to various aspects of plant biology. This review presents the applications of co-expression network analysis in plant biology and addresses optimized strategies from the recent literature for performing co-expression analysis on plant biological systems. Additionally, we describe the combined interpretation of co-expression analysis with other genomic data to enhance the generation of biologically relevant information.
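
Illustrative sketch (not taken from the review): a very simple co-expression network links two genes when the absolute Pearson correlation of their expression profiles reaches a threshold, and then reads off modules as connected components. The hard threshold of 0.9, the unsigned correlation, the toy profiles, and the function names are all assumptions; the strategies discussed in the review are more elaborate.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_modules(profiles, threshold=0.9):
    """Group genes into connected components of the thresholded network."""
    neighbors = {gene: set() for gene in profiles}
    for g1, g2 in combinations(profiles, 2):
        if abs(pearson(profiles[g1], profiles[g2])) >= threshold:
            neighbors[g1].add(g2)
            neighbors[g2].add(g1)

    modules, seen = [], set()
    for gene in profiles:
        if gene in seen:
            continue
        # Depth-first search collects everything reachable from this gene.
        stack, module = [gene], set()
        while stack:
            current = stack.pop()
            if current in module:
                continue
            module.add(current)
            stack.extend(neighbors[current] - module)
        seen |= module
        modules.append(sorted(module))
    return modules


# Invented expression profiles across four samples.
toy_profiles = {
    "GENE_1": [1.0, 2.0, 3.0, 4.0],
    "GENE_2": [2.0, 4.0, 6.0, 8.5],
    "GENE_3": [4.0, 1.0, 3.0, 2.0],
    "GENE_4": [3.0, 0.0, 2.0, 1.2],
}

modules = coexpression_modules(toy_profiles, threshold=0.9)
print(modules)
```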

PolySTest: Robust statistical testing of proteomics data with missing values improves detection of biologically relevant features

10.1101/765818 ◽

2019 ◽

Cited By ~ 1

Author(s):

Veit Schwämmle ◽

Christina E Hagensen ◽

Adelina Rogowska-Wrzesinska ◽

Ole N. Jensen

Keyword(s):

Mass Spectrometry ◽

Large Scale ◽

Missing Values ◽

Statistical Tests ◽

Ground Truth ◽

Statistical Testing ◽

Molecular Networks ◽

Proteomics Data ◽

Biologically Relevant ◽

Data Browsing

Abstract: Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment intrinsic data structures and variations and often also reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicate the decision of choosing appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss Test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss Test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10%-20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss Test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.

Applications of community detection algorithms to large biological datasets

10.1101/547570 ◽

2019 ◽

Cited By ~ 1

Author(s):

Itamar Kanter ◽

Gur Yaari ◽

Tomer Kalisky

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Community Detection ◽

Large Scale ◽

Heuristic Algorithms ◽

Relevant Information ◽

Biological Data ◽

Sequence Information ◽

Single Experiment ◽

Or Genes

Abstract: Recent advances in data acquisition technologies in biology have led to major challenges in mining relevant information from large datasets. For example, single-cell RNA sequencing technologies are producing expression and sequence information from tens of thousands of cells in every single experiment. A common task in analyzing biological data is to cluster samples or features (e.g. genes) into groups sharing common characteristics. This is an NP-hard problem for which numerous heuristic algorithms have been developed. However, in many cases, the clusters created by these algorithms do not reflect biological reality. To overcome this, a Networks Based Clustering (NBC) approach was recently proposed, by which the samples or genes in the dataset are first mapped to a network and then community detection (CD) algorithms are used to identify clusters of nodes. Here, we created an open and flexible Python-based toolkit for NBC that enables easy and accessible network construction and community detection. We then tested the applicability of NBC for identifying clusters of cells or genes from previously published large-scale single-cell and bulk RNA-seq datasets. We show that NBC can be used to accurately and efficiently analyze large-scale datasets of RNA sequencing experiments.
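
Illustrative sketch (not the authors' toolkit): the two NBC stages, mapping samples to a network and then detecting communities, can be shown with a k-nearest-neighbour graph followed by a naive greedy modularity merge. The unoptimized merging loop, the one-dimensional toy points, the choice k = 2, and the function names are all assumptions made for illustration; the paper's toolkit and its choice of CD algorithms may differ.

```python
from itertools import combinations


def knn_graph(points, k):
    """Undirected k-nearest-neighbour graph as a set of frozenset edges."""
    edges = set()
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        for _, j in dists[:k]:
            edges.add(frozenset((i, j)))
    return edges


def greedy_modularity(n_nodes, edges):
    """Merge communities while some merge increases modularity.

    Assumes the graph has at least one edge.
    """
    two_m = 2 * len(edges)
    communities = {i: {i} for i in range(n_nodes)}
    while True:
        best_gain, best_pair = 0.0, None
        for ci, cj in combinations(sorted(communities), 2):
            a, b = communities[ci], communities[cj]
            between = sum(1 for e in edges if len(e & a) == 1 and len(e & b) == 1)
            deg_a = sum(len(e & a) for e in edges)
            deg_b = sum(len(e & b) for e in edges)
            # Modularity change when merging communities a and b.
            gain = 2.0 * (between / two_m - deg_a * deg_b / two_m ** 2)
            if gain > best_gain:
                best_gain, best_pair = gain, (ci, cj)
        if best_pair is None:
            break
        ci, cj = best_pair
        communities[ci] |= communities.pop(cj)
    return sorted(sorted(c) for c in communities.values())


# Invented samples: two loose groups on a line, joined by one bridge edge.
points = [(0.0,), (1.0,), (2.2,), (4.3,), (5.5,), (6.5,)]
edges = knn_graph(points, k=2)
clusters = greedy_modularity(len(points), edges)
print(clusters)
```

The k-nearest-neighbour graph here is connected, so the split into two clusters comes from the modularity criterion rather than from disconnected components.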

A Self-Attention Model for Inferring Cooperativity between Regulatory Features

10.1101/2020.01.31.927996 ◽

2020 ◽

Author(s):

Fahad Ullah ◽

Asa Ben-Hur

Keyword(s):

Gene Expression ◽

Simulated Data ◽

Relevant Information ◽

Regulatory Elements ◽

Dnase I ◽

Attention Mechanism ◽

Feature Interaction ◽

Biologically Relevant ◽

Attention Model ◽

Biological Phenomena

Abstract. Motivation: Deep learning has demonstrated its predictive power in modeling complex biological phenomena such as gene expression. The value of these models hinges not only on their accuracy, but also on the ability to extract biologically relevant information from the trained models. While there has been much recent work on developing feature attribution methods that discover the most important features for a given sequence, inferring cooperativity between regulatory elements, which is the hallmark of phenomena such as gene expression, remains an open problem.

Results: We present SATORI, a Self-ATtentiOn based model to predict Regulatory element Interactions. Our approach combines convolutional and recurrent layers with a self-attention mechanism that helps us capture a global view of the landscape of interactions between regulatory elements in a sequence. We evaluate our method on simulated data and three complex datasets: human TAL1-GATA1 transcription factor ChIP-Seq, DNase I Hypersensitive Sites (DHSs) in human promoters across 164 cell lines, and genome-wide DNase I-Seq and ATAC-Seq peaks across 36 Arabidopsis samples. In each of the three experiments, SATORI identified numerous statistically significant TF-TF interactions, many of which have been previously reported. Our method is able to detect higher numbers of these experimentally verified TF-TF interactions than the existing Feature Interaction Score, and also has the advantage of not requiring a computationally expensive post-processing step. Finally, SATORI can be used for detection of any type of feature interaction in models that use a similar attention mechanism, and is not limited to the detection of TF-TF interactions.

Availability: The source code for SATORI is available at https://github.com/fahadahaf/[email protected]
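
Illustrative sketch (not SATORI's architecture): the attention weights that such a model inspects for interactions can be illustrated with plain scaled dot-product self-attention. For brevity the sketch uses identity projections (queries, keys, and values are all the input itself) and a single head, with no convolutional or recurrent layers; the input vectors are invented.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    Returns (weights, output); weights[i][j] is how strongly
    position i attends to position j.
    """
    d = len(x[0])
    scores = [
        [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        for query in x
    ]
    weights = [softmax(row) for row in scores]
    output = [
        [sum(w * value[j] for w, value in zip(w_row, x)) for j in range(d)]
        for w_row in weights
    ]
    return weights, output


# Three hypothetical sequence positions; positions 0 and 2 share a feature.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, output = self_attention(x)
print(weights[0])
```

Position 0 attends more to position 2 than to position 1 because their feature vectors match, which is the kind of pairwise signal an attention-based interaction analysis starts from.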

PolySTest: Robust Statistical Testing of Proteomics Data with Missing Values Improves Detection of Biologically Relevant Features

Molecular & Cellular Proteomics ◽

10.1074/mcp.ra119.001777 ◽

2020 ◽

Vol 19 (8) ◽

pp. 1396-1408 ◽

Cited By ~ 2

Author(s):

Veit Schwämmle ◽

Christina E. Hagensen ◽

Adelina Rogowska-Wrzesinska ◽

Ole N. Jensen

Keyword(s):

Mass Spectrometry ◽

Large Scale ◽

Missing Values ◽

Statistical Tests ◽

Ground Truth ◽

Statistical Testing ◽

Molecular Networks ◽

Proteomics Data ◽

Biologically Relevant ◽

Data Browsing

Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment intrinsic data structures and variations and often also reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicate the decision of choosing appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10–20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.

Identification of Biologically Relevant Biclusters from Gene Expression Dataset of Duchenne Muscular Dystrophy (DMD) Disease Using Elephant Swarm Water Search Algorithm

Advances in Intelligent Systems and Computing - Emerging Technologies in Data Mining and Information Security ◽

10.1007/978-981-15-9927-9_15 ◽

2021 ◽

pp. 147-157

Author(s):

Joy Adhikary ◽

Sriyankar Acharyya

Keyword(s):

Gene Expression ◽

Duchenne Muscular Dystrophy ◽

Muscular Dystrophy ◽

Search Algorithm ◽

Gene Expression Dataset ◽

Biologically Relevant

A data mining paradigm for identifying key factors in biological processes using gene expression data

10.1101/327478 ◽

2018 ◽

Author(s):

Jin Li ◽

Le Zheng ◽

Akihiko Uchiyama ◽

Lianghua Bin ◽

Theodora M. Mauro ◽

...

Keyword(s):

Gene Expression ◽

Large Scale ◽

Molecular Mechanisms ◽

Biological Data ◽

Epidermal Differentiation ◽

Biological Processes ◽

Loss Of Function ◽

Key Factors ◽

Primary Analysis ◽

Epidermal Development

Abstract: A large volume of biological data is being generated for studying mechanisms of various biological processes. These precious data enable large-scale computational analyses to gain biological insights. However, it remains a challenge to mine the data efficiently for knowledge discovery. The heterogeneity of these data makes it difficult to consistently integrate them, slowing down the process of biological discovery. We introduce a data processing paradigm to identify key factors in biological processes via systematic collection of gene expression datasets, primary analysis of data, and evaluation of consistent signals. To demonstrate its effectiveness, our paradigm was applied to epidermal development and identified many genes that play a potential role in this process. Besides the known epidermal development genes, a substantial proportion of the identified genes are still not supported by gain- or loss-of-function studies, yielding many novel genes for future studies. Among them, we selected a top gene for loss-of-function experimental validation and confirmed its function in epidermal differentiation, proving the ability of this paradigm to identify new factors in biological processes. In addition, this paradigm revealed many key genes in cold-induced thermogenesis using data from cold-challenged tissues, demonstrating its generalizability. This paradigm can lead to fruitful results for studying molecular mechanisms in an era of explosive accumulation of publicly available biological data.

GEESE: Metabolically driven latent space learning for gene expression data

10.1101/365643 ◽

2018 ◽

Cited By ~ 1

Author(s):

Marco Barsacchi ◽

Helena Andres Terre ◽

Pietro Lió

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Metabolic Model ◽

Generative Models ◽

Biological Data ◽

Expression Data ◽

Biologically Relevant ◽

Latent Space ◽

Unseen Data ◽

Expression Microarrays

Abstract: Gene expression microarrays provide a characterisation of the transcriptional activity of a particular biological sample. Their high dimensionality hampers the process of pattern recognition and extraction. Several approaches have been proposed for gleaning information about the hidden structure of the data. Among these approaches, deep generative models provide a powerful way of approximating the manifold on which the data reside. Here we develop GEESE, a deep learning based framework that provides novel insight into manifold learning for gene expression data, employing a metabolic model to constrain the learned representation. We evaluated the proposed framework, showing its ability to capture biologically relevant features and to encode those features in a much simpler latent space. We showed how using a metabolic model to drive the autoencoder learning process helps in achieving better generalisation to unseen data. GEESE provides a novel perspective on the problem of unsupervised learning for biological data.

Availability: Source code of GEESE is available at https://bitbucket.org/mbarsacchi/geese/.
