Infer related genes from large scale gene expression dataset with embedding

2018 ◽  
Author(s):  
Chi Tung Choy ◽  
Chi Hang Wong ◽  
Stephen Lam Chan

Abstract Artificial neural networks (ANNs) have been used for classification and prediction tasks with remarkable accuracy. However, their use for unsupervised data mining of molecular data is under-explored. We adopted an unsupervised ANN method, namely word embedding, to extract biologically relevant information from the TCGA gene expression dataset. Ground-truth relationships, such as the cancer types of the input samples and the semantic meaning of genes, were shown to be retained in the resulting entity matrices. We also demonstrated the interpretability of these matrices and their use in shortlisting candidates from a long gene list. This method can feasibly mine large volumes of biological data and would be a valuable tool for discovering novel knowledge from omics data. The resulting embedding matrices mined from TCGA gene expression data are interactively explorable online (http://bit.ly/tcga-embedding-cancer) and could serve as an informative reference.
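
The shortlisting step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each gene has a row in an embedding matrix and that relatedness is scored by cosine similarity; the abstract does not state the similarity measure, and the gene names and vectors below are hypothetical toy values, not TCGA results.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shortlist(query, candidates, embedding, top_n=2):
    """Rank candidate genes by similarity to the query gene's embedding."""
    scored = [
        (gene, cosine(embedding[query], embedding[gene]))
        for gene in candidates
        if gene != query and gene in embedding
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical 3-dimensional embeddings (assumption: illustrative only).
toy_embedding = {
    "GENE1": [0.9, 0.1, 0.0],
    "GENE2": [0.8, 0.2, 0.1],
    "GENE3": [0.0, 1.0, 0.2],
    "GENE4": [-0.7, 0.1, 0.6],
}

ranked = shortlist("GENE1", ["GENE2", "GENE3", "GENE4"], toy_embedding)
for gene, score in ranked:
    print(f"{gene}\t{score:.3f}")
```

With real data, `toy_embedding` would be replaced by the learned gene matrix, and `top_n` would be chosen to suit the length of the candidate list.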

Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 772
Author(s):  
Seonghun Kim ◽  
Seockhun Bae ◽  
Yinhua Piao ◽  
Kyuri Jo

Genomic profiles of cancer patients such as gene expression have become a major source to predict responses to drugs in the era of personalized medicine. As large-scale drug screening data with cancer cell lines are available, a number of computational methods have been developed for drug response prediction. However, few methods incorporate both gene expression data and the biological network, which can harbor essential information about the underlying process of the drug response. We proposed an analysis framework called DrugGCN for prediction of Drug response using a Graph Convolutional Network (GCN). DrugGCN first generates a gene graph by combining a Protein-Protein Interaction (PPI) network and gene expression data with feature selection of drug-related genes, and the GCN model detects the local features such as subnetworks of genes that contribute to the drug response by localized filtering. We demonstrated the effectiveness of DrugGCN using biological data showing its high prediction accuracy among the competing methods.


2004 ◽  
Vol 18 (2) ◽  
pp. 167-183 ◽  
Author(s):  
Jianhua Zhang ◽  
Amy Moseley ◽  
Anil G. Jegga ◽  
Ashima Gupta ◽  
David P. Witte ◽  
...  

To understand the commitment of the genome to nervous system differentiation and function, we sought to compare nervous system gene expression to that of a wide variety of other tissues by gene expression database construction and mining. Gene expression profiles of 10 different adult nervous tissues were compared with that of 72 other tissues. Using ANOVA, we identified 1,361 genes whose expression was higher in the nervous system than other organs and, separately, 600 genes whose expression was at least threefold higher in one or more regions of the nervous system compared with their median expression across all organs. Of the 600 genes, 381 overlapped with the 1,361-gene list. Limited in situ gene expression analysis confirmed that identified genes did represent nervous system-enriched gene expression, and we therefore sought to evaluate the validity and significance of these top-ranked nervous system genes using known gene literature and gene ontology categorization criteria. Diverse functional categories were present in the 381 genes, including genes involved in intracellular signaling, cytoskeleton structure and function, enzymes, RNA metabolism and transcription, membrane proteins, as well as cell differentiation, death, proliferation, and division. We searched existing public sites and identified 110 known genes related to mental retardation, neurological disease, and neurodegeneration. Twenty-one of the 381 genes were within the 110-gene list, compared with a random expectation of 5. This suggests that the 381 genes provide a candidate set for further analyses in neurological and psychiatric disease studies and that as a field, we are as yet, far from a large-scale understanding of the genes that are critical for nervous system structure and function. Together, our data indicate the power of profiling an individual biologic system in a multisystem context to gain insight into the genomic basis of its structure and function.
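
The overlap reported in this abstract (21 observed against a random expectation of 5) can be reproduced as a small worked calculation. The sketch below assumes a hypergeometric model of random overlap. The abstract does not give the size of the gene universe, so the value used here is back-calculated from the rounded expectation of 5 and is an assumption; the resulting tail probability is indicative only.

```python
from math import comb


def expected_overlap(n_selected, n_annotated, n_universe):
    """Mean overlap when n_selected genes are drawn at random."""
    return n_selected * n_annotated / n_universe


def hypergeom_tail(k, n_selected, n_annotated, n_universe):
    """P(overlap >= k) for a random draw without replacement."""
    total = comb(n_universe, n_selected)
    upper = min(n_selected, n_annotated)
    hits = sum(
        comb(n_annotated, i) * comb(n_universe - n_annotated, n_selected - i)
        for i in range(k, upper + 1)
    )
    return hits / total


# Figures taken from the abstract.
SELECTED = 381            # nervous-system-enriched genes
ANNOTATED = 110           # known disease-related genes
OBSERVED = 21             # genes found in both lists
REPORTED_EXPECTATION = 5  # random expectation as reported (rounded)

fold_enrichment = OBSERVED / REPORTED_EXPECTATION

# ASSUMPTION: universe size implied by the rounded expectation,
# not a number stated in the abstract.
assumed_universe = round(SELECTED * ANNOTATED / REPORTED_EXPECTATION)

p_value = hypergeom_tail(OBSERVED, SELECTED, ANNOTATED, assumed_universe)
print(f"fold enrichment: {fold_enrichment:.1f}")
print(f"assumed universe: {assumed_universe}")
print(f"P(overlap >= {OBSERVED}): {p_value:.2e}")
```

The fold enrichment depends only on the reported figures; the tail probability changes with the assumed universe size and should be recomputed if the true number of genes on the platform is known.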


2019 ◽  
Vol 51 (10) ◽  
pp. 981-988 ◽  
Author(s):  
Xiaolan Rao ◽  
Richard A Dixon

Abstract Co-expression network analysis is one of the most powerful approaches for interpretation of large transcriptomic datasets. It enables characterization of modules of co-expressed genes that may share biological functional linkages. Such networks provide an initial way to explore functional associations from gene expression profiling and can be applied to various aspects of plant biology. This review presents the applications of co-expression network analysis in plant biology and addresses optimized strategies from the recent literature for performing co-expression analysis on plant biological systems. Additionally, we describe the combined interpretation of co-expression analysis with other genomic data to enhance the generation of biologically relevant information.
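
As a rough illustration of the idea of co-expressed modules, the sketch below builds a hard-thresholded network from pairwise Pearson correlations and reports connected components as modules. This is one deliberately simple variant chosen for illustration, not a method prescribed by the review; the threshold, gene names and expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_modules(expr, threshold=0.9):
    """Link genes with |r| >= threshold; return connected components."""
    adjacency = {gene: set() for gene in expr}
    for g1, g2 in combinations(expr, 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            adjacency[g1].add(g2)
            adjacency[g2].add(g1)

    seen, modules = set(), []
    for gene in expr:
        if gene in seen:
            continue
        stack, component = [gene], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        modules.append(sorted(component))
    return modules


# Hypothetical expression values across four samples (assumption).
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [5.0, 1.0, 4.0, 2.0],
    "geneD": [4.9, 1.2, 4.1, 1.8],
}

modules = coexpression_modules(toy_expression)
print(modules)
```

Because the absolute correlation is used, strongly anti-correlated genes would also be linked; a signed network would keep only positive correlations.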


2019 ◽  
Author(s):  
Veit Schwämmle ◽  
Christina E Hagensen ◽  
Adelina Rogowska-Wrzesinska ◽  
Ole N. Jensen

Abstract Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment intrinsic data structures and variations and often also reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicate the decision of choosing appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss Test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss Test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10%-20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss Test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.


2019 ◽  
Author(s):  
Itamar Kanter ◽  
Gur Yaari ◽  
Tomer Kalisky

Abstract Recent advances in data acquisition technologies in biology have led to major challenges in mining relevant information from large datasets. For example, single-cell RNA sequencing technologies are producing expression and sequence information from tens of thousands of cells in every single experiment. A common task in analyzing biological data is to cluster samples or features (e.g. genes) into groups sharing common characteristics. This is an NP-hard problem for which numerous heuristic algorithms have been developed. However, in many cases, the clusters created by these algorithms do not reflect biological reality. To overcome this, a Networks Based Clustering (NBC) approach was recently proposed, by which the samples or genes in the dataset are first mapped to a network and then community detection (CD) algorithms are used to identify clusters of nodes. Here, we created an open and flexible Python-based toolkit for NBC that enables easy and accessible network construction and community detection. We then tested the applicability of NBC for identifying clusters of cells or genes from previously published large-scale single-cell and bulk RNA-seq datasets. We show that NBC can be used to accurately and efficiently analyze large-scale datasets of RNA sequencing experiments.
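
The two-stage idea described in this abstract (map samples to a network, then detect communities) can be sketched in a few lines. The sketch assumes a k-nearest-neighbour graph on Euclidean distance and a deterministic variant of label propagation as the community detection step; these choices, the function names and the toy points are illustrative assumptions and do not represent the API or algorithms of the authors' toolkit.

```python
from collections import Counter
from math import dist


def knn_graph(points, k):
    """Undirected k-nearest-neighbour graph as {index: set of indices}."""
    adjacency = {i: set() for i in range(len(points))}
    for i, point in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(point, points[j]),
        )
        for j in others[:k]:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def label_propagation(adjacency, max_rounds=100):
    """Each node repeatedly adopts its neighbours' most common label.

    Ties are broken by the smallest label so that the result is
    reproducible (a simplification of the usual randomized version).
    """
    labels = {node: node for node in adjacency}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            if not adjacency[node]:
                continue
            counts = Counter(labels[nb] for nb in adjacency[node])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, []).append(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical 2-D samples forming two well-separated groups (assumption).
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

clusters = label_propagation(knn_graph(toy_points, k=2))
print(clusters)
```

On real expression data the points would be cells or genes in a reduced feature space, and the choice of `k` controls how densely the network is connected before community detection.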


2020 ◽  
Author(s):  
Fahad Ullah ◽  
Asa Ben-Hur

Abstract Motivation: Deep learning has demonstrated its predictive power in modeling complex biological phenomena such as gene expression. The value of these models hinges not only on their accuracy, but also on the ability to extract biologically relevant information from the trained models. While there has been much recent work on developing feature attribution methods that discover the most important features for a given sequence, inferring cooperativity between regulatory elements, which is the hallmark of phenomena such as gene expression, remains an open problem. Results: We present SATORI, a Self-ATtentiOn based model to predict Regulatory element Interactions. Our approach combines convolutional and recurrent layers with a self-attention mechanism that helps us capture a global view of the landscape of interactions between regulatory elements in a sequence. We evaluate our method on simulated data and three complex datasets: human TAL1-GATA1 transcription factor ChIP-Seq, DNase I Hypersensitive Sites (DHSs) in human promoters across 164 cell lines, and genome-wide DNase I-Seq and ATAC-Seq peaks across 36 Arabidopsis samples. In each of the three experiments SATORI identified numerous statistically significant TF-TF interactions, many of which have been previously reported. Our method is able to detect higher numbers of these experimentally verified TF-TF interactions than the existing Feature Interaction Score, and also has the advantage of not requiring a computationally expensive post-processing step. Finally, SATORI can be used for detection of any type of feature interaction in models that use a similar attention mechanism, and is not limited to the detection of TF-TF interactions. Availability: The source code for SATORI is available at https://github.com/fahadahaf/[email protected]


2020 ◽  
Vol 19 (8) ◽  
pp. 1396-1408 ◽  
Author(s):  
Veit Schwämmle ◽  
Christina E. Hagensen ◽  
Adelina Rogowska-Wrzesinska ◽  
Ole N. Jensen

Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment intrinsic data structures and variations and often also reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicate the decision of choosing appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10–20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.


2018 ◽  
Author(s):  
Jin Li ◽  
Le Zheng ◽  
Akihiko Uchiyama ◽  
Lianghua Bin ◽  
Theodora M. Mauro ◽  
...  

Abstract A large volume of biological data is being generated for studying mechanisms of various biological processes. These precious data enable large-scale computational analyses to gain biological insights. However, it remains a challenge to mine the data efficiently for knowledge discovery. The heterogeneity of these data makes it difficult to consistently integrate them, slowing down the process of biological discovery. We introduce a data processing paradigm to identify key factors in biological processes via systematic collection of gene expression datasets, primary analysis of data, and evaluation of consistent signals. To demonstrate its effectiveness, our paradigm was applied to epidermal development and identified many genes that play a potential role in this process. Besides the known epidermal development genes, a substantial proportion of the identified genes are still not supported by gain- or loss-of-function studies, yielding many novel genes for future studies. Among them, we selected a top gene for loss-of-function experimental validation and confirmed its function in epidermal differentiation, proving the ability of this paradigm to identify new factors in biological processes. In addition, this paradigm revealed many key genes in cold-induced thermogenesis using data from cold-challenged tissues, demonstrating its generalizability. This paradigm can lead to fruitful results for studying molecular mechanisms in an era of explosive accumulation of publicly available biological data.


2018 ◽  
Author(s):  
Marco Barsacchi ◽  
Helena Andres Terre ◽  
Pietro Lió

Abstract Gene expression microarrays provide a characterisation of the transcriptional activity of a particular biological sample. Their high dimensionality hampers the process of pattern recognition and extraction. Several approaches have been proposed for gleaning information about the hidden structure of the data. Among these approaches, deep generative models provide a powerful way of approximating the manifold on which the data reside. Here we develop GEESE, a deep learning based framework that provides novel insight into manifold learning for gene expression data, employing a metabolic model to constrain the learned representation. We evaluated the proposed framework, showing its ability to capture biologically relevant features and to encode those features in a much simpler latent space. We showed how using a metabolic model to drive the autoencoder learning process helps in achieving better generalisation to unseen data. GEESE provides a novel perspective on the problem of unsupervised learning for biological data. Availability: Source code of GEESE is available at https://bitbucket.org/mbarsacchi/geese/.

