Comprehensive biological interpretation of gene signatures using semantic distributed representation

2019 ◽  
Author(s):  
Yuumi Okuzono ◽  
Takashi Hoshino

Abstract The recent rise of microarrays and next-generation sequencing in genome-related fields has simplified obtaining gene expression data at the whole-gene level, and the biological interpretation of gene signatures related to life phenomena and diseases has become very important. However, the conventional method is a numerical comparison of the overlap and distribution bias among gene signatures, pathways, and gene ontology (GO) terms, and it cannot compare the specificity and importance of the genes contained in gene signatures as humans do. This study proposes the gene signature vector (GsVec), a unique method for interpreting gene signatures that clarifies the semantic relationship between gene signatures by incorporating a method of distributed document representation from natural language processing (NLP). In the proposed algorithm, a gene-topic vector is created by multiplying the feature vector based on the gene’s distributed representation by the probability of the gene signature topic and by a weight reflecting how rarely the corresponding gene occurs across all gene signatures. These vectors are concatenated for the genes included in each gene signature to create a signature vector. The degrees of similarity between signature vectors are obtained from the cosine distances, and the levels of relevance between gene signatures are quantified. Using the above algorithm, GsVec learned approximately 5,000 types of canonical pathway and GO biological process gene signatures published in the Molecular Signatures Database (MSigDB). GsVec was then validated against the pathway database BioCarta, whose biological significance is known, and against actual gene expression data (differentially expressed genes); both validations yielded biologically valid results. In addition, compared with the pathway enrichment analysis based on Fisher’s exact test used in the conventional method, GsVec returned signatures that were equally or more biologically valid. Furthermore, although NLP tools are generally developed in Python, GsVec can execute the entire process in R alone, the main language of bioinformatics.
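
The weighting and comparison steps described in this abstract can be illustrated with a minimal sketch. It is written in Python purely for illustration (GsVec itself runs in R) and is not the authors' implementation. The toy gene embeddings, the log-based rarity weight, the optional `topic_prob` argument, and the use of a weighted sum to combine gene vectors are all assumptions made here; the abstract does not give the exact formulas.

```python
import math
import numpy as np

def rarity_weights(signatures):
    """Weight each gene by how rarely it occurs across all signatures (IDF-like, assumed form)."""
    n = len(signatures)
    counts = {}
    for genes in signatures.values():
        for g in set(genes):
            counts[g] = counts.get(g, 0) + 1
    return {g: math.log(n / c) + 1.0 for g, c in counts.items()}

def signature_vector(genes, gene_vecs, rarity, topic_prob=None):
    """Combine weighted gene vectors into one signature vector (weighted sum, an assumption)."""
    topic_prob = topic_prob or {}  # hypothetical per-gene topic probabilities, default 1.0
    parts = [gene_vecs[g] * rarity[g] * topic_prob.get(g, 1.0)
             for g in genes if g in gene_vecs]
    return np.sum(parts, axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings and signatures (hypothetical gene and signature names).
gene_vecs = {
    "g1": np.array([1.0, 0.0, 0.0]),
    "g2": np.array([0.9, 0.1, 0.0]),
    "g3": np.array([0.8, 0.2, 0.0]),
    "g4": np.array([0.0, 0.0, 1.0]),
    "g5": np.array([0.0, 0.1, 0.9]),
}
signatures = {"sig_a": ["g1", "g2"], "sig_b": ["g1", "g3"], "sig_c": ["g4", "g5"]}

rarity = rarity_weights(signatures)
vecs = {name: signature_vector(genes, gene_vecs, rarity)
        for name, genes in signatures.items()}
sim_ab = cosine_similarity(vecs["sig_a"], vecs["sig_b"])
sim_ac = cosine_similarity(vecs["sig_a"], vecs["sig_c"])
```

In this toy setup, `sig_a` and `sig_b` share a gene and have nearby embeddings, so `sim_ab` comes out higher than `sim_ac`.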

2019 ◽  
Vol 21 (5) ◽  
pp. 1818-1824 ◽  
Author(s):  
Qi Zhao ◽  
Yu Sun ◽  
Zekun Liu ◽  
Hongwan Zhang ◽  
Xingyang Li ◽  
...  

Abstract Unsupervised clustering of high-throughput gene expression data is widely adopted for cancer subtyping. However, cancer subtypes derived from a single dataset are usually not applicable across multiple datasets from different platforms. Merging different datasets is necessary to determine accurate and applicable cancer subtypes but remains difficult because of batch effects. CrossICC is an R package designed for the unsupervised clustering of gene expression data from multiple datasets/platforms without the requirement of batch effect adjustment. CrossICC utilizes an iterative strategy to derive the optimal gene signature and cluster numbers from a consensus similarity matrix generated by consensus clustering. This package also provides abundant functions to visualize the identified subtypes and evaluate subtyping performance. We expect that CrossICC could be used to discover robust cancer subtypes with significant translational implications in personalized care for cancer patients. Availability and Implementation: The package is implemented in R and available at GitHub (https://github.com/bioinformatist/CrossICC) and Bioconductor (http://bioconductor.org/packages/release/bioc/html/CrossICC.html) under the GPL v3 License.
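
As a rough illustration of what a consensus similarity matrix is, the sketch below counts how often each pair of samples lands in the same cluster across repeated clustering runs. It does not reproduce CrossICC's iterative strategy or its R API. The function name, the `-1` convention for samples left out of a run, and the toy labels are assumptions made here.

```python
import numpy as np

def consensus_matrix(label_runs):
    """Fraction of runs in which two samples share a cluster, among runs that include both.

    label_runs: array of shape (n_runs, n_samples); -1 marks a sample not drawn in that run.
    """
    runs = np.asarray(label_runs)
    n = runs.shape[1]
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for labels in runs:
        present = labels >= 0
        both = np.outer(present, present)
        same = (labels[:, None] == labels[None, :]) & both
        sampled += both
        together += same
    # Pairs never sampled together get a consensus of 0 instead of NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(sampled > 0, together / sampled, 0.0)

# Three toy clustering runs over four samples; the last sample is missing from run 2.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, -1],
    [0, 1, 1, 1],
]
consensus = consensus_matrix(runs)
```

Cluster label values need not match across runs, since only co-membership within a run is counted.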


2015 ◽  
Author(s):  
Jie Tan ◽  
John H Hammond ◽  
Deborah A Hogan ◽  
Casey S Greene

The growth in genome-scale assays of gene expression for different species in publicly available databases presents new opportunities for computational methods that aid in hypothesis generation and biological interpretation of these data. Here, we present an unsupervised machine-learning approach, ADAGE (Analysis using Denoising Autoencoders of Gene Expression) and apply it to the interpretation of all of the publicly available gene expression data for Pseudomonas aeruginosa, an important opportunistic bacterial pathogen. In post-hoc positive control analyses using curated knowledge, the P. aeruginosa ADAGE model found that co-operonic genes often participated in similar processes and accurately predicted which genes had similar functions. By analyzing newly generated data and previously published microarray and RNA-seq data, the ADAGE model identified gene expression differences between strains, modeled the cellular response to low oxygen, and predicted the involvement of biological processes despite low level expression differences in directly involved genes. Comparison of ADAGE with PCA and ICA revealed that ADAGE extracts distinct signals. We provide the ADAGE model with analysis of all publicly available P. aeruginosa GeneChip experiments, and we provide open source code for use in other species and settings.
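
A denoising autoencoder of the general kind named above can be sketched in a few lines of NumPy. This is a minimal illustration, not the ADAGE code: the tied weights, sigmoid units, masking noise, squared-error loss, and all hyperparameters are assumptions chosen for brevity, and the expression matrix is synthetic.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae(x, n_hidden=4, corruption=0.1, lr=0.5, epochs=500, seed=0):
    """Train a one-hidden-layer denoising autoencoder with tied weights by gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = rng.normal(0.0, 0.1, size=(d, n_hidden))  # shared by encoder and decoder
    b = np.zeros(n_hidden)
    c = np.zeros(d)
    for _ in range(epochs):
        mask = rng.random(x.shape) >= corruption   # masking noise: zero out some inputs
        x_noisy = x * mask
        h = sigmoid(x_noisy @ w + b)               # hidden activations
        y = sigmoid(h @ w.T + c)                   # reconstruction of the clean input
        d_out = (y - x) * y * (1.0 - y)            # gradient at the output pre-activation
        d_hid = (d_out @ w) * h * (1.0 - h)        # gradient at the hidden pre-activation
        grad_w = (d_out.T @ h + x_noisy.T @ d_hid) / n
        w -= lr * grad_w
        b -= lr * d_hid.mean(axis=0)
        c -= lr * d_out.mean(axis=0)
    return w, b, c

def reconstruct(x, w, b, c):
    return sigmoid(sigmoid(x @ w + b) @ w.T + c)

def mse(x, params):
    return float(np.mean((reconstruct(x, *params) - x) ** 2))

# Synthetic data: 40 samples x 10 genes, two groups with opposite expression patterns.
rng = np.random.default_rng(1)
pattern_a = np.array([0.9] * 5 + [0.1] * 5)
pattern_b = pattern_a[::-1]
expr = np.vstack([np.tile(pattern_a, (20, 1)), np.tile(pattern_b, (20, 1))])
expr = np.clip(expr + rng.normal(0.0, 0.02, expr.shape), 0.0, 1.0)

error_before = mse(expr, train_dae(expr, epochs=0))   # untrained parameters
error_after = mse(expr, train_dae(expr, epochs=500))
```

After training, each column of `w` holds one hidden unit's per-gene weights, which is the kind of quantity such models inspect when relating hidden units to biological processes.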


2019 ◽  
Author(s):  
Shaoke Lou ◽  
Tianxiao Li ◽  
Daniel Spakowicz ◽  
Geoffrey Lowell Chupp ◽  
Mark Gerstein

Abstract The pathogenesis of asthma is a complex process involving multiple genes and pathways. Identifying biomarkers from asthma datasets, especially those that include heterogeneous subpopulations, is challenging. In this work, we developed a framework that incorporates a denoising autoencoder and a supervised learning approach to identify gene signatures related to asthma severity. The autoencoder embeds high-dimensional gene expression data into a lower-dimensional latent space in an unsupervised fashion, enabling us to extract distinguishing features from gene expression data. We found that the weights on hidden units in this latent space correlate well with previously defined and clinically relevant clusters of patients. Moreover, pathway analysis based on each gene’s contribution to the hidden units showed significant enrichment in known asthma-related pathways. We then used genes that contribute most to the hidden units to develop a secondary supervised classifier (based on random forest) for directly predicting asthma severity. The random-forest importance metric from this classifier identified a signature based on 50 key genes, which can predict severity with an AUROC of 0.81 and thus have potential as diagnostic biomarkers. Furthermore, the key genes could also be used for successfully estimating, via support-vector-machine regression, the FEV1/FVC ratios across patients, achieving pre- and post-treatment correlations of 0.56 and 0.65, respectively (between predicted and observed values). The 50 biomarker candidate genes can be found in the supplementary material. The source code is freely available upon request.
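
For readers unfamiliar with the AUROC figure quoted above, the sketch below computes it directly from its definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and scores are made-up toy values, not data from the study.

```python
import numpy as np

def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive score with every negative score.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy example: 1 = severe, 0 = not severe (hypothetical classifier scores).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)  # 7 of 9 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of samples, which is fine for small cohorts; rank-based formulas give the same value more efficiently.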


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Blaise Hanczar ◽  
Farida Zehraoui ◽  
Tina Issa ◽  
Mathieu Arles

Abstract Background: The use of predictive gene signatures to assist clinical decision-making is becoming more and more important. Deep learning has huge potential in the prediction of phenotype from gene expression profiles. However, neural networks are viewed as black boxes, where accurate predictions are provided without any explanation. The requirements for these models to become interpretable are increasing, especially in the medical field. Results: We focus on explaining the predictions of a deep neural network model built from gene expression data. The most important neurons and genes influencing the predictions are identified and linked to biological knowledge. Our experiments on cancer prediction show that: (1) the deep learning approach outperforms classical machine learning methods on large training sets; (2) our approach produces interpretations more coherent with biology than the state-of-the-art based approaches; (3) we can provide a comprehensive explanation of the predictions for biologists and physicians. Conclusion: We propose an original approach for biological interpretation of deep learning models for phenotype prediction from gene expression data. Since the model can find relationships between the phenotype and gene expression, we may assume that there is a link between the identified genes and the phenotype. The interpretation can, therefore, lead to new biological hypotheses to be investigated by biologists.


2020 ◽  
Author(s):  
Shaoke Lou ◽  
Tianxiao Li ◽  
Daniel Spakowicz ◽  
Xiting Yan ◽  
Geoffrey Lowell Chupp ◽  
...  

Abstract Background: The pathogenesis of asthma is a complex process involving multiple genes and pathways. Identifying biomarkers from asthma datasets, especially those that include heterogeneous subpopulations, is challenging. In this work, we developed a framework that incorporates a denoising autoencoder and a supervised learning approach to identify gene signatures related to asthma severity. The autoencoder embeds high-dimensional gene expression data into a lower-dimensional latent space in an unsupervised fashion, enabling us to extract distinguishing features from gene expression data. Results: Using the trained autoencoder model, we found that the weights on hidden units in the latent space correlate well with previously defined and clinically relevant clusters of patients. Moreover, pathway analysis based on each gene's contribution to the hidden units showed significant enrichment in known asthma-related pathways. We then used genes that contribute most to the hidden units to develop a secondary supervised classifier (based on random forest) for directly predicting asthma severity. The random-forest importance metric from this classifier identified a signature based on 50 key genes, which can predict severity with an AUROC of 0.81 and thus have potential as diagnostic biomarkers. Furthermore, the key genes could also be used for successfully estimating, via support-vector-machine regression, the FEV1/FVC ratios across patients, achieving pre- and post-treatment correlations of 0.56 and 0.65, respectively (between predicted and observed values). Conclusions: The denoising autoencoder framework could extract meaningful functional genes and patient groups from the gene expression profile of asthma patients. These patterns may provide potential sources for biomarkers for asthma severity.


2020 ◽  
Author(s):  
Samaneh Maleknia ◽  
Ali Sharifi-Zarchi ◽  
Vahid Rezaei Tabar ◽  
Mohsen Namazi ◽  
Kaveh Kavousi

Abstract Motivation: One of the most popular techniques in biological studies for analyzing high-throughput data is pathway enrichment analysis (PEA). Many researchers apply the existing methods without considering the topology of pathways, or they overlook a significant part of the structure, which may reduce the accuracy and generalizability of the results. Developing a new approach that considers gene expression data together with topological features, such as causal relations encoded by edge directions, will help investigators achieve more accurate results. Results: We propose a new pathway enrichment analysis based on Bayesian networks (BNrich) as an approach to PEA. To this end, the cycles in 187 KEGG human signaling pathways were eliminated according to intuitive biological rules, and the Bayesian network structures were constructed. The constructed networks were simplified by the Least Absolute Shrinkage and Selection Operator (LASSO), and their parameters were estimated using the gene expression data. Finally, we prioritize the impacted pathways by applying Fisher’s exact test to the significant parameters. Our method integrates both edge- and node-related parameters to enrich modules in the affected signaling pathway network. To evaluate the proposed method, consistency, discrimination, false positive rate, and empirical P-value criteria were calculated, and the results were compared with well-known enrichment methods such as signaling pathway impact analysis (SPIA), bi-level meta-analysis (BLMA), and topology-based pathway enrichment analysis (TPEA). Availability: The R package is available on CRAN.
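
For context, the conventional over-representation form of Fisher's exact test that enrichment methods build on can be sketched as a one-sided hypergeometric tail. This shows only the standard gene-overlap version, not BNrich's procedure, which applies the test to significant network parameters. The function name and the toy counts are assumptions made here.

```python
from math import comb

def enrichment_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided Fisher's exact test: P(overlap >= n_overlap) under the hypergeometric null.

    n_universe: all genes considered; n_pathway: genes annotated to the pathway;
    n_selected: genes in the signature; n_overlap: signature genes that are in the pathway.
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy counts: 20 genes overall, 5 in the pathway, 4 selected, 3 of those in the pathway.
p_value = enrichment_p_value(20, 5, 4, 3)  # (10*15 + 5*1) / 4845
```

Exact integer arithmetic with `math.comb` avoids floating-point error in the tail sum; for genome-scale counts a log-space or library implementation is usually preferred for speed.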


2019 ◽  
Vol 14 (6) ◽  
pp. 491-503
Author(s):  
Md. Shahjaman ◽  
Nishith Kumar ◽  
Md. Nurul Haque Mollah

Background: DNA microarray technology allows researchers to measure the expression levels of thousands of genes simultaneously. The main objective of microarray gene expression (GE) data analysis is to detect biomarker genes that are Differentially Expressed (DE) between two or more experimental groups/conditions. Objective: There are some popular statistical methods in the literature for the selection of biomarker genes. However, most of them often produce misleading results in the presence of outliers. Therefore, in this study, we introduce a robust approach to overcome the problems of classical methods. Methods: We use the median and median absolute deviation (MAD) for our robust procedure. In this procedure, a gene was considered an outlying gene if at least one of its expression values fell outside the interval defined by the proposed outlier detection rule. Otherwise, the gene was considered a non-outlying gene. Results: We investigate the performance of the proposed method in comparison with the traditional methods using both simulated and real gene expression data analysis. From a real colon cancer gene expression data analysis, the proposed method detected an additional fourteen (14) DE genes that were not detected by the traditional methods. Using the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis, we observed that these additional 14 DE genes are involved in three important metabolic pathways of cancer disease. The proposed method also detected nine (9) additional DE genes from another head-and-neck cancer gene expression data analysis; these genes are involved in the top ten metabolic pathways obtained from the KEGG pathway database. Conclusion: The results on both simulated and real cancer gene expression datasets show better performance with our proposed procedure. Therefore, the additional genes detected by the proposed procedure require further wet lab validation.
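
The median/MAD rule described in the Methods can be sketched as follows. The abstract does not state the exact interval, so the cutoff `k = 3` and the consistency constant 1.4826 (which makes the MAD comparable to a standard deviation under normality) are common conventions assumed here, not the authors' settings; the expression values are made up.

```python
import numpy as np

def outlying_genes(expr, k=3.0):
    """Flag a gene as outlying if any of its values lies outside median +/- k * scaled MAD.

    expr: array of shape (n_genes, n_samples). Returns a boolean array of length n_genes.
    """
    expr = np.asarray(expr, dtype=float)
    med = np.median(expr, axis=1, keepdims=True)
    mad = 1.4826 * np.median(np.abs(expr - med), axis=1, keepdims=True)
    outside = np.abs(expr - med) > k * mad
    return outside.any(axis=1)

# Two toy genes across six samples; the first has one aberrant measurement.
toy_expr = [
    [5.0, 5.1, 4.9, 5.2, 5.0, 12.0],
    [3.0, 3.1, 2.9, 3.2, 3.0, 3.1],
]
flags = outlying_genes(toy_expr)
```

Because both the center and the spread are medians, a single extreme value barely shifts the interval, which is what lets the rule detect that value.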

