A pipeline to create predictive functional networks: application to the tumor progression of hepatocellular carcinoma

Mapping Intimacies ◽

10.1101/605519 ◽

2019 ◽

Author(s):

Maxime Folschette ◽

Vincent Legagneux ◽

Arnaud Poret ◽

Lokmane Chebouba ◽

Carito Guziolowski ◽

...

Keyword(s):

Gene Expression ◽

Hepatocellular Carcinoma ◽

Protein Complexes ◽

Biological Data ◽

Biological Study ◽

Biological Knowledge ◽

Combinatorial Search ◽

Regulatory Knowledge ◽

Pathway Databases ◽

Computational Predictions

AbstractBackgroundIntegrating genome-wide gene expression patient profiles with regulatory knowledge is a challenging task because of the inherent heterogeneity, noise and incompleteness of biological data. From the computational side, several solvers for logic programs are able to perform extremely well in decision problems for combinatorial search domains. The challenge then is how to process the biological knowledge in order to feed these solvers to gain insights in a biological study. It requires formalizing the biological knowledge to give a precise interpretation of this information; currently, very few pathway databases offer this possibility.ResultsThe presented work proposes an automatic pipeline to extract automatically regulatory knowledge from pathway databases and generate novel computational predictions related to the state of expression or activity of biological molecules. We applied it in the context of hepatocellular carcinoma (HCC) progression, and evaluate the precision and the stability of these computational predictions. Our working base is a graph of 3,383 nodes and 13,771 edges extracted from the KEGG database, in which we integrate 209 differentially expressed genes between low and high aggressive HCC across 294 patients. Our computational model predicts the shifts of expression of 146 initially non-observed biological components. Our predictions were validated at 88% using a larger experimental dataset and cross-validation techniques. In particular, we focus on the protein complexes predictions and show for the first time that NFKB1/BCL-3 complexes are activated in aggressive HCC. In spite of the large dimension of the reconstructed models, our analyses over the computational predictions discover a well constrained region where KEGG regulatory knowledge constrains gene expression of several biomolecules. These regions can offer interesting windows to perturb experimentally such complex systems.ConclusionThis new pipeline allows biologists to develop their own predictive models based on a list of genes. It facilitates the identification of new regulatory biomolecules using knowledge graphs and predictive computational methods. Our workflow is implemented in an automatic python pipeline which is publicly available at https://github.com/LokmaneChebouba/key-pipe and contains as testing data all the data used in this paper.

Download Full-text

Enabling semantic queries across federated bioinformatics databases

Database ◽

10.1093/database/baz106 ◽

2019 ◽

Vol 2019 ◽

Cited By ~ 9

Author(s):

Ana Claudia Sima ◽

Tarcisio Mendes de Farias ◽

Erich Zbinden ◽

Maria Anisimova ◽

Manuel Gil ◽

...

Keyword(s):

Gene Expression ◽

Data Integration ◽

Heterogeneous Data ◽

Biological Data ◽

Data Sources ◽

Biological Knowledge ◽

Biological Databases ◽

Semantic Level ◽

Sparql Endpoint ◽

Description Framework

Abstract Motivation: Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data available publicly. However, the heterogeneity of the different data sources, both at the syntactic and the semantic level, still poses significant challenges for achieving interoperability among biological databases. Results: We introduce an ontology-based federated approach for data integration. We applied this approach to three heterogeneous data stores that span different areas of biological knowledge: (i) Bgee, a gene expression relational database; (ii) Orthologous Matrix (OMA), a Hierarchical Data Format 5 orthology DS; and (iii) UniProtKB, a Resource Description Framework (RDF) store containing protein sequence and functional information. To enable federated queries across these sources, we first defined a new semantic model for gene expression called GenEx. We then show how the relational data in Bgee can be expressed as a virtual RDF graph, instantiating GenEx, through dedicated relational-to-RDF mappings. By applying these mappings, Bgee data are now accessible through a public SPARQL endpoint. Similarly, the materialized RDF data of OMA, expressed in terms of the Orthology ontology, is made available in a public SPARQL endpoint. We identified and formally described intersection points (i.e. virtual links) among the three data sources. These allow performing joint queries across the data stores. Finally, we lay the groundwork to enable nontechnical users to benefit from the integrated data, by providing a natural language template-based search interface.

Download Full-text

Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data

Entropy ◽

10.3390/e23010002 ◽

2020 ◽

Vol 23 (1) ◽

pp. 2

Author(s):

Malik Yousef ◽

Abhishek Kumar ◽

Burcu Bakir-Gungor

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Data Analysis ◽

Gene Expression Data ◽

Gene Selection ◽

Biological Data ◽

Biological Information ◽

Background Information ◽

Biological Knowledge ◽

Expression Data

In the last two decades, there have been massive advancements in high throughput technologies, which resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing the gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This problem refers to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most of the computational feature selection methodologies were taken from other fields, without considering the nature of the biological data. Thus, integrative approaches that utilize the biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets. One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to enlighten us on disease state dynamics, mechanisms of their onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to boost the bioinformatics community to develop new approaches for searching and determining significant groups/clusters of features based on one or more biological grouping functions.

Download Full-text

Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data

10.20944/preprints202012.0377.v1 ◽

2020 ◽

Cited By ~ 1

Author(s):

Malik Yousef ◽

Abhishek Kumar ◽

Burcu Bakir-Gungor

Keyword(s):

Gene Expression ◽

Feature Selection ◽

Data Analysis ◽

Gene Expression Data ◽

Gene Selection ◽

Selection Process ◽

Biological Data ◽

Integrative Approach ◽

Biological Knowledge ◽

Expression Data

In the last two decades, there have been massive advancements in high throughput technologies, which resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing the gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This problem refers to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most of the computational feature selection methodologies were taken from other fields, without considering the nature of the biological data. For gene expression data analysis, most of the existing feature selection methods rely on expression values alone to select the genes; and biological knowledge is integrated at the end of the analysis in order to gain biological insights or to support the initial findings. Thus, integrative approaches that utilize the biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets. Since the integrative approach attracted attention in the gene expression domain, lately the gene selection process shifted from being purely data-centric to more incorporative analysis with additional biological knowledge.

Download Full-text

Clustering of gene expression profiles: creating initialization-independent clusterings by eliminating unstable genes

Journal of Integrative Bioinformatics ◽

10.1515/jib-2010-134 ◽

2010 ◽

Vol 7 (3) ◽

Author(s):

Wim De Mulder ◽

Martin Kuiper ◽

René Boel

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Profiles ◽

Clustering Algorithms ◽

Gene Expression Profiles ◽

Biological Data ◽

Biological Knowledge ◽

Expression Data ◽

Data Set ◽

Cluster Membership

SummaryClustering is an important approach in the analysis of biological data, and often a first step to identify interesting patterns of coexpression in gene expression data. Because of the high complexity and diversity of gene expression data, many genes cannot be easily assigned to a cluster, but even if the dissimilarity of these genes with all other gene groups is large, they will finally be forced to become member of a cluster. In this paper we show how to detect such elements, called unstable elements. We have developed an approach for iterative clustering algorithms in which unstable elements are deleted, making the iterative algorithm less dependent on initial centers. Although the approach is unsupervised, it is less likely that the clusters into which the reduced data set is subdivided contain false positives. This clustering yields a more differentiated approach for biological data, since the cluster analysis is divided into two parts: the pruned data set is divided into highly consistent clusters in an unsupervised way and the removed, unstable elements for which no meaningful cluster exists in unsupervised terms can be given a cluster with the use of biological knowledge and information about the likelihood of cluster membership. We illustrate our framework on both an artificial and real biological data set.

Download Full-text

Extracting Gradual Rules to reveal regulation between genes

Current Bioinformatics ◽

10.2174/1574893615999200711170945 ◽

2020 ◽

Vol 15 ◽

Author(s):

Manel Gouider ◽

Ines Hamdi ◽

Henda Ben Ghezala

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

High Throughput ◽

Explicit Knowledge ◽

Biological Data ◽

Genetic Interactions ◽

Human Lung Cancer ◽

Biological Knowledge ◽

Good Representation ◽

Expression Data

Background: The gene regulation represents a very complex mechanism produced in the cell in order to increase or decrease the gene expression. This regulation of genes forms a Gene regulatory Network GRN composed of a collection of genes and products of genes in interaction. The high throughput technologies that generate a huge volume of gene expression data are useful for analyzing the GRN. The biologists are interested in the relevant genetic knowledge hidden in these data sources. Although, the knowledge extracted by the different data mining approaches of the literature are insufficient for inferring the GRN topology or do not give a good representation of the real genetic regulation in the cell. Objective: In this work, we are interested in the extraction of genetic interactions from the high throughput technologies, such as the microarrays or DNA chips. Methods: In this paper, in order to extract expressive and explicit knowledge about the interactions between genes, we use the method of gradual patterns and rules extraction applied on numerical data that extracts the frequent co-variations between gene expression values. Furthermore, we choose to integrate experimental biological data and biological knowledge in the process of knowledge extraction of genetic interactions. Results: The validation results on real gene expression data of the model plant Arabidopsis and human lung cancer shows the performance of this approach. Conclusion: The extracted gradual rules express the genetic interactions composed a GRN, these rules help to understand complex systems and cellular functions.

Download Full-text

Identifying Protein Complexes from Dynamic Temporal Interval Protein-Protein Interaction Networks

BioMed Research International ◽

10.1155/2019/3726721 ◽

2019 ◽

Vol 2019 ◽

pp. 1-17 ◽

Cited By ~ 1

Author(s):

Jinxiong Zhang ◽

Cheng Zhong ◽

Hai Xiang Lin ◽

Mian Wang

Keyword(s):

Gene Expression ◽

Protein Interaction ◽

Protein Complexes ◽

Biological Data ◽

Protein Interaction Networks ◽

Interaction Networks ◽

Temporal Interval ◽

Protein Protein Interaction ◽

Identification Method ◽

Ppi Networks

Identification of protein complex is very important for revealing the underlying mechanism of biological processes. Many computational methods have been developed to identify protein complexes from static protein-protein interaction (PPI) networks. Recently, researchers are considering the dynamics of protein-protein interactions. Dynamic PPI networks are closer to reality in the cell system. It is expected that more protein complexes can be accurately identified from dynamic PPI networks. In this paper, we use the undulating degree above the base level of gene expression instead of the gene expression level to construct dynamic temporal PPI networks. Further we convert dynamic temporal PPI networks into dynamic Temporal Interval Protein Interaction Networks (TI-PINs) and propose a novel method to accurately identify more protein complexes from the constructed TI-PINs. Owing to preserving continuous interactions within temporal interval, the constructed TI-PINs contain more dynamical information for accurately identifying more protein complexes. Our proposed identification method uses multisource biological data to judge whether the joint colocalization condition, the joint coexpression condition, and the expanding cluster condition are satisfied; this is to ensure that the identified protein complexes have the features of colocalization, coexpression, and functional homogeneity. The experimental results on yeast data sets demonstrated that using the constructed TI-PINs can obtain better identification of protein complexes than five existing dynamic PPI networks, and our proposed identification method can find more protein complexes accurately than four other methods.

Download Full-text

Enabling Semantic Queries Across Federated Bioinformatics Databases

10.1101/686600 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ana Claudia Sima ◽

Tarcisio Mendes de Farias ◽

Erich Zbinden ◽

Maria Anisimova ◽

Manuel Gil ◽

...

Keyword(s):

Gene Expression ◽

Data Integration ◽

Heterogeneous Data ◽

Biological Data ◽

Data Sources ◽

Biological Knowledge ◽

Biological Databases ◽

Semantic Level ◽

Sparql Endpoint ◽

Link Type

MotivationData integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data available publicly. However, the heterogeneity of the different data sources, both at the syntactic and the semantic level, still poses significant challenges for achieving interoperability among biological databases.ResultsWe introduce an ontology-based federated approach for data integration. We applied this approach to three heterogeneous data stores that span different areas of biological knowledge: 1) Bgee, a gene expression relational database; 2) OMA, a Hierarchical Data Format 5 (HDF5) orthology data store, and 3) UniProtKB, a Resource Description Framework (RDF) store containing protein sequence and functional information. To enable federated queries across these sources, we first defined a new semantic model for gene expression called GenEx. We then show how the relational data in Bgee can be expressed as a virtual RDF graph, instantiating GenEx, through dedicated relational-to-RDF mappings. By applying these mappings, Bgee data are now accessible through a public SPARQL endpoint. Similarly, the materialised RDF data of OMA, expressed in terms of the Orthology ontology, is made available in a public SPARQL endpoint. We identified and formally described intersection points (i.e. virtual links) among the three data sources. These allow performing joint queries across the data stores. Finally, we lay the groundwork to enable nontechnical users to benefit from the integrated data, by providing a natural language template-based search interface.Project URLhttp://biosoda.expasy.org, https://github.com/biosoda/bioquery

Download Full-text

New gene association measures by joint network embedding of multiple gene expression datasets

10.1101/2020.03.16.992396 ◽

2020 ◽

Author(s):

Guiying Wu ◽

Xiangyu Li ◽

Wenbo Guo ◽

Zheng Wei ◽

Tao Hu ◽

...

Keyword(s):

Gene Expression ◽

Hepatocellular Carcinoma ◽

Target Genes ◽

Linear Time ◽

Single Gene ◽

Biological Data ◽

Stable Gene ◽

Bipartite Networks ◽

Rna Seq ◽

Experimental Conditions

ABSTRACTLarge number of samples are required to construct a reliable gene co-expression network, the samples from a single gene expression dataset are obviously not enough. However, batch effect may widely exist among datasets due to different experimental conditions. We proposed JEBIN (Joint Embedding of multiple BIpartite Networks) algorithm, it can learn a low-dimensional representation vector for each gene by integrating multiple bipartite networks, and each network corresponds to one dataset. JEBIN owns many inherent advantages, such as it is a nonlinear, global model, has linear time complexity with the number of genes, dataset or samples, and can integrate datasets with different distribution. We verified the effectiveness and scalability of JEBIN through a series of simulation experiments, and proved better performance on real biological data than commonly used integration algorithms. In addition, we conducted a differential co-expression analysis of hepatocellular carcinoma between the single-cell and bulk RNA-seq data, and also a contrast between the hepatocellular carcinoma and its adjacency samples using the bulk RNA-seq data. Analysis results prove that JEBIN can obtain comprehensive and stable gene co-expression networks through integrating multiple datasets and has wide prospect in the functional annotation of unknown genes and the regulatory mechanism inference of target genes.

Download Full-text

Hepatocellular carcinoma: Gene expression profiling and regulation of xenobiotic-metabolizing cytochromes P450

Biochemical Pharmacology ◽

10.1016/j.bcp.2020.113912 ◽

2020 ◽

Vol 177 ◽

pp. 113912 ◽

Cited By ~ 1

Author(s):

Jana Nekvindova ◽

Alena Mrkvicova ◽

Veronika Zubanova ◽

Alena Hyrslova Vaculova ◽

Pavel Anzenbacher ◽

...

Keyword(s):

Gene Expression ◽

Hepatocellular Carcinoma ◽

Gene Expression Profiling ◽

Expression Profiling ◽

Cytochromes P450

Download Full-text

On the limits of active module identification

Briefings in Bioinformatics ◽

10.1093/bib/bbab066 ◽

2021 ◽

Author(s):

Olga Lazareva ◽

Jan Baumbach ◽

Markus List ◽

David B Blumenthal

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Small Diameter ◽

Extensive Study ◽

Biological Knowledge ◽

Expression Data ◽

Module Identification ◽

Ppi Networks ◽

Novel Algorithms ◽

Context Specific

Abstract In network and systems medicine, active module identification methods (AMIMs) are widely used for discovering candidate molecular disease mechanisms. To this end, AMIMs combine network analysis algorithms with molecular profiling data, most commonly, by projecting gene expression data onto generic protein–protein interaction (PPI) networks. Although active module identification has led to various novel insights into complex diseases, there is increasing awareness in the field that the combination of gene expression data and PPI network is problematic because up-to-date PPI networks have a very small diameter and are subject to both technical and literature bias. In this paper, we report the results of an extensive study where we analyzed for the first time whether widely used AMIMs really benefit from using PPI networks. Our results clearly show that, except for the recently proposed AMIM DOMINO, the tested AMIMs do not produce biologically more meaningful candidate disease modules on widely used PPI networks than on random networks with the same node degrees. AMIMs hence mainly learn from the node degrees and mostly fail to exploit the biological knowledge encoded in the edges of the PPI networks. This has far-reaching consequences for the field of active module identification. In particular, we suggest that novel algorithms are needed which overcome the degree bias of most existing AMIMs and/or work with customized, context-specific networks instead of generic PPI networks.

Download Full-text