Supervised classification for gene network reconstruction

Biochemical Society Transactions ◽

10.1042/bst0311497 ◽

2003 ◽

Vol 31 (6) ◽

pp. 1497-1502 ◽

Cited By ~ 11

Author(s):

L.A. Soinov

Keyword(s):

Gene Expression ◽

Gene Networks ◽

Supervised Classification ◽

Biochemical Network ◽

Expression Data ◽

Clustering Methods ◽

Boolean Models ◽

Experimental Conditions ◽

Expression Levels ◽

Mathematical Techniques

One of the central problems of functional genomics is revealing gene expression networks: the relationships between genes that reflect observations of how the expression level of each gene affects those of others. Microarray data are currently a major source of information about the interplay of biochemical network participants in living cells. Various mathematical techniques, such as differential equations, Bayesian and Boolean models and several statistical methods, have been applied to expression data in attempts to extract the underlying knowledge. Unsupervised clustering methods are often considered the necessary first step in visualization and analysis of expression data. As for supervised classification, the problem mainly addressed so far has been how to find discriminative genes separating various samples or experimental conditions. Numerous methods have been applied to identify genes that help to predict treatment outcome or to confirm a diagnosis, as well as to identify primary elements of gene regulatory circuits. However, less attention has been devoted to using supervised learning to uncover relationships between genes and/or their products. To start filling this gap, a machine-learning approach for gene network reconstruction is described here. This approach is based on building classifiers: functions that determine the state of a gene's transcription machinery from the expression levels of other genes. The method can be applied to various cases where relationships between gene expression levels could be expected.
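The idea of a classifier that predicts one gene's state from the expression levels of other genes can be illustrated with a minimal sketch. The abstract does not specify the classifier used, so the single-threshold rule below (a "decision stump"), the mean-based discretization, the gene names and the expression values are all assumptions chosen for illustration, not the paper's method.

```python
from statistics import mean

def discretize(levels):
    """Label each sample 1 (above the gene's mean level) or 0 (at or below it)."""
    m = mean(levels)
    return [1 if x > m else 0 for x in levels]

def best_single_gene_rule(expr, target):
    """Find the one-gene threshold rule that best predicts the target's state.

    expr maps gene name -> list of expression levels (one per sample).
    Returns (training accuracy, predictor gene, threshold, direction), where
    direction 1 means "predict state 1 when predictor > threshold" and
    direction -1 means "predict state 1 when predictor < threshold".
    """
    states = discretize(expr[target])
    best = None
    for gene, levels in expr.items():
        if gene == target:
            continue
        for threshold in sorted(set(levels)):
            for direction in (1, -1):
                predicted = [1 if direction * (x - threshold) > 0 else 0
                             for x in levels]
                accuracy = sum(p == s for p, s in zip(predicted, states)) / len(states)
                if best is None or accuracy > best[0]:
                    best = (accuracy, gene, threshold, direction)
    return best

# Hypothetical toy data: six samples, three genes.
expr = {
    "geneA": [1, 2, 3, 8, 9, 10],
    "geneB": [0.5, 0.7, 0.6, 5, 6, 5.5],
    "geneC": [3, 1, 4, 1, 5, 9],
}
rule = best_single_gene_rule(expr, "geneB")
print(rule)
```

The reported accuracy is measured on the training samples only; a real analysis would need held-out data or cross-validation before reading a rule like this as evidence of a relationship between genes.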

A priori, de novo mathematical exploration of gene expression mechanism via regression viewpoint with briefly cataloged modeling antiquity

International Journal of Biomathematics ◽

10.1142/s1793524517500061 ◽

2016 ◽

Vol 10 (01) ◽

pp. 1750006

Author(s):

Shaurya Jauhari ◽

S. A. M. Rizvi

Keyword(s):

Gene Expression ◽

De Novo ◽

A Priori ◽

Mathematical Framework ◽

Reaction Synthesis ◽

Expression Data ◽

Boolean Models ◽

Mathematical Exploration ◽

Mathematical Techniques ◽

Expression Mechanism

Various algorithms have been devised to mathematically model the dynamic mechanism underlying gene expression data. Gillespie’s stochastic simulation algorithm (GSSA) has been fundamental to simulating chemical reaction systems and has since been refined. Several other mathematical techniques, such as differential equations, thermodynamic models and Boolean models, have been implemented to represent gene function effectively. We present a novel mathematical framework for gene expression that models the transcription and translation phases, a departure from conventional modeling approaches. These subprocesses are inherent to every instance of gene expression, which is implicitly an experimental outcome. We anticipate that a general model can be formulated for the basal translation or transcription values that correspond to a particular assay.

Analysis of Patterns of Gene Expression Variation within and between Ethnic Populations in Pediatric B-ALL

Cancer Informatics ◽

10.4137/cin.s11831 ◽

2013 ◽

Vol 12 ◽

pp. CIN.S11831 ◽

Cited By ~ 3

Author(s):

Chindo Hicks ◽

Lucio Miele ◽

Tejaswi Koganti ◽

LaFarra Young-Gaylor ◽

Deidre Rogers ◽

...

Keyword(s):

Gene Expression ◽

Gene Networks ◽

Lymphoblastic Leukemia ◽

The United States ◽

Expression Data ◽

Expression Levels ◽

Ethnic Populations ◽

Racial Ethnic ◽

Key Pathways ◽

Gene Expression Levels

B-Precursor acute lymphoblastic leukemia (B-ALL) is the most common childhood cancer. Although 80% of B-ALL patients can be cured, significant challenges persist. Significant disparities in clinical outcomes and mortality rates exist between racial/ethnic populations. The objective of this study was to determine whether gene expression levels significantly differ between ethnic populations. We compared gene expression levels between four ethnic populations (Whites, Blacks, Hispanics, and Asians) in the United States. Additionally, we performed network and pathway analysis to identify gene networks and pathways. Gene expression data involved 198 samples distributed as follows: 126 Whites, 51 Hispanics, 13 Blacks, and 8 Asians. We identified 300 highly significantly (P < 0.001) differentially expressed genes between the four ethnic populations. The identified genes included PHF6, BRD3, CRLF2, and RNF135, which have been implicated in pediatric B-ALL. We identified key pathways implicated in B-ALL including the PDGF, PI3/AKT, ERBB2-ERBB3, and IL-15 signaling pathways.

Graph Theoretic Techniques for Clustering and Biclustering gene expression data.

International Journal of Computer and Communication Technology ◽

10.47893/ijcct.2012.1136 ◽

2012 ◽

pp. 173-181

Author(s):

Prangyaparamita Mohapatra ◽

Tripti Swarnkar

Keyword(s):

Gene Expression ◽

Data Mining ◽

Gene Expression Data ◽

Biological Networks ◽

Clustering Algorithms ◽

Expression Data ◽

Microarray Technology ◽

Clustering Methods ◽

Experimental Conditions ◽

Data Set

DNA microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes during biological processes and across collections of related samples. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which are essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to the points in different groups. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and new algorithms have recently been proposed specifically for gene expression data. These clustering algorithms have proven useful for identifying biologically relevant groups of genes and samples. A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results of applying standard clustering methods to genes are limited, because there are a number of experimental conditions under which the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the gene expression matrix have been proposed to date. This simultaneous clustering, usually called biclustering, seeks to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. This type of algorithm has also been proposed and used in other fields, such as information retrieval and data mining. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches.
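The motivation for biclustering described above, that two genes may be correlated under some conditions and uncorrelated under others, can be shown with a small numerical sketch. The expression values and the choice of condition subset are hypothetical, and the code only evaluates a given subset of conditions; it does not search for biclusters.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression of two genes across eight conditions.
gene1 = [1, 2, 3, 4, 5, 1, 4, 2]
gene2 = [2, 4, 6, 8, 1, 5, 2, 4]

# Assumed subset of conditions (columns) forming a candidate bicluster.
bicluster_conditions = [0, 1, 2, 3]

r_all = pearson(gene1, gene2)
r_sub = pearson([gene1[i] for i in bicluster_conditions],
                [gene2[i] for i in bicluster_conditions])

print(f"all conditions: r = {r_all:.3f}")
print(f"bicluster conditions: r = {r_sub:.3f}")
```

Across all eight conditions the two genes look unrelated, while on the first four conditions they are perfectly correlated. A clustering method that compares whole rows would miss this pairing, which is the limitation biclustering aims to address.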

Bootstrapping Time-Course Gene Expression Data for Gene Networks: Application to Gene Relevance Networks

Journal of Computational Biology ◽

10.1089/cmb.2018.0029 ◽

2018 ◽

Vol 25 (12) ◽

pp. 1374-1384

Author(s):

Jeonifer M. Garren ◽

Jaejik Kim

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Networks ◽

Time Course ◽

Expression Data

Uncovering Potential Therapeutic Targets in Colorectal Cancer by Deciphering Mutational Status and Expression of Druggable Oncogenes

Cancers ◽

10.3390/cancers11070983 ◽

2019 ◽

Vol 11 (7) ◽

pp. 983 ◽

Cited By ~ 3

Author(s):

Otília Menyhart ◽

Tatsuhiko Kakisaka ◽

Lőrinc Sándor Pongor ◽

Hiroyuki Uetake ◽

Ajay Goel ◽

...

Keyword(s):

Gene Expression ◽

Colorectal Cancer ◽

Gene Expression Data ◽

Drug Targets ◽

Therapeutic Targets ◽

Independent Set ◽

Driver Mutations ◽

Expression Data ◽

Expression Levels ◽

Mutational Status

Background: Numerous driver mutations have been identified in colorectal cancer (CRC), but their relevance to the development of targeted therapies remains elusive. The secondary effects of pathogenic driver mutations on downstream signaling pathways offer a potential approach for the identification of therapeutic targets. We aimed to identify differentially expressed genes as potential drug targets linked to driver mutations. Methods: Somatic mutations and the gene expression data of 582 CRC patients were utilized, incorporating the mutational status of 39,916 and the expression levels of 20,500 genes. To uncover candidate targets, the expression levels of various genes in wild-type and mutant cases for the most frequent disruptive mutations were compared with a Mann–Whitney test. A survival analysis was performed in 2100 patients with transcriptomic gene expression data. Up-regulated genes associated with worse survival were filtered for potentially actionable targets. The most significant hits were validated in an independent set of 171 CRC patients. Results: Altogether, 426 disruptive mutation-associated upregulated genes were identified. Among these, 95 were linked to worse recurrence-free survival (RFS). Based on the druggability filter, 37 potentially actionable targets were revealed. We selected seven genes and validated their expression in 171 patient specimens. The best independently validated combinations were DUSP4 (p = 2.6 × 10^−12) in ACVR2A mutated (7.7%) patients; BMP4 (p = 1.6 × 10^−04) in SOX9 mutated (8.1%) patients; TRIB2 (p = 1.35 × 10^−14) in ACVR2A mutated patients; VSIG4 (p = 2.6 × 10^−05) in ANK3 mutated (7.6%) patients, and DUSP4 (p = 7.1 × 10^−04) in AMER1 mutated (8.2%) patients. Conclusions: The results uncovered potentially druggable genes in colorectal cancer. The identified mutations could enable future patient stratification for targeted therapy.
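The comparison of expression levels between wild-type and mutant cases with a Mann–Whitney test can be sketched as follows. This is an illustrative implementation, not the authors' pipeline: the expression values are hypothetical, and the p-value uses a normal approximation without tie or continuity correction, which is an assumption that statistical packages typically refine.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using a normal approximation.

    Returns (U, p), where U is the smaller of the two U statistics.
    Ties receive average ranks; no tie or continuity correction is applied.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p

# Hypothetical expression levels of one gene in two groups of samples.
wild_type = [1, 2, 3, 4, 5]
mutant = [6, 7, 8, 9, 10]
u, p = mann_whitney_u(wild_type, mutant)
print(f"U = {u}, p = {p:.4f}")
```

With samples this small an exact test would be preferable to the normal approximation; the sketch is meant only to show how the rank-based statistic is formed.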

Reconstructing Gene Networks of Forest Trees from Gene Expression Data: Toward Higher-Resolution Approaches

Communications in Computer and Information Science - ICT Innovations 2018. Engineering and Life Sciences ◽

10.1007/978-3-030-00825-3_1 ◽

2018 ◽

pp. 3-12 ◽

Cited By ~ 1

Author(s):

Matt Zinkgraf ◽

Andrew Groover ◽

Vladimir Filkov

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Networks ◽

Forest Trees ◽

Expression Data

Inferring time-lagged causality using the derivative of single-cell expression

10.1101/2021.02.03.429525 ◽

2021 ◽

Author(s):

Huan-Huan Wei ◽

Hui Lu ◽

Hongyu Zhao

Keyword(s):

Gene Expression ◽

Causal Inference ◽

Single Cell ◽

Causal Relationship ◽

Gene Expression Data ◽

Expression Data ◽

Causal Relationships ◽

Expression Levels ◽

Gene Pairs ◽

Time Lagged

Abstract: Many computational methods have been developed for inferring causality among genes using cross-sectional gene expression data, such as single-cell RNA sequencing (scRNA-seq) data. However, due to the limitations of scRNA-seq technologies, time-lagged causal relationships may be missed by existing methods. In this work, we propose a method, called causal inference with time-lagged information (CITL), to infer time-lagged causal relationships from scRNA-seq data by assessing conditional independence between the changing and current expression levels of genes. CITL estimates the changing expression levels of genes by “RNA velocity”. We demonstrate the accuracy and stability of CITL for inferring time-lagged causality on simulation data against other leading approaches. We have applied CITL to real scRNA-seq data and inferred 878 pairs of time-lagged causal relationships, with many of these inferred results supported by the literature.

Author summary: Computational causal inference is a promising way to survey causal relationships between genes efficiently. Though many causal inference methods have been applied to gene expression data, none considers the time-lagged causal relationship, in which a gene takes some time to affect its target genes through several reactions. If relationships between genes are time-lagged, the assumptions of existing methods are violated and the relationships become difficult to recognize. We demonstrate that this is indeed the case through simulation. Therefore, we develop a method for inferring time-lagged causal relationships from single-cell gene expression data. We assume that a time-lagged causal relationship should present a strong association between the cause and the change in the effect. To calculate this association, we first estimate the derivative of gene expression using the information from unspliced transcripts. Then, we use conditional independence tests to search for gene pairs satisfying our assumption. Our results suggest that we could accurately infer time-lagged causal gene pairs validated by published literature. This method may complement gene regulatory analysis and provide candidate gene pairs for further controlled experiments.
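Conditional independence between two quantities given a third is often assessed, under linear-Gaussian assumptions, with a partial correlation. The sketch below illustrates that general idea only; it is not the CITL procedure, whose test and conditioning sets are not described in the abstract. The variable names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical per-cell values: a shared driver, the current expression of a
# candidate cause, and the estimated change (velocity) of a candidate target.
shared_driver = [1, 2, 3, 4, 5, 6]
cause_current = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
target_change = [1.5, 2.5, 2.5, 3.5, 5.5, 6.5]

r_marginal = pearson(cause_current, target_change)
r_partial = partial_correlation(cause_current, target_change, shared_driver)

print(f"marginal correlation: {r_marginal:.3f}")
print(f"partial correlation given the shared driver: {r_partial:.3f}")
```

In this toy data the strong marginal association disappears once the shared driver is conditioned on, so the pair would not be retained as a direct relationship under this criterion.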

Building Gene Networks by Analyzing Gene Expression Profiles

Advanced Methodologies and Technologies in Medicine and Healthcare - Advances in Medical Diagnosis, Treatment, and Care ◽

10.4018/978-1-5225-7489-7.ch003 ◽

2019 ◽

pp. 27-44

Author(s):

Crescenzio Gallo

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Gene Networks ◽

Dna Microarrays ◽

Expression Profiles ◽

Expression Patterns ◽

Gene Expression Profiles ◽

Expression Data ◽

Gene Expressions ◽

Over Time

The possible applications of modeling and simulation in the field of bioinformatics are very extensive, ranging from understanding basic metabolic pathways to exploring genetic variability. Experiments carried out with DNA microarrays allow researchers to measure expression levels for thousands of genes simultaneously, across different conditions and over time. A key step in the analysis of gene expression data is the detection of groups of genes that manifest similar expression patterns. In this chapter, the authors examine various methods for analyzing gene expression data, addressing the important topics of (1) selecting the most differentially expressed genes, (2) grouping them by means of their relationships, and (3) classifying samples based on gene expressions.

Analyzing Large Gene Expression Data Sets

Computational Text Analysis ◽

10.1093/oso/9780198567400.003.0014 ◽

2006 ◽

Author(s):

Soumya Raychaudhuri

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Expression Analysis ◽

Gene Expression Analysis ◽

Data Sets ◽

Expression Data ◽

Clustering Methods ◽

Biologically Relevant ◽

Large Gene ◽

Functional Coherence

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since there are a myriad of genes examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. There is an enormous complexity to the data that we are examining: each gene is associated with dozens if not hundreds of expression values as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature-based approaches could help in the analysis of single condition experiments. Here we will apply the strategies introduced in Chapter 6 to assess the coherence of groups of genes to enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence. We apply functional coherence measures to gene expression analysis; for examples we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to focus in on the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.

Putative biomarkers for predicting tumor sample purity based on gene expression data

BMC Genomics ◽

10.1186/s12864-019-6412-8 ◽

2019 ◽

Vol 20 (1) ◽

Author(s):

Yuanyuan Li ◽

David M. Umbach ◽

Adrienna Bingham ◽

Qi-Jing Li ◽

Yuan Zhuang ◽

...

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Supervised Machine Learning ◽

Tumor Type ◽

Expression Data ◽

Expression Levels ◽

Gene Set ◽

Tumor Purity ◽

Tumor Types ◽

Cancerous Cells

Background: Tumor purity is the percentage of cancer cells present in a sample of tumor tissue. The non-cancerous cells (immune cells, fibroblasts, etc.) have an important role in tumor biology. The ability to determine tumor purity is important to understand the roles of cancerous and non-cancerous cells in a tumor. Methods: We applied a supervised machine learning method, XGBoost, to data from 33 TCGA tumor types to predict tumor purity using RNA-seq gene expression data. Results: Across the 33 tumor types, the median correlation between observed and predicted tumor purity ranged from 0.75 to 0.87 with small root mean square errors, suggesting that tumor purity can be accurately predicted using expression data. We further confirmed that expression levels of a ten-gene set (CSF2RB, RHOH, C1S, CCDC69, CCL22, CYTIP, POU2AF1, FGR, CCL21, and IL7R) were predictive of tumor purity regardless of tumor type. We tested whether our set of ten genes could accurately predict tumor purity of a TCGA-independent data set. We showed that expression levels from our set of ten genes were highly correlated (ρ = 0.88) with the actual observed tumor purity. Conclusions: Our analyses suggested that the ten-gene set may serve as a biomarker for tumor purity prediction using gene expression data.
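The two evaluation measures mentioned in the results, the correlation between observed and predicted purity and the root mean square error, can be computed as in the sketch below. The purity values are hypothetical, and the Pearson coefficient is an assumption: the abstract does not say which correlation coefficient was used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical tumor purity fractions for four samples.
observed_purity = [0.2, 0.4, 0.6, 0.8]
predicted_purity = [0.25, 0.35, 0.65, 0.75]

correlation = pearson(observed_purity, predicted_purity)
error = rmse(observed_purity, predicted_purity)

print(f"correlation = {correlation:.3f}, RMSE = {error:.3f}")
```

Reporting both measures is useful because a high correlation can coexist with a systematic offset in the predictions, which only the error term would reveal.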
