Item response theory modeling for microarray gene expression data

2009 ◽  
Vol 6 (1) ◽  
Author(s):  
Andrej Kastrin

The high dimensionality of global gene expression profiles, where the number of variables (genes) is very large compared to the number of observations (samples), presents challenges that affect the generalizability and applicability of microarray analysis. Latent variable modeling offers a promising approach to dealing with high-dimensional microarray data. The latent variable model is based on a few latent variables that capture most of the gene expression information. Here, we describe how to accomplish a reduction in dimension with a latent variable methodology, which can greatly reduce the number of features used to characterize microarray data. We propose a general latent variable framework for predicting predefined classes of samples using gene expression profiles from microarray experiments. The framework consists of (i) selection of a smaller number of genes that are most differentially expressed between samples, (ii) dimension reduction using hierarchical clustering, where each cluster partition is identified as a latent variable, (iii) discretization of the gene expression matrix, (iv) fitting the Rasch item response model for genes in each cluster partition to estimate the expression of the latent variable, and (v) construction of a prediction model with latent variables as covariates to study the relationship between latent variables and phenotype. Two different microarray data sets are used to illustrate the approach. We show that the predictive performance of our method is comparable to the current best approach based on an all-gene space. The method is general and can be applied to other high-dimensional data problems.
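
Step (iv) of the framework rests on the Rasch model, in which the probability that a discretized gene is "on" in a sample depends only on the difference between the sample's latent level and the gene's difficulty. The sketch below is a minimal illustration of that idea, not the author's implementation. The function names, the toy difficulties, and the choice of Newton-Raphson with item difficulties assumed known are all assumptions made for the example.

```python
import math

def rasch_prob(theta, b):
    """P(X = 1) under the Rasch model for latent level theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, difficulties, n_iter=50, tol=1e-8):
    """Maximum-likelihood estimate of one sample's latent level.

    responses    -- list of 0/1 values (discretized expression of the genes in a cluster)
    difficulties -- item difficulties, assumed known here (hypothetical values)
    """
    score = sum(responses)
    if score == 0 or score == len(responses):
        # The likelihood has no finite maximum for all-0 or all-1 patterns.
        raise ValueError("ML estimate is not finite for extreme response patterns")
    theta = 0.0
    for _ in range(n_iter):
        probs = [rasch_prob(theta, b) for b in difficulties]
        gradient = score - sum(probs)                   # first derivative of the log-likelihood
        information = sum(p * (1.0 - p) for p in probs)  # negative second derivative
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Toy usage: three genes with hypothetical difficulties, two samples.
toy_difficulties = [-1.0, 0.0, 1.0]
theta_high = estimate_theta([1, 1, 0], toy_difficulties)
theta_low = estimate_theta([1, 0, 0], toy_difficulties)
```

The gradient depends on the responses only through their sum, which reflects the fact that the raw score is a sufficient statistic for the latent level in the Rasch model. In a full analysis the item difficulties would be estimated as well, typically with a dedicated item response theory package.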

2010 ◽  
Vol 49 (03) ◽  
pp. 254-268 ◽  
Author(s):  
C.-S. Yang ◽  
K.-C. Wu ◽  
C.-H. Yang ◽  
L.-Y. Chuang

Summary Background: Microarray data with reference to gene expression profiles have provided some valuable results related to a variety of problems, and contributed to advances in clinical medicine. Microarray data characteristically have a high dimension and small sample size, which makes it difficult for a general classification method to classify samples correctly. However, not every gene is potentially relevant for distinguishing the sample class. Thus, in order to analyze gene expression profiles correctly, feature (gene) selection is crucial for the classification process, and an effective gene extraction method is necessary for eliminating irrelevant genes and decreasing the classification error rate. Objective: The purpose of gene expression analysis is to discriminate between classes of samples, and to predict the relative importance of each gene for sample classification. Method: In this paper, correlation-based feature selection (CFS) and Taguchi-binary particle swarm optimization (TBPSO) were combined into a hybrid method, and the K-nearest neighbor (K-NN) method with leave-one-out cross-validation (LOOCV) served as a classifier for ten gene expression profiles. Results: Experimental results show that this hybrid method effectively simplifies feature selection by reducing the number of features needed. The proposed method achieved the lowest classification error rate on all ten gene expression data sets tested. For six of the data sets, a classification error rate of zero was reached. Conclusion: The introduced method outperformed five other methods from the literature in terms of classification error rate. It could thus constitute a valuable tool for gene expression analysis in future studies.
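
The evaluation step described above, K-NN scored by leave-one-out cross-validation, can be sketched as follows. This is a minimal illustration under stated assumptions: the CFS and TBPSO feature selection stages are not shown, Euclidean distance is assumed, and the function names and toy data are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=1):
    """Predict the label of x by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loocv_error(X, y, k=1):
    """Leave-one-out error rate: each sample is classified by a model built on all the others."""
    errors = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)

# Toy data: six samples described by two already-selected genes (made-up values).
toy_X = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
toy_y = ["A", "A", "A", "B", "B", "B"]
toy_error = loocv_error(toy_X, toy_y, k=1)
```

In a wrapper-style search, an error rate of this kind would be computed for each candidate gene subset and used as the fitness value that the optimizer tries to minimize.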


2015 ◽  
Vol 11 (10) ◽  
pp. 2690-2698 ◽  
Author(s):  
Mirko Francesconi ◽  
Ben Lehner

Gene expression profiling is a fast, cheap and standardised analysis that provides a high-dimensional measurement of the state of a biological sample, including single cells. Computational methods to reconstruct the composition of samples and spatial and temporal information from expression profiles are described, as well as how they can be used to describe the effects of genetic variation.


2017 ◽  
Author(s):  
Brian Cleary ◽  
Le Cong ◽  
Eric S. Lander ◽  
Aviv Regev

Abstract RNA profiling is an excellent phenotype of cellular responses and tissue states, but can be costly to generate at the massive scale required for studies of regulatory circuits, genetic states or perturbation screens. Here, we draw on a series of advances over the last decade in the field of mathematics to establish a rigorous link between biological structure, data compressibility, and efficient data acquisition. We propose that very few random composite measurements – in which gene abundances are combined in a random linear combination – are needed to approximate the high-dimensional similarity between any pair of gene abundance profiles. We then show how finding latent, sparse representations of gene expression data would enable us to “decompress” a small number of random composite measurements and recover high-dimensional gene expression levels that were not measured (unobserved). We present a new algorithm for finding sparse, modular structure, which improves the ability to interpret samples in terms of small numbers of active modules, and show that the modular structure we find is sufficient to recover gene expression profiles from composite measurements (with ~100-fold fewer composite measurements than genes). Moreover, the knowledge that sparse, modular structures exist allows us to recover expression profiles from composite measurements, even without access to any training data. Finally, we present a proof-of-concept experiment for making composite measurements in the laboratory, involving the measurement of linear combinations of RNA abundances. Altogether, our results suggest new compressive modalities in experimental biology that can form a foundation for massive scaling in high-throughput measurements, while also offering new insights into the interpretation of high-dimensional data.
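
The claim that a few random composite measurements approximately preserve the similarity between two profiles can be illustrated with a small simulation. The sketch below is not the authors' algorithm: it assumes Gaussian random weights, uses synthetic uniform-random profiles instead of real expression data, and every name in it is hypothetical. It does not attempt the sparse "decompression" step.

```python
import math
import random

def random_composite_matrix(n_measurements, n_genes, seed=0):
    """Each row defines one composite measurement: random Gaussian weights over all genes."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_measurements)  # keeps expected squared distances unchanged
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_genes)]
            for _ in range(n_measurements)]

def measure(matrix, profile):
    """Apply every composite measurement (one weighted sum per row) to a profile."""
    return [sum(w * x for w, x in zip(row, profile)) for row in matrix]

# Synthetic stand-ins for two gene abundance profiles (not real data).
rng = random.Random(1)
n_genes, n_measurements = 2000, 200
profile_a = [rng.random() for _ in range(n_genes)]
profile_b = [rng.random() for _ in range(n_genes)]

A = random_composite_matrix(n_measurements, n_genes)
full_distance = math.dist(profile_a, profile_b)
compressed_distance = math.dist(measure(A, profile_a), measure(A, profile_b))
ratio = compressed_distance / full_distance
print(f"distance ratio (compressed / full): {ratio:.3f}")
```

With ten times fewer measurements than genes, the ratio is expected to stay close to 1. Recovering the unmeasured expression levels themselves additionally requires the sparse, modular structure described in the abstract.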


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tianzhong Yang ◽  
Jingbo Niu ◽  
Han Chen ◽  
Peng Wei

Abstract Background Environmental exposures can regulate intermediate molecular phenotypes, such as gene expression, by different mechanisms and thereby lead to various health outcomes. It is of significant scientific interest to unravel the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposure and traits. Mediation analysis is an important tool for investigating such relationships. However, it has mainly focused on low-dimensional settings, and there is a lack of a good measure of the total mediation effect. Here, we extend an R-squared (R²) effect size measure, originally proposed in the single-mediator setting, to the moderate- and high-dimensional mediator settings in the mixed model framework. Results Based on extensive simulations, we compare our measure and estimation procedure with several frequently used mediation measures, including product, proportion, and ratio measures. Our R²-based second-moment measure has small bias and variance under the correctly specified model. To mitigate potential bias induced by non-mediators, we examine two variable selection procedures, i.e., iterative sure independence screening and false discovery rate control, to exclude the non-mediators. We establish the consistency of the proposed estimation procedures and introduce a resampling-based confidence interval. By applying the proposed estimation procedure, we found that 38% of the age-related variations in systolic blood pressure can be explained by gene expression profiles in the Framingham Heart Study of 1711 individuals. An R package “RsqMed” is available on CRAN. Conclusion R-squared (R²) is an effective and efficient measure for the total mediation effect, especially in the high-dimensional setting.
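
The abstract does not state the formula. For orientation, the single-mediator R-squared measure is usually written in the form below, which is included here as an assumption, not as a quotation from the paper. Y denotes the outcome, X the exposure, and M the mediator (or set of mediators).

```latex
R^2_{\mathrm{Med}} = R^2_{Y,M} + R^2_{Y,X} - R^2_{Y,MX}
```

Here R^2_{Y,M} and R^2_{Y,X} are the proportions of variance in Y explained by M alone and by X alone, and R^2_{Y,MX} is the proportion explained by M and X jointly. The measure can be read as the variance in Y that is common to both the exposure and the mediators.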


2021 ◽  
Author(s):  
Christos Fotis ◽  
George Alevizos ◽  
Nikolaos Meimetis ◽  
Christina Koleri ◽  
Thomas Gkekas ◽  
...  

The analysis and comparison of compounds' transcriptomic signatures can help elucidate a compound's Mechanism of Action (MoA) in a biological system. In order to take into account the complexity of the biological system, several computational methods have been developed that utilize prior knowledge of molecular interactions to create a signaling network representation that best explains the compound's effect. However, due to their complex structure, large scale datasets of compound-induced signaling networks and methods specifically tailored to their analysis and comparison are very limited. Our goal is to develop graph deep learning models that are optimized to transform compound-induced signaling networks into high-dimensional representations and investigate their relationship with their respective MoAs. We created a new dataset of compound-induced signaling networks by applying the CARNIVAL network creation pipeline on the gene expression profiles of the CMap dataset. Furthermore, we developed a novel unsupervised graph deep learning pipeline, called deepSNEM, to encode the information in the compound-induced signaling networks in fixed-length high-dimensional representations. The core of deepSNEM is a graph transformer network, trained to maximize the mutual information between whole-graph and sub-graph representations that belong to similar perturbations. By clustering the deepSNEM embeddings, using the k-means algorithm, we were able to identify distinct clusters that are significantly enriched for mTOR, topoisomerase, HDAC and protein synthesis inhibitors respectively. Additionally, we developed a subgraph importance pipeline and identified important nodes and subgraphs that were found to be directly related to the most prevalent MoA of the assigned cluster. 
As a use case, deepSNEM was applied to compounds' gene expression profiles from various experimental platforms (microarrays and RNA sequencing), and the results indicate that correct hypotheses can be generated regarding their MoA.
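
The clustering step mentioned above, k-means applied to fixed-length embeddings, can be sketched as follows. This is a plain Lloyd's-algorithm illustration and not the deepSNEM pipeline: the two-dimensional toy vectors stand in for the learned embeddings, and the function name, seed, and data are assumptions.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # initialize from k random points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Toy embeddings: two well-separated groups (made-up values).
toy_embeddings = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
toy_labels, toy_centroids = kmeans(toy_embeddings, k=2)
```

Once cluster labels are available, each cluster can be tested for enrichment of a given MoA annotation, which is the kind of analysis the abstract reports.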


2019 ◽  
Vol 39 (9) ◽  
Author(s):  
Keling Liu ◽  
Qingmei Fu ◽  
Yao Liu ◽  
Chenhong Wang

Abstract Preeclampsia (PE) is a disorder of pregnancy that is characterised by hypertension and a significant amount of proteinuria beginning after 20 weeks of pregnancy. It is closely associated with high maternal morbidity, mortality, maternal organ dysfunction or foetal growth restriction. Therefore, it is necessary to identify early and novel diagnostic biomarkers of PE. In the present study, we performed a multi-step integrative bioinformatics analysis of microarray data for identifying hub genes as diagnostic biomarkers of PE. With the help of gene expression profiles of the Gene Expression Omnibus (GEO) dataset GSE60438, a total of 268 dysregulated genes were identified including 131 up- and 137 down-regulated differentially expressed genes (DEGs). Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of DEGs suggested that DEGs were significantly enriched in disease-related biological processes (BPs) such as hormone activity, immune response, steroid hormone biosynthesis, metabolic pathways, and other signalling pathways. Using the STRING database, we established a protein–protein interaction (PPI) network based on the above DEGs. Module analysis and identification of hub genes were performed to screen a total of 17 significant hub genes. The support vector machines (SVMs) model was used to predict the potential application of biomarkers in PE diagnosis with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.958 in the training set and 0.834 in the test set, suggesting that this risk classifier has good discrimination between PE patients and control samples. Our results demonstrated that these 17 differentially expressed hub genes can be used as potential biomarkers for diagnosis of PE.
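
The AUC values quoted above summarize how well a classifier's scores rank patients above controls. A minimal sketch of the computation follows. It uses the pairwise-comparison definition of AUC, and the function name, labels, and scores are hypothetical; in the study the scores would come from the trained SVM.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels -- 1 for cases (e.g. PE), 0 for controls
    scores -- classifier outputs, higher meaning more likely to be a case
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example with made-up scores: one of the four case/control pairs is misordered.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a drop between training and test sets, as reported in the abstract, indicates some loss of discrimination on unseen samples.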

