scholarly journals Case-Based Retrieval Framework for Gene Expression Data

2015 ◽  
Vol 14 ◽  
pp. CIN.S22371 ◽  
Author(s):  
Ali Anaissi ◽  
Madhu Goyal ◽  
Daniel R. Catchpoole ◽  
Ali Braytee ◽  
Paul J. Kennedy

Background The process of retrieving similar cases in a case-based reasoning system is considered a big challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurements require numerous machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process. Methods This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles. Results The herein-proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, as well as the Colon cancer, the National Cancer Institute (NCI), and the Prostate cancer data sets. Results obtained by the proposed framework in retrieving patients of the data sets who are similar to new patients are as follows: 96% accuracy on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set. Conclusion The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.

2004 ◽  
Vol 01 (04) ◽  
pp. 681-694 ◽  
Author(s):  
MAT SOUKUP ◽  
JAE K. LEE

Microarrays can provide genome-wide expression patterns for various cancers, especially for tumor sub-types that may exhibit substantially different patient prognosis. Using such gene expression data, several approaches have been proposed to classify tumor sub-types accurately. These classification methods are not robust, and often dependent on a particular training sample for modelling, which raises issues in utilizing these methods to administer proper treatment for a future patient. We propose to construct an optimal, robust prediction model for classifying cancer sub-types using gene expression data. Our model is constructed in a step-wise fashion implementing cross-validated quadratic discriminant analysis. At each step, all identified models are validated by an independent sample of patients to develop a robust model for future data. We apply the proposed methods to two microarray data sets of cancer: the acute leukemia data by Golub et al.3 and the colon cancer data by Alon et al.12 We have found that the dimensionality of our optimal prediction models is relatively small for these cases and that our prediction models with one or two gene factors outperforms or has competing performance, especially for independent samples, to other methods based on 50 or more predictive gene factors. The methodology is implemented and developed by the procedures in R and Splus. The source code can be obtained at .


Blood ◽  
2015 ◽  
Vol 126 (23) ◽  
pp. 2663-2663
Author(s):  
Matthew A Care ◽  
Stephen M Thirdborough ◽  
Andrew J Davies ◽  
Peter W.M. Johnson ◽  
Andrew Jack ◽  
...  

Abstract Purpose To assess whether comparative gene network analysis can reveal characteristic immune response signatures that predict clinical response in Diffuse large B-cell lymphoma (DLBCL). Background The wealth of available gene expression data sets for DLBCL and other cancer types provides a resource to define recurrent pathological processes at the level of gene expression and gene correlation neighbourhoods. This is of particular relevance in the context of cancer immune responses, where convergence onto common patterns may drive shared gene expression profiles. Where existing and novel immunotherapies harness the immune response for therapeutic benefit such responses may provide predictive biomarkers. Methods We independently analysed publically available DLBCL gene expression data sets and a wide compendium of gene expression data from diverse cancer types, and then asked whether common elements of cancer host response could be identified from resulting networks. Using 10 DLBCL gene expression data sets, encompassing 2030 cases, we established pairwise gene correlation matrices per data set, which were merged to generate median correlations of gene pairs across all data sets. Gene network analysis and unsupervised clustering was then applied to define global representations of DLBCL gene expression neighbourhoods. In parallel a diverse range of solid and lymphoid malignancies including; breast, colorectal, oesophageal, head and neck, non-small cell lung, prostate, pancreatic cancer, Hodgkin lymphoma, Follicular lymphoma and DLBCL were independently analysed using an orthogonal weighted gene correlation network analysis of gene expression data sets from which correlated modules across diverse cancer types were identified. The biology of resulting gene neighbourhoods was assessed by signature and ontology enrichment, and the overlap between gene correlation neighbourhoods and WGCNA derived modules associated with immune/host responses was analysed. Results Amongst DLBCL data, we identified distinct gene correlation neighbourhoods associated with the immune response. These included both elements of IFN-polarised responses, core T-cell, and cytotoxic signatures as well as distinct macrophage responses. Neighbourhoods linked to macrophages separated CD163 from CD68 and CD14. In the WGCNA analysis of diverse cancer types clusters corresponding to these immune response neighbourhoods were independently identified including a highly similar cluster related to CD163. The overlapping CD163 clusters in both analyses linked to diverse Fc-Receptors, complement pathway components and patterns of scavenger receptors potentially linked to alternative macrophage activation. The relationship between the CD163 macrophage gene expression cluster and outcome was tested in DLBCL data sets, identifying a poor response in CD163 -cluster high patients, which reached statistical significance in one data set (GSE10846). Notably, the effect of the CD163-associated gene neighbourhood which correlates with poor outcome post rituximab containing immunochemotherapy is distinct from the effect of IFNG-STAT1-IRF1 polarised cytotoxic responses. The latter represents the predominant immune response pattern separating cell of origin unclassifiable (Type-III) DLBCL from either ABC or GCB DLBCL subsets, and is associated with a trend toward positive outcome. Conclusion Comparative gene expression network analysis identifies common immune response signatures shared between DLBCL and other cancer types. Gene expression clusters linked to CD163 macrophage responses and IFNG-STAT1-IRF1 polarised cytotoxic responses are common patterns with apparent divergent outcome association. Disclosures Davies: CTI: Honoraria; GIlead: Consultancy, Honoraria, Research Funding; Mundipharma: Honoraria, Research Funding; Bayer: Research Funding; Takeda: Honoraria, Research Funding; Janssen: Honoraria, Research Funding; Roche: Honoraria, Research Funding; GSK: Research Funding; Pfizer: Honoraria; Celgene: Honoraria, Research Funding. Jack:Jannsen: Research Funding.


2014 ◽  
Author(s):  
Adrin Jalali ◽  
Nico Pfeifer

Motivation: Molecular measurements from cancer patients such as gene expression and DNA methylation are usually very noisy. Furthermore, cancer types can be very heterogeneous. Therefore, one of the main assumptions for machine learning, that the underlying unknown distribution is the same for all samples, might not be completely fullfilled. We introduce a method, that can estimate this bias on a per-feature level and incorporate calculated feature confidences into a weighted combination of classifiers with disjoint feature sets. Results: The new method achieves state-of-the-art performance on many different cancer data sets with measured DNA methylation or gene expression. Moreover, we show how to visualize the learned classifiers to find interesting associations with the target label. Applied to a leukemia data set we find several ribosomal proteins associated with leukemia's risk group that might be interesting targets for follow-up studies and support the hypothesis that the ribosomes are a new frontier in gene regulation. Availability: The method is available under GPLv3+ License at https: //github.com/adrinjalali/Network-Classifier.


2017 ◽  
Author(s):  
Momeneh Foroutan ◽  
Dharmesh D. Bhuva ◽  
Kristy Horan ◽  
Ruqian Lyu ◽  
Joseph Cursons ◽  
...  

AbstractBackgroundGene set scoring provides a useful approach for quantifying concordance between sample transcriptomes and selected molecular signatures. Most methods use information from all samples to score an individual sample, leading to unstable scores in small data sets and introducing biases from sample composition across a data set (e.g. varying numbers of samples for different cancer subtypes). To address these issues we have developed a truly single sample scoring method, and associated R/Bioconductor package singscore.ResultsWe have developed a rank-based single sample scoring method, implemented as a Bioconductor package. We use multiple cancer data sets to compare it against widely-used scoring methods, including GSVA, z-scores, PLAGE, and ssGSEA. Our approach does not depend upon background samples and thus the scores are stable regardless of the composition and number of samples in the gene expression data set. In contrast, scores obtained by GSVA, z-score, PLAGE and ssGSEA can be unstable when less data are available (nsamples < 25). We show that the computational time for singscore is faster than current implementations of GSVA and ssGSEA, and is comparable with that of z-score and PLAGE. The singscore package also produces visualisations and interactive plots that enable exploration of molecular phenotypes.ConclusionsThe single sample scoring method described here is independent of sample composition in gene expression data and thus it provides stable scores that are less likely to be influenced by unwanted variation across samples. These scores can be used for dimensional reduction of transcriptomic data and the phenotypic landscapes obtained by scoring samples against multiple molecular signatures may provide insights for sample stratification.


Author(s):  
Benny Yiu-ming Fung ◽  
Vincent To-yee Ng

When classifying tumors using gene expression data, mining tasks commonly make use of only a single data set. However, classification models based on patterns extracted from a single data set are often not indicative of an entire population and heterogeneous samples subsequently applied to these models may not fit, leading to performance degradation. In short, it is not possible to guarantee that mining results based on a single gene expression data set will be reliable or robust (Miller et al., 2002). This problem can be addressed using classification algorithms capable of handling multiple, heterogeneous gene expression data sets. Apart from improving mining performance, the use of such algorithms would make mining results less sensitive to the variations of different microarray platforms and to experimental conditions embedded in heterogeneous gene expression data sets.


2018 ◽  
Vol 25 ◽  
pp. 9-16
Author(s):  
M Shahjaman ◽  
N Kumar ◽  
AA Begum ◽  
SMS Islam ◽  
MNH Mollah

The main purpose of gene expression data analysis is to identify the biomarker genes by comparing the gene expression levels between two different groups or conditions. There are several methods to select biomarker genes and many comparative studies have been performed to select the appropriate method. However, they did not consider the problems of outliers in their data sets though it is very essential to select the method from robustness point of view due to outliers may occur in the different steps of the gene expression data generating process. In this paper, it is evaluated the performance among five popular statistical biomarker gene selection methods viz. T-test, SAM, LIMMA, KW and FCROS using both simulated and real gene expression data sets in absence and presence of outliers. In the simulated data analysis, it was demonstrated the performance of these methods in terms of different performance measures such as TPR, TNR, FPR, FNR and AUC and based on these measures, it was found that in absence of outliers, for both small-and-large sample cases all the methods perform almost similar. Whereas, in presence of outliers, for small-sample case only the FCROS method perform well than other methods. From a real colon cancer data analysis, it was elucidated that FCROS method identified additional 59 genes that were not detected by the other methods and most of them belongs to the different cancer related pathways.J. bio-sci. 25: 9-16, 2017


2000 ◽  
Vol 3 (1) ◽  
pp. 9-15 ◽  
Author(s):  
PETER J. WOOLF ◽  
YIXIN WANG

Woolf, Peter J., and Yixin Wang. A fuzzy logic approach to analyzing gene expression data. Physiol Genomics 3: 9–15, 2000.—We have developed a novel algorithm for analyzing gene expression data. This algorithm uses fuzzy logic to transform expression values into qualitative descriptors that can be evaluated by using a set of heuristic rules. In our tests we designed a model to find triplets of activators, repressors, and targets in a yeast gene expression data set. For the conditions tested, the predictions made by the algorithm agree well with experimental data in the literature. The algorithm can also assist in determining the function of uncharacterized proteins and is able to detect a substantially larger number of transcription factors than could be found at random. This technology extends current techniques such as clustering in that it allows the user to generate a connected network of genes using only expression data.


Author(s):  
Soumya Raychaudhuri

The most interesting and challenging gene expression data sets to analyze are large multidimensional data sets that contain expression values for many genes across multiple conditions. In these data sets the use of scientific text can be particularly useful, since there are a myriad of genes examined under vastly different conditions, each of which may induce or repress expression of the same gene for different reasons. There is an enormous complexity to the data that we are examining—each gene is associated with dozens if not hundreds of expression values as well as multiple documents built up from vocabularies consisting of thousands of words. In Section 2.4 we reviewed common gene expression strategies, most of which revolve around defining groups of genes based on common profiles. A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present computational methods that leverage the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in gene expression data analysis offers an opportunity to incorporate background functional information about the genes when defining expression clusters. In Chapter 5 we saw how literature- based approaches could help in the analysis of single condition experiments. Here we will apply the strategies introduced in Chapter 6 to assess the coherence of groups of genes to enhance gene expression analysis approaches. The methods proposed here could, in fact, be applied to any multivariate genomics data type. The key concepts discussed in this chapter are listed in the frame box. We begin with a discussion of gene groups and their role in expression analysis; we briefly discuss strategies to assign keywords to groups and strategies to assess their functional coherence. We apply functional coherence measures to gene expression analysis; for examples we focus on a yeast expression data set. We first demonstrate how functional coherence can be used to focus in on the key biologically relevant gene groups derived by clustering methods such as self-organizing maps and k-means clustering.


Symmetry ◽  
2020 ◽  
Vol 12 (1) ◽  
pp. 154 ◽  
Author(s):  
Ho Sun Shon ◽  
Erdenebileg Batbaatar ◽  
Kyoung Ok Kim ◽  
Eun Jong Cha ◽  
Kyung-Ah Kim

Recently, large-scale bioinformatics and genomic data have been generated using advanced biotechnology methods, thus increasing the importance of analyzing such data. Numerous data mining methods have been developed to process genomic data in the field of bioinformatics. We extracted significant genes for the prognosis prediction of 1157 patients using gene expression data from patients with kidney cancer. We then proposed an end-to-end, cost-sensitive hybrid deep learning (COST-HDL) approach with a cost-sensitive loss function for classification tasks on imbalanced kidney cancer data. Here, we combined the deep symmetric auto encoder; the decoder is symmetric to the encoder in terms of layer structure, with reconstruction loss for non-linear feature extraction and neural network with balanced classification loss for prognosis prediction to address data imbalance problems. Combined clinical data from patients with kidney cancer and gene data were used to determine the optimal classification model and estimate classification accuracy by sample type, primary diagnosis, tumor stage, and vital status as risk factors representing the state of patients. Experimental results showed that the COST-HDL approach was more efficient with gene expression data for kidney cancer prognosis than other conventional machine learning and data mining techniques. These results could be applied to extract features from gene biomarkers for prognosis prediction of kidney cancer and prevention and early diagnosis.


2015 ◽  
Vol 13 (06) ◽  
pp. 1550019 ◽  
Author(s):  
Alexei A. Sharov ◽  
David Schlessinger ◽  
Minoru S. H. Ko

We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users’ own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher’s methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein–protein interaction) are pre-loaded and can be used for functional annotations.


Sign in / Sign up

Export Citation Format

Share Document