Random Subspace Aggregation for Cancer Prediction with Gene Expression Profiles

2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Liying Yang ◽  
Zhimin Liu ◽  
Xiguo Yuan ◽  
Jianhua Wei ◽  
Junying Zhang

Background. Precisely predicting cancer is crucial for cancer treatment. Gene expression profiles make it possible to analyze patterns between genes and cancers on a genome-wide scale. Gene expression data analysis, however, faces enormous challenges owing to characteristics of the data such as high dimensionality, small sample size, and low signal-to-noise ratio. Results. This paper proposes a method, termed RS_SVM, to predict cancer from gene expression profiles by aggregating SVMs trained on random subspaces. After choosing gene features through statistical analysis, RS_SVM randomly selects feature subsets to yield random subspaces, trains an SVM classifier on each, and then aggregates the SVM classifiers to capture the advantage of ensemble learning. Experiments on eight real gene expression datasets are performed to validate the RS_SVM method. Experimental results show that RS_SVM achieved better classification accuracy and generalization performance than a single SVM, K-nearest neighbor, decision tree, Bagging, AdaBoost, and state-of-the-art methods. Experiments also explored the effect of subspace size on prediction performance. Conclusions. The proposed RS_SVM method yielded superior performance in analyzing gene expression profiles, which demonstrates that RS_SVM is a promising approach for such biological data.
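
The random subspace idea in this abstract can be illustrated with a minimal sketch. It rests on several assumptions: a nearest-centroid classifier stands in for the SVM base learner so that the example needs only the standard library, aggregation is done by majority vote (the abstract does not say how the classifiers are combined), and the statistical feature pre-selection step is omitted. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def train_centroid(X, y, features):
    """Nearest-centroid stand-in for the SVM base learner (an assumption)."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
    # Per-class mean of each selected feature.
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroid(centroids, row, features):
    """Return the label whose centroid is closest in the chosen subspace."""
    def sq_dist(center):
        return sum((row[f] - c) ** 2 for f, c in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def fit_random_subspace(X, y, n_models=15, subspace_size=2, seed=0):
    """Train one base learner per randomly drawn feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_models):
        features = rng.sample(range(n_features), subspace_size)
        ensemble.append((features, train_centroid(X, y, features)))
    return ensemble

def predict_ensemble(ensemble, row):
    """Aggregate base learners by majority vote (assumed aggregation rule)."""
    votes = Counter(predict_centroid(c, row, f) for f, c in ensemble)
    return votes.most_common(1)[0][0]

# Toy "expression" matrix: 6 samples x 4 genes, two classes (hypothetical data).
X = [[1.0, 0.9, 1.1, 1.0], [0.8, 1.2, 0.9, 1.1], [1.1, 1.0, 1.0, 0.9],
     [3.0, 3.1, 2.9, 3.0], [3.2, 2.8, 3.1, 2.9], [2.9, 3.0, 3.0, 3.1]]
y = ["normal"] * 3 + ["tumor"] * 3
ensemble = fit_random_subspace(X, y)
```

The `subspace_size` parameter corresponds to the subspace size whose effect the experiments explore; an odd `n_models` avoids tied votes in the two-class case.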

2004 ◽  
Vol 3 (1) ◽  
pp. 1-19 ◽  
Author(s):  
Minhui Paik ◽  
Yuhong Yang

Various discriminant methods have been applied for classification of tumors based on gene expression profiles, among which the nearest neighbor (NN) method has been reported to perform relatively well. Usually, cross-validation (CV) is used to select the neighbor size as well as the number of variables for the NN method. However, CV can perform poorly when there is considerable uncertainty in choosing the best candidate classifier. As an alternative to selecting a single “winner,” we propose a weighting method to combine the multiple NN rules. Four gene expression data sets are used to compare its performance with CV methods. The results show that when the CV selection is unstable, the combined classifier performs much better.
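
As a rough illustration of combining several NN rules rather than picking one by CV, the sketch below weights each k-NN rule by its leave-one-out accuracy and sums the weighted votes. This weighting is an illustrative assumption and not the scheme proposed in the paper, which the abstract does not specify; the sketch also varies only the neighbor size, not the number of variables. Function names and data are hypothetical.

```python
from collections import Counter, defaultdict

def knn_predict(train, query, k):
    """Plain k-NN majority vote with squared Euclidean distance.

    train is a list of (vector, label) pairs.
    """
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy of the k-NN rule on the training data."""
    hits = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, vec, k) == label
    return hits / len(data)

def combined_predict(data, query, ks=(1, 3, 5)):
    """Sum votes over several k values, weighted by leave-one-out accuracy.

    The weights are an assumed, simplistic choice for illustration only.
    """
    weights = {k: loo_accuracy(data, k) for k in ks}
    scores = defaultdict(float)
    for k, w in weights.items():
        scores[knn_predict(data, query, k)] += w
    return max(scores, key=scores.get)

# Hypothetical two-class toy data.
data = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
        ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"), ((6, 6), "b")]
label = combined_predict(data, (0.5, 0.5))
```

When the candidate rules have similar estimated accuracy, no single rule dominates the vote, which is the situation in which the abstract reports CV selection to be unstable.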


2010 ◽  
Vol 8 (3) ◽  
pp. 291-297 ◽  
Author(s):  
Patricia Maria de Carvalho Aguiar ◽  
Patricia Severino

ABSTRACT Objective: To evaluate the performance of gene expression analysis in the peripheral blood of Parkinson disease patients with different genetic profiles, using microarrays as a tool to identify possible disease-related biomarkers that could contribute to the elucidation of the pathological process and be useful in diagnosis. Methods: Global gene expression analysis by means of DNA microarrays was performed in peripheral blood of Parkinson disease patients with previously identified mutations in PARK2 or PARK8 genes, Parkinson disease patients without known mutations in these genes, and normal controls. Each group consisted of five individuals. Results: Global gene expression profiles were heterogeneous among patients and controls, and it was not possible to detect a consistent pattern between groups. However, analyzing genes with differential expression of p < 0.005 and fold change ≥ 1.2, we were able to identify a small group of well-annotated genes. Conclusions: Despite the small sample size, the identification of differentially expressed genes suggests that the microarray technique may be useful in identifying potential biomarkers in the peripheral blood of Parkinson disease patients or in people at risk of developing the disease. This will be important once neuroprotective therapies become available, and may contribute to the identification of new pathways involved in the disease physiopathology. Results presented here should be further validated in larger groups of patients.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Ling Zhang ◽  
Ishwor Thapa ◽  
Christian Haas ◽  
Dhundy Bastola

Abstract. Background: High-throughput gene expression profiles have allowed the discovery of potential biomarkers enabling early diagnosis, prognosis, and the development of individualized treatment. However, it remains a challenge to identify a set of reliable and reproducible biomarkers across various gene expression platforms and laboratories for single-sample diagnosis and prognosis. We address this need with our Data-Driven Reference (DDR) approach, which employs stably expressed housekeeping genes as references to eliminate platform-specific biases and non-biological variabilities. Results: Our method identifies biomarkers with “built-in” features, and these features can be interpreted consistently regardless of profiling technology, which enables classification of single samples independent of platform. Validation with RNA-seq data of blood platelets shows that DDR achieves superior performance in classification of six different tumor types as well as molecular target statuses (such as MET or HER2-positive, and mutant KRAS, EGFR or PIK3CA) with smaller sets of biomarkers. We demonstrate on three microarray datasets that our method is capable of identifying robust biomarkers for subgrouping medulloblastoma samples with data perturbation due to different microarray platforms. In addition to identifying the majority of subgroup-specific biomarkers in the nanoString CodeSet, our method detected some potential new biomarkers for subgrouping medulloblastoma. Conclusions: In this study, we present a simple yet powerful data-driven method that contributes significantly to the identification of robust cross-platform gene signatures for single-patient disease classification to facilitate precision medicine. In addition, our method provides a new strategy for transcriptome analysis.
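
The general idea of using stably expressed reference genes to cancel platform-specific effects can be sketched as follows. This is not the DDR algorithm itself, whose details the abstract does not give; it only shows a within-sample log-ratio against the mean log expression of reference genes, under the simplifying assumption that a platform effect acts as a uniform scale factor on one sample. Gene names such as `REF1` and `GENE_A` are hypothetical placeholders.

```python
import math

def reference_normalize(sample, reference_genes):
    """Express each non-reference gene relative to the reference genes.

    sample maps gene name -> positive expression value. The result is
    log2(expression) minus the mean log2 expression of the reference
    genes, computed within the same sample.
    """
    ref_level = sum(math.log2(sample[g]) for g in reference_genes) / len(reference_genes)
    return {
        gene: math.log2(value) - ref_level
        for gene, value in sample.items()
        if gene not in reference_genes
    }

# One hypothetical sample measured on two "platforms" that differ by a scale factor.
refs = ["REF1", "REF2"]
platform_a = {"REF1": 4.0, "REF2": 16.0, "GENE_A": 64.0, "GENE_B": 2.0}
platform_b = {gene: 8.0 * value for gene, value in platform_a.items()}

features_a = reference_normalize(platform_a, refs)
features_b = reference_normalize(platform_b, refs)
```

Because every feature is a ratio computed inside a single sample, the two feature dictionaries agree even though the raw values differ, which is the property that makes single-sample classification possible in this simplified setting. Real platform biases are generally more complex than a single scale factor.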


2019 ◽  
Author(s):  
Ling Zhang ◽  
Ishwor Thapa ◽  
Christian Haas ◽  
Dhundy Bastola

Abstract. High-throughput gene expression profiles have allowed the discovery of potential biomarkers enabling early diagnosis, prognosis, and the development of individualized treatment. However, it remains a challenge to identify a set of reliable and reproducible biomarkers across various gene expression platforms and laboratories for single-sample diagnosis and prognosis. We address this need with our Data-Driven Reference (DDR) approach, which employs stably expressed housekeeping genes as references to eliminate platform-specific biases and non-biological variabilities. Our method identifies biomarkers with “built-in” features, and these features can be interpreted consistently regardless of profiling technology, which enables classification of single samples independent of platform. Validation with RNA-seq data of blood platelets shows that DDR achieves superior performance in classification of six different tumor types as well as molecular target statuses (such as MET or HER2-positive, and mutant KRAS, EGFR or PIK3CA) with smaller sets of biomarkers. We demonstrate on three microarray datasets that our method is capable of identifying robust biomarkers for subgrouping medulloblastoma samples with data perturbation due to different microarray platforms. In addition to identifying the majority of subgroup-specific biomarkers in the nanoString CodeSet, our method detected some potential new biomarkers for subgrouping medulloblastoma. Our results show that the DDR method contributes significantly to single-sample classification of disease and sheds light on personalized medicine.


2019 ◽  
Author(s):  
Chen Jiaxing ◽  
Yen Kaow Ng ◽  
Lu Lin ◽  
Yiqi Jiang ◽  
Shuaicheng Li

Various distance functions for evaluating the differences between gene expression profiles have been proposed in the past. Such a function would output a low value if the profiles are strongly correlated, either negatively or positively, and a high value otherwise. One popular distance function is the absolute correlation distance, da = 1 − |ρ|, where ρ is some similarity measure, such as Pearson or Spearman correlation. However, absolute correlation distance fails to fulfill the triangle inequality, which would have guaranteed better performance at vector quantization, allowed fast data localization, and sped up data clustering. In this work, we propose dr = √(1 − |ρ|) as an alternative. We prove that dr satisfies the triangle inequality when ρ represents Pearson correlation, Spearman correlation, or cosine similarity. We empirically compared dr with da in gene clustering and sample clustering experiments, using real biological data. The two distances performed similarly in both gene clustering and sample clustering, with both hierarchical clustering and PAM clustering. However, dr demonstrated more robust clustering. According to a bootstrap experiment, the number of times dr generated a more robust sample pair partition is significantly (p-value <0.05) larger. This advantage in robustness is also supported by the class "dissolved" event.
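
A small numerical check makes the difference between the two distances concrete. The sketch below implements da and dr with Pearson correlation and evaluates them on three hand-picked vectors (hypothetical data chosen so that the centered vectors sit at 0, 45, and 90 degrees to each other). The helper names are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cx = [a - mx for a in x]
    cy = [b - my for b in y]
    num = sum(a * b for a, b in zip(cx, cy))
    den = math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
    return num / den

def d_a(x, y):
    """Absolute correlation distance: 1 - |rho|."""
    return 1.0 - abs(pearson(x, y))

def d_r(x, y):
    """Square-root variant: sqrt(1 - |rho|); max() guards against rounding."""
    return math.sqrt(max(0.0, 1.0 - abs(pearson(x, y))))

# x and z are uncorrelated; y lies "halfway" between them (rho = 1/sqrt(2) to each).
x = [1.0, -1.0, 0.0]
z = [1.0, 1.0, -2.0]
y = [a / math.sqrt(2) + b / math.sqrt(6) for a, b in zip(x, z)]

# d_a violates the triangle inequality on this triple:
# d_a(x, z) = 1, while d_a(x, y) + d_a(y, z) is about 0.586.
violates = d_a(x, z) > d_a(x, y) + d_a(y, z)

# d_r satisfies it on the same triple:
# d_r(x, z) = 1, while d_r(x, y) + d_r(y, z) is about 1.082.
holds = d_r(x, z) <= d_r(x, y) + d_r(y, z)
```

A single triple only demonstrates that da can fail; that dr always satisfies the inequality is the result proved in the paper, not something this check establishes.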


2010 ◽  
Vol 7 (3) ◽  
Author(s):  
Wim De Mulder ◽  
Martin Kuiper ◽  
René Boel

Summary. Clustering is an important approach in the analysis of biological data, and often a first step to identify interesting patterns of coexpression in gene expression data. Because of the high complexity and diversity of gene expression data, many genes cannot be easily assigned to a cluster, but even if the dissimilarity of these genes with all other gene groups is large, they will finally be forced to become members of a cluster. In this paper we show how to detect such elements, called unstable elements. We have developed an approach for iterative clustering algorithms in which unstable elements are deleted, making the iterative algorithm less dependent on initial centers. Although the approach is unsupervised, it is less likely that the clusters into which the reduced data set is subdivided contain false positives. This clustering yields a more differentiated approach for biological data, since the cluster analysis is divided into two parts: the pruned data set is divided into highly consistent clusters in an unsupervised way, and the removed, unstable elements, for which no meaningful cluster exists in unsupervised terms, can be given a cluster with the use of biological knowledge and information about the likelihood of cluster membership. We illustrate our framework on both an artificial and a real biological data set.


2004 ◽  
Vol 171 (4S) ◽  
pp. 349-350
Author(s):  
Gaelle Fromont ◽  
Michel Vidaud ◽  
Alain Latil ◽  
Guy Vallancien ◽  
Pierre Validire ◽  
...  
