On triangular inequalities of correlation-based distances for gene expression profiles

2019 ◽  
Author(s):  
Chen Jiaxing ◽  
Yen Kaow Ng ◽  
Lu Lin ◽  
Yiqi Jiang ◽  
Shuaicheng Li

Various distance functions for evaluating the differences between gene expression profiles have been proposed in the past. Such a function would output a low value if the profiles are strongly correlated, either negatively or positively, and vice versa. One popular distance function is the absolute correlation distance, da=1−|ρ|, where ρ is some similarity measure, such as Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangular inequality, which would have guaranteed better performance at vector quantization, allowed fast data localization, and sped up data clustering. In this work, we propose dr=√(1−|ρ|) as an alternative. We prove that dr satisfies the triangular inequality when ρ represents Pearson correlation, Spearman correlation, or cosine similarity. We empirically compared dr with da in gene clustering and sample clustering experiments, using real biological data. The two distances performed similarly in both gene clustering and sample clustering, under both hierarchical clustering and PAM clustering. However, dr demonstrated more robust clustering. According to a bootstrap experiment, the number of times where dr generated a more robust sample pair partition is significantly (p-value <0.05) larger. This advantage in robustness is also supported by the class "dissolved" event.
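The difference between the two distances in the abstract above can be checked numerically. The following is a minimal sketch, not the authors' code: the three toy profiles `x`, `y`, `z` and the helper names `pearson`, `d_a`, `d_r`, and `triangle_holds` are illustrative assumptions, and ρ is taken to be Pearson correlation. The profiles are chosen so that `x` and `z` are uncorrelated while `y` lies "between" them.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def d_a(x, y):
    """Absolute correlation distance: 1 - |rho|."""
    return 1.0 - abs(pearson(x, y))

def d_r(x, y):
    """Square-root variant: sqrt(1 - |rho|)."""
    return math.sqrt(1.0 - abs(pearson(x, y)))

def triangle_holds(dist, x, y, z, tol=1e-12):
    """Check dist(x, z) <= dist(x, y) + dist(y, z) for one triple."""
    return dist(x, z) <= dist(x, y) + dist(y, z) + tol

# Hypothetical toy profiles (not data from the paper).
x = [1.0, 1.0, -1.0, -1.0]
z = [1.0, -1.0, 1.0, -1.0]   # uncorrelated with x
y = [2.0, 0.0, 0.0, -2.0]    # x + z, equally correlated with both

for name, dist in (("d_a", d_a), ("d_r", d_r)):
    print(name, dist(x, z), dist(x, y) + dist(y, z), triangle_holds(dist, x, y, z))
```

For this triple, da(x, z) exceeds da(x, y) + da(y, z), whereas dr does not violate the inequality. A single triple only illustrates the claim; the general statement rests on the proof in the paper.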

2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Lucas Delmonico ◽  
Said Attiya ◽  
Joan W. Chen ◽  
John C. Obenauer ◽  
Edward C. Goodwin ◽  
...  

Background. With the development of new drug combinations and targeted treatments for multiple types of cancer, the ability to stratify categories of patient populations and to develop companion diagnostics has become increasingly important. A panel of 325 RNA biomarkers was selected based on cancer-related biological processes of healthy cells and gene expression changes over time during nonmalignant epithelial cell organization. This “cancer in reverse” approach resulted in a panel of biomarkers relevant for at least 7 cancer types, providing gene expression profiles representing key cellular signaling pathways beyond mutations in “driver genes.” Objective. To further investigate this biomarker panel, the objective of the current study is to (1) validate the assay reproducibility for the 325 RNA biomarkers and (2) compare gene expression profiles side by side using two technology platforms. Methods and Results. We have mapped the 325 RNA transcripts in a custom NanoString nCounter expression panel to be compared to all potential probe sets in the Affymetrix Human Genome U133 Plus 2.0. The experiments were conducted with 10 unique biological formalin-fixed paraffin-embedded (FFPE) breast tumor samples. Each site extracted RNA from four sections of 10-micron thick FFPE tissue over three different days by two different operators using an optimized standard operating procedure and quality control criteria. Samples were analyzed using mas5 in BioConductor and NanoStringNorm in R. Pearson correlation showed reproducibility between sites for all 60 samples with r=0.995 for Affymetrix and r=0.999 for NanoString. Correlation across multiple days and multiple users was r=0.962−0.999 for Affymetrix and r=0.982−0.991 for NanoString. Conclusion. The 325 RNA biomarkers showed reproducibility in two technology platforms with moderate to high concordance.
Future directions include performing clinical validation studies and generating rationale for patient selection in clinical trials using the technically validated assay.


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Liying Yang ◽  
Zhimin Liu ◽  
Xiguo Yuan ◽  
Jianhua Wei ◽  
Junying Zhang

Background. Precisely predicting cancer is crucial for cancer treatment. Gene expression profiles make it possible to analyze patterns between genes and cancers on the genome-wide scale. Gene expression data analysis, however, is confronted with enormous challenges owing to its characteristics, such as high dimensionality, small sample size, and low signal-to-noise ratio. Results. This paper proposes a method, termed RS_SVM, to predict gene expression profiles by aggregating SVMs trained on random subspaces. After choosing gene features through statistical analysis, RS_SVM randomly selects feature subsets to yield random subspaces, trains SVM classifiers accordingly, and then aggregates the SVM classifiers to capture the advantage of ensemble learning. Experiments on eight real gene expression datasets are performed to validate the RS_SVM method. Experimental results show that RS_SVM achieved better classification accuracy and generalization performance than a single SVM, K-nearest neighbor, decision tree, Bagging, AdaBoost, and the state-of-the-art methods. Experiments also explored the effect of subspace size on prediction performance. Conclusions. The proposed RS_SVM method yielded superior performance in analyzing gene expression profiles, which demonstrates that RS_SVM provides a good channel for such biological data.
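The random-subspace idea described in the abstract above (draw random feature subsets, train one classifier per subset, aggregate by voting) can be sketched as follows. This is an illustration only and not the RS_SVM implementation: to stay dependency-free, a nearest-centroid classifier is swapped in for the SVM base learner, the statistical feature-selection step is omitted, and all function names, parameters, and toy data are hypothetical.

```python
import random
from collections import Counter

def fit_centroids(X, y, features):
    """Stand-in base learner (NOT an SVM): per-class mean over a feature subset."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroids(centroids, features, row):
    """Assign the class whose centroid is nearest within the feature subset."""
    def sqdist(c):
        return sum((row[f] - c[k]) ** 2 for k, f in enumerate(features))
    return min(centroids, key=lambda label: sqdist(centroids[label]))

def fit_random_subspace(X, y, n_models=15, subspace_size=3, seed=0):
    """Train one base learner per randomly drawn feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_models):
        features = rng.sample(range(n_features), subspace_size)
        ensemble.append((features, fit_centroids(X, y, features)))
    return ensemble

def predict_ensemble(ensemble, row):
    """Aggregate the base learners by majority vote."""
    votes = Counter(predict_centroids(c, f, row) for f, c in ensemble)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 4 samples, 6 "genes", two classes.
X = [
    [0.1, 0.2, 0.0, 0.1, 0.2, 0.1],
    [0.2, 0.1, 0.1, 0.0, 0.1, 0.2],
    [0.9, 1.0, 0.8, 0.9, 1.0, 0.9],
    [1.0, 0.9, 0.9, 0.8, 0.9, 1.0],
]
y = ["A", "A", "B", "B"]

ensemble = fit_random_subspace(X, y)
print(predict_ensemble(ensemble, [0.15] * 6))
print(predict_ensemble(ensemble, [0.95] * 6))
```

The `subspace_size` parameter plays the role of the subspace size whose effect the abstract says was explored; the value used here is arbitrary.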


Blood ◽  
2005 ◽  
Vol 106 (11) ◽  
pp. 3424-3424 ◽  
Author(s):  
Kunju Sridhar ◽  
Patrick O. Brown ◽  
Robert Tibshirani ◽  
Catriona Jamieson ◽  
Irv Weissman ◽  
...  

Abstract Gene expression profiles (GEPs) were obtained from marrow hematopoietic precursor cells (HPC)(CD34+ cells) from 30 myelodysplastic syndrome (MDS) patients: RARS 2, RA 15, RAEB 9, RAEBT 4; IPSS Low 11, Int-1 10, Int-2 5, High 4, and 6 Normal individuals. Fluorescently labeled cDNA was prepared from CD34+ cells (>90% purity), isolated by immunomagnetic column separation, after reverse transcription of high fidelity PCR-amplified poly(A) RNA (aRNA). The Cy-conjugated nucleotides for aRNA were hybridized to 40,000 gene chip microarrays obtained from the Stanford Functional Genomics Microarray Facility. aRNA from pooled normal CD34+ marrow cells was used as a Reference standard. High resolution scans were obtained to compile a dataset for each microarray, through files submitted to the Stanford Microarray Database. Dendrograms generated by unsupervised hierarchical gene clustering indicated major differences of GEP between Normal and MDS patients. Significance Analysis for Microarray (SAM) yielded 2327 genes significantly differentially expressed by MDS vs Normal: 2269 genes overexpressed, 58 underexpressed, with a false positive rate of ~10%. Prediction Analysis of Microarray (PAM) distinctly separated the MDS and Normal patients, requiring a minimum of 31 genes (which were also SAM significant). Class analysis by PAM correctly predicted 29 of the 30 to be MDS and 5 of the 6 to be Normal. Four disparate differential GEP regions in the dendrograms, comprising predominantly genes of differing functional categories provided signatures associated with differing MDS clinical subgroups. Nine of 10 patients with poor clinical outcomes were associated with a differing GEP signature than that which occurred in 14 of 20 patients with relatively good outcomes. Compared to the remainder of MDS patients, those with 5q- syndrome (n=5) had a differing GEP signature, with under-expression of 1018 genes, 11 of which were within the 5q31–32 CDS.
Two of these genes (antioxidant protein1 and interferon regulatory factor1) have previously been proffered as candidate genes for this syndrome. Analysis of FACS-sorted highly purified marrow HPC subsets: CD34+38+ (late) and CD34+38- (early HPCs), indicated these ratios to be 4.3±2.1 (n=2) for MDS and 3.2±1.2 (n=12) for Normals. These findings suggest that the differing GEPs between the MDS and Normal CD34+ cells were not due to major differences in their proportions of CD38 cell subsets. SAM and PAM significant differential GEPs were noted between these cell subsets (also differing between MDS and Normal), indicating alteration of gene expression during differentiation. Wnt1 and β-catenin1 (genes involved in cell self-renewal) were over-expressed in both MDS CD38- and CD38+ cells compared to Normal. These data demonstrate: (1) molecular differences between MDS and Normal HPCs and within HPC subsets; (2) GEP signatures characterizing MDS patients with differing cytogenetic abnormalities (eg, 5q-) and clinical outcomes; (3) molecular criteria refining the prognostic categorization of MDS; and (4) gene expression data aiding characterization of the heterogeneous nature of this spectrum of diseases.


2010 ◽  
Vol 7 (3) ◽  
Author(s):  
Wim De Mulder ◽  
Martin Kuiper ◽  
René Boel

Summary. Clustering is an important approach in the analysis of biological data, and often a first step to identify interesting patterns of coexpression in gene expression data. Because of the high complexity and diversity of gene expression data, many genes cannot be easily assigned to a cluster, but even if the dissimilarity of these genes with all other gene groups is large, they will finally be forced to become a member of a cluster. In this paper we show how to detect such elements, called unstable elements. We have developed an approach for iterative clustering algorithms in which unstable elements are deleted, making the iterative algorithm less dependent on initial centers. Although the approach is unsupervised, it is less likely that the clusters into which the reduced data set is subdivided contain false positives. This clustering yields a more differentiated approach for biological data, since the cluster analysis is divided into two parts: the pruned data set is divided into highly consistent clusters in an unsupervised way and the removed, unstable elements for which no meaningful cluster exists in unsupervised terms can be given a cluster with the use of biological knowledge and information about the likelihood of cluster membership. We illustrate our framework on both an artificial and real biological data set.


2004 ◽  
Vol 171 (4S) ◽  
pp. 349-350
Author(s):  
Gaelle Fromont ◽  
Michel Vidaud ◽  
Alain Latil ◽  
Guy Vallancien ◽  
Pierre Validire ◽  
...  
