Current Projection Methods-Induced Biases at Subgroup Detection for Machine-Learning Based Data-Analysis of Biomedical Data

2019 ◽  
Vol 21 (1) ◽  
pp. 79 ◽  
Author(s):  
Jörn Lötsch ◽  
Alfred Ultsch

Advances in flow cytometry enable the acquisition of large and high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, finally, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require the correct representation of the structures in the data. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach allowed the correct identification of homogeneous data, while in data sets containing distance- or density-based subgroup structures, the number of subgroups and the data point assignments were correctly displayed. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
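
The reported pitfall is easy to reproduce in a few lines. The following sketch (scikit-learn; not the authors' code, and the data are purely illustrative) applies t-SNE to homogeneously distributed data containing no subgroups; the resulting embedding can nonetheless appear clustered, especially at small perplexity values.

```python
# A minimal sketch of the described pitfall: t-SNE applied to homogeneous
# data with no subgroup structure can still produce an embedding that
# appears to contain clusters.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 50))  # uniformly distributed: no true subgroups

# Small perplexity values tend to fragment homogeneous data in the embedding.
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(embedding.shape)  # (1000, 2); a scatterplot may suggest spurious clusters
```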

Genes ◽  
2020 ◽  
Vol 11 (7) ◽  
pp. 717
Author(s):  
Garba Abdulrauf Sharifai ◽  
Zurinahni Zainol

Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but a massive number of features (high dimensionality). High-dimensional and imbalanced data sets pose severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced classes or high-dimensional data sets and have come up with various methods. Nonetheless, few approaches reported in the literature address the intersection of the high-dimensionality and class-imbalance problems, due to their complicated interactions. Lately, feature selection has become a well-known technique used to overcome this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation-Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multiple filters coupled with the correlation-based redundancy method to select optimal feature subsets. A binary grasshopper optimization algorithm (BGOA) is used to formulate the feature selection process as an optimization problem and to select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high-dimensional and imbalanced data sets in terms of the G-mean and area under the curve (AUC) performance metrics.
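
The published rCBR-BGOA implementation is not reproduced here, but the multi-filter ensemble idea can be sketched: score every feature with several filter methods, aggregate the per-filter ranks, and hand the top candidates to the optimizer. The filter choices, subset size, and data below are illustrative assumptions.

```python
# Hedged sketch of a multi-filter ensemble for feature selection on an
# imbalanced, high-dimensional data set (not the published rCBR-BGOA code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif
from scipy.stats import rankdata

X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)  # imbalanced

# Score features with two different filters.
scores = [f_classif(X, y)[0], mutual_info_classif(X, y, random_state=0)]
# Higher score = more discriminative; average the per-filter ranks.
ensemble_rank = np.mean([rankdata(-s) for s in scores], axis=0)
selected = np.argsort(ensemble_rank)[:20]  # candidate subset for the BGOA search
print(selected)
```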


Algorithms ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 109 ◽  
Author(s):  
Marian B. Gorzałczany ◽  
Filip Rudziński

In this paper, we briefly present several modifications and generalizations of the concept of self-organizing neural networks—usually referred to as self-organizing maps (SOMs)—to illustrate their advantages in applications that range from high-dimensional data visualization to complex data clustering. Starting from conventional SOMs, Growing SOMs (GSOMs), Growing Grid Networks (GGNs), the Incremental Grid Growing (IGG) approach, and the Growing Neural Gas (GNG) method, as well as our two original solutions, i.e., Generalized SOMs with 1-Dimensional Neighborhood (GeSOMs with 1DN, also referred to as Dynamic SOMs (DSOMs)) and Generalized SOMs with Tree-Like Structures (GeSOMs with T-LSs), are discussed. They are characterized in terms of (i) the modification mechanisms used, (ii) the range of network modifications introduced, (iii) the structure regularity, and (iv) the data-visualization/data-clustering effectiveness. The performance of particular solutions is illustrated and compared by means of selected data sets. We also show that the proposed original solutions, i.e., GeSOMs with 1DN (DSOMs) and GeSOMs with T-LSs, outperform alternative approaches in various complex clustering tasks by providing up to a 20% increase in clustering accuracy. The contribution of this work is threefold. First, original, algorithm-oriented computer implementations of the particular SOM generalizations are developed. Second, their detailed simulation results are presented and discussed. Third, the advantages of our earlier-mentioned original solutions are demonstrated.
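
As a point of reference for the growing variants discussed above, a conventional SOM can be written in a few lines of NumPy. The sketch below is illustrative only; its fixed 10×10 grid is precisely what GSOMs, GeSOMs/DSOMs, and the other generalizations relax by growing or restructuring the network.

```python
# Minimal conventional-SOM training loop (illustrative baseline only).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))            # toy 3-D data
grid_h, grid_w = 10, 10
weights = rng.normal(size=(grid_h, grid_w, 3))
ys, xs = np.mgrid[0:grid_h, 0:grid_w]       # neuron grid coordinates

for t in range(2000):
    x = data[rng.integers(len(data))]
    # Best-matching unit (BMU): neuron whose weight vector is closest to x.
    d = np.linalg.norm(weights - x, axis=2)
    by, bx = np.unravel_index(np.argmin(d), d.shape)
    # Exponentially decaying learning rate and neighborhood radius.
    lr = 0.5 * np.exp(-t / 1000)
    sigma = 3.0 * np.exp(-t / 1000)
    h = np.exp(-((ys - by) ** 2 + (xs - bx) ** 2) / (2 * sigma ** 2))
    weights += lr * h[..., None] * (x - weights)  # pull neighborhood toward x
```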


2020 ◽  
Author(s):  
Luis Andre Vale-Silva ◽  
Karl Rohr

The age of precision medicine demands powerful computational techniques to handle high-dimensional patient data. We present MultiSurv, a multimodal deep learning method for long-term pan-cancer survival prediction. MultiSurv is composed of three main modules. A feature representation module includes a dedicated submodel for each input data modality. A data fusion layer aggregates the multimodal representations. Finally, a prediction submodel yields conditional survival probabilities for a predefined set of follow-up time intervals. We trained MultiSurv on clinical, imaging, and four different high-dimensional omics data modalities from patients diagnosed with one of 33 different cancer types. We evaluated unimodal input configurations against several previous methods and different multimodal data combinations. MultiSurv achieved the best results according to different time-dependent metrics and delivered highly accurate long-term patient survival curves. The best performance was obtained when combining clinical information with either gene expression or DNA methylation data, depending on the evaluation metric. Additionally, MultiSurv can handle missing data, including individual missing values and entirely missing data modalities. Interestingly, for unimodal data we found that simpler modeling approaches, including the classical Cox proportional hazards method, can achieve results rivaling those of more complex methods for certain data modalities. We also show how the learned feature representations of MultiSurv can be used to visualize relationships between cancer types and individual patients after embedding into a low-dimensional space.
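
The authors' released code is not reproduced here; the sketch below (PyTorch) only mirrors the three-module structure described in the abstract: per-modality encoders, a fusion layer, and a discrete-time head yielding conditional survival probabilities per follow-up interval. The layer sizes, the use of concatenation for fusion, and the interval count are assumptions for illustration.

```python
# Hedged architectural sketch of the three described modules (not MultiSurv's
# actual implementation).
import torch
import torch.nn as nn

class MultiSurvSketch(nn.Module):
    def __init__(self, modality_dims, hidden=128, n_intervals=30):
        super().__init__()
        # One encoder submodel per input data modality.
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in modality_dims]
        )
        # Fusion by concatenation is an assumption made for this sketch.
        self.fusion = nn.Linear(hidden * len(modality_dims), hidden)
        # Sigmoid outputs: conditional survival probability per interval.
        self.head = nn.Sequential(nn.Linear(hidden, n_intervals), nn.Sigmoid())

    def forward(self, modalities):
        feats = [enc(x) for enc, x in zip(self.encoders, modalities)]
        fused = torch.relu(self.fusion(torch.cat(feats, dim=1)))
        return self.head(fused)

model = MultiSurvSketch(modality_dims=[20, 1000, 500])  # e.g., clinical + two omics
cond = model([torch.randn(4, 20), torch.randn(4, 1000), torch.randn(4, 500)])
survival_curve = torch.cumprod(cond, dim=1)  # per-patient long-term survival curve
```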


2021 ◽  
Author(s):  
Michael Thrun

Abstract
Benchmark datasets with predefined cluster structures and high-dimensional biomedical datasets outline the challenges of cluster analysis: clustering algorithms are limited in their ability to recover clusters that define distance-based structures, resulting in biased clustering solutions. Data sets might not have cluster structures at all. Clustering yields arbitrary labels and often depends on the trial, leading to varying results. Moreover, recent research indicated that all partition comparison measures can yield the same results for different clustering solutions. Consequently, algorithm selection and parameter optimization by unsupervised quality measures (QMs) are always biased and misleading. Only if the predefined structures happen to meet the particular clustering criterion and QM can the clusters be recovered by one of the 34 open-source algorithms, which are particularly useful in biomedical scenarios. Furthermore, comparative analysis with mirrored density plots provides a significantly more detailed benchmark than the typically used box plots or violin plots.

Modern biomedical analysis techniques such as next-generation sequencing (NGS) have opened the door for complex high-dimensional data acquisition in medicine. For example, The Cancer Genome Atlas (TCGA) project provides open source cancer data for a worldwide community. The availability of such rich data sources, which enable discovering new insights into disease-related genetic mechanisms, is challenging for data analysts. Genome- or transcriptome-wide association studies may reveal novel disease-related genes (e.g., [1]), and virtual karyotyping by NGS-based low-coverage whole-genome sequencing may replace the conventional karyotyping technique 130 years after von Waldeyer described human chromosomes [2]. However, deciphering previously unknown relations and hierarchies in high-dimensional biological datasets remains a challenge for knowledge discovery, meaning that the identification of valid, novel, potentially useful, and ultimately understandable patterns in data (e.g., [3]) is a difficult task. A common first step is identifying clusters of objects that are likely to be functionally related or to interact [4], which has provoked debates about the most suitable clustering approaches. However, the definition of a cluster remains a matter of ongoing discussion [5,6]. Therefore, clustering is restricted here to the task of separating data into similar groups (cf. [7,8]). Vividly, the relative relationships between high-dimensional data points are of interest for building up structures in data that a cluster analysis can identify. Therefore, it remains essential to evaluate the results of clustering algorithms and grasp the differences in the structures they can capture.

Recent research on cluster analysis conveys the message that relevant and possibly previously unknown relationships in high-dimensional biological datasets can be discovered by employing optimization procedures and automatic pipelines for either benchmarking or algorithm selection (e.g., [4,9]). The state-of-the-art approach is to use one or more unsupervised indices for automatic evaluation; e.g., Wiwie et al. [4] suggest the following guidelines for biomedical data: "Use […] [hierarchical clustering*] or PAM. (2) Compute the silhouette values for clustering results using a broad range of parameter set variations. (3) Pick the result for the parameter set yielding the highest silhouette value" (*restricted to UPGMA or average linking, see https://clusteval.sdu.dk/1/programs). Alternatively, the authors provide the possibility of using the internal Davies–Bouldin [10] and Dunn [11] indices.

This work demonstrates the pitfalls and challenges of such approaches; more precisely, it shows that

• parameter optimization on datasets without distance-based clusters,
• algorithm selection by unsupervised quality measures on biomedical data, and
• benchmarking clustering algorithms with first-order statistics, box plots, or a small number of trials

are biased and often not recommended. Evidence for these pitfalls in cluster analysis is provided through the systematic and unbiased evaluation of 34 open-source clustering algorithms on several bodies of data that possess clearly defined structures. These insights are particularly useful for knowledge discovery in biomedical scenarios. Selected distance-based structures are consistently defined in artificial data samples with specific pitfalls for clustering algorithms. Moreover, two natural datasets with investigated cluster structures are employed, and it is shown that the data reflect a true and valid empirical biomedical entity. This work shows that the limitations of clustering methods induced by their clustering criterion cannot be overcome by optimizing the algorithm parameters with a global criterion, because such optimization can only reduce the variance but not the intrinsic bias. This limitation is outlined in two examples in which, by optimizing the Davies–Bouldin index [10], Dunn index [11], or Silhouette value [12], a specific cluster structure is imposed, but the clinically relevant cluster structures are not reproduced. The biases of conventional clustering algorithms are investigated on five artificially defined data structures and two high-dimensional datasets. Furthermore, a clustering algorithm's parameters can still be significantly optimized even if the dataset does not possess any distance-based cluster structure.
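
The silhouette-driven selection pitfall is straightforward to demonstrate. The sketch below (scikit-learn; illustrative, not the paper's benchmark code) optimizes the number of clusters by silhouette value on homogeneous data with no distance-based cluster structure; an "optimal" k is reported regardless.

```python
# Minimal sketch: unsupervised quality measures report a "best" parameter
# setting even on data that contain no clusters at all.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))  # homogeneous: no true cluster structure

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])  # an "optimal" k is returned regardless
```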


2021 ◽  
Author(s):  
Kehinde Olobatuyi

Abstract
Similar to many machine learning models, both the accuracy and speed of cluster weighted models (CWMs) can be hampered by high-dimensional data; this has motivated previous work on parsimonious techniques to reduce the effect of the "curse of dimensionality" on mixture models. In this work, we review the background of cluster weighted models (CWMs). We further show that parsimonious techniques alone are not sufficient for mixture models to thrive in the presence of very high-dimensional data. We discuss a heuristic for detecting the hidden components by choosing the initial values of the location parameters using the default values in the "FlexCWM" R package. We introduce a dimensionality reduction technique, t-distributed stochastic neighbor embedding (t-SNE), to enhance parsimonious CWMs in high-dimensional space. Originally, CWMs are suited for regression, but for classification purposes all multi-class variables are transformed logarithmically with some noise. The parameters of the model are obtained via the expectation-maximization algorithm. The effectiveness of the discussed technique is demonstrated using real data sets from different fields.
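
FlexCWM is an R package and no CWM implementation ships with scikit-learn, so the sketch below uses a Gaussian mixture as a stand-in to illustrate only the pipeline shape described: embed the high-dimensional data with t-SNE first, then fit the mixture in the reduced space. The data and component count are illustrative assumptions.

```python
# Hedged sketch of the described pipeline: t-SNE embedding followed by a
# mixture-model fit (GaussianMixture stands in for a CWM here).
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 100)),
               rng.normal(3, 1, (200, 100))])  # two high-dimensional components

Z = TSNE(n_components=2, random_state=0).fit_transform(X)  # reduce first
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(Z)
print(np.bincount(labels))
```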


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL discovers subspaces in dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm. Additionally, we investigate the effects of the first phase of detecting dense regions on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality and outperforms PROCLUS in efficiency.
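
The MOSCL implementation itself is not reproduced here, but the first-phase idea of flagging dense regions per dimension can be sketched roughly as follows; the grid resolution, density threshold, and outlier rule are illustrative assumptions, not the paper's definitions.

```python
# Illustrative sketch of per-dimension dense-region detection with outlier
# candidates (not the MOSCL algorithm).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

n_bins, density_threshold = 10, 0.15
dense_mask = np.zeros(X.shape, dtype=bool)
for j in range(X.shape[1]):
    counts, edges = np.histogram(X[:, j], bins=n_bins)
    dense_bins = counts / len(X) > density_threshold      # dense cells in dim j
    idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_bins - 1)
    dense_mask[:, j] = dense_bins[idx]                    # point lies in a dense cell?

outliers = ~dense_mask.any(axis=1)   # in no dense cell of any dimension
print(outliers.sum(), "outlier candidates")
```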


2017 ◽  
Vol 18 (1) ◽  
pp. 94-109 ◽  
Author(s):  
Junpeng Wang ◽  
Xiaotong Liu ◽  
Han-Wei Shen

Due to the intricate relationship between different dimensions of high-dimensional data, subspace analysis is often conducted to decompose dimensions and give prominence to certain subsets of dimensions, i.e., subspaces. Exploring and comparing subspaces are important to reveal the underlying features of subspaces, as well as to portray the characteristics of individual dimensions. To date, most of the existing high-dimensional data exploration and analysis approaches rely on dimensionality reduction algorithms (e.g., principal component analysis and multi-dimensional scaling) to project high-dimensional data, or their subspaces, to two-dimensional space and employ scatterplots for visualization. However, the dimensionality reduction algorithms are sometimes difficult to fine-tune, and scatterplots are not effective for comparative visualization, making subspace comparison hard to perform. In this article, we aggregate high-dimensional data or their subspaces by computing pair-wise distances between all data items and showing the distances with matrix visualizations to present the original high-dimensional data or subspaces. Our approach enables effective visual comparisons among subspaces, which allows users to further investigate the characteristics of individual dimensions by studying their behaviors in similar subspaces. Through subspace comparisons, we identify dominant, similar, and conforming dimensions in different subspace contexts of synthetic and real-world high-dimensional data sets. Additionally, we present a prototype that integrates a parallel coordinates plot and matrix visualization for high-dimensional data exploration and incremental dimensionality analysis, which also allows users to further validate the dimension characterization results derived from the subspace comparisons.
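
The aggregation step described here can be sketched directly: restrict the data to each subspace, compute the pairwise distance matrix, and render the matrices side by side for comparison. The data and the dimension subsets below are arbitrary illustrative choices, not the article's prototype.

```python
# Minimal sketch of comparing subspaces via pairwise-distance matrix views.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))
subspaces = [[0, 1, 2], [3, 4, 5], [0, 4, 8]]   # illustrative dimension subsets

fig, axes = plt.subplots(1, len(subspaces), figsize=(12, 4))
for ax, dims in zip(axes, subspaces):
    D = squareform(pdist(X[:, dims]))           # pairwise distances in this subspace
    ax.imshow(D, cmap="viridis")                # matrix visualization of the subspace
    ax.set_title(f"dims {dims}")
plt.show()
```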


2017 ◽  
Vol 11 (3) ◽  
pp. 224-233
Author(s):  
Xiaoyun Chen ◽  
Mengzhen Liao ◽  
Xianbao Ye

Gene expression data are high-dimensional data with small sample sizes. The clustering accuracy of conventional clustering techniques on gene expression data is low due to this high dimensionality. Because some subspace segmentation approaches are better suited to high-dimensional spaces, three new subspace clustering models for gene expression data sets are proposed in this work. The proposed projection subspace clustering models are projection sparse subspace clustering, projection low-rank representation subspace clustering, and projection least-squares regression subspace clustering, which combine the projection technique with sparse subspace clustering, low-rank representation, and least-squares regression, respectively. In order to compute the inner product in the high-dimensional space, the kernel function is applied in the projection subspace clustering models. The experimental results on six gene expression data sets show that these models are effective.
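
The proposed projection subspace clustering models are not reimplemented here; the sketch below only illustrates the kernel-trick ingredient the abstract mentions, computing inner products in the high-dimensional space via a Gram matrix, with spectral clustering as a simplified stand-in consumer of that matrix.

```python
# Simplified stand-in: a kernel Gram matrix replaces explicit high-dimensional
# inner products, and a clustering step consumes it directly as an affinity.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))               # high-dimensional, small sample

K = rbf_kernel(X, gamma=1.0 / X.shape[1])     # kernelized inner products
labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(K)
print(np.bincount(labels))
```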


Author(s):  
Dezhong Yao ◽  
Peilin Zhao ◽  
Tuan-Anh Nguyen Pham ◽  
Gao Cong

We investigate how to adopt dual random projection for high-dimensional similarity learning. For a high-dimensional similarity learning problem, projection is usually adopted to map high-dimensional features into a low-dimensional space in order to reduce the computational cost. However, dimensionality reduction methods sometimes result in unstable performance due to the suboptimal solution in the original space. In this paper, we propose a dual random projection framework for similarity learning that recovers the original optimal solution from the subspace optimal solution. Previous dual random projection methods usually make strong assumptions about the data, which needs to be low-rank or have a large margin. Those assumptions limit the application of dual random projection to similarity learning. Thus, we adopt a dual-sparse regularized random projection method that introduces a sparse regularizer into the reduced dual problem. As the original dual solution is sparse, applying a sparse regularizer in the reduced space relaxes the low-rank assumption. Experimental results show that our method enjoys higher effectiveness and efficiency than state-of-the-art solutions.
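
The paper's dual-sparse regularizer is not reproduced here, but the basic dual-recovery idea behind dual random projection can be sketched: the dual variables live in sample space, so a solution learned in the randomly projected space can be mapped back to the original feature space. The ridge-style dual below is an illustrative assumption standing in for the similarity learning objective.

```python
# Hedged sketch of dual recovery after random projection (not the paper's
# algorithm): solve the dual in the reduced space, then map back via X^T alpha.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5000, 100
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=n))

R = rng.normal(size=(d, k)) / np.sqrt(k)      # random projection matrix
Xp = X @ R                                    # reduced-dimension features

lam = 1.0                                     # ridge-style dual in the subspace
alpha = np.linalg.solve(Xp @ Xp.T + lam * np.eye(n), y)
w_recovered = X.T @ alpha                     # dual solution mapped back to R^d
print(w_recovered.shape)
```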

