The Fundamental Clustering and Projection Suite (FCPS): A Dataset Collection to Test the Performance of Clustering and Data Projection Algorithms

Alfred Ultsch; Jörn Lötsch

doi:10.3390/data5010013

Data ◽

10.3390/data5010013 ◽

2020 ◽

Vol 5 (1) ◽

pp. 13 ◽

Cited By ~ 1

Author(s):

Alfred Ultsch ◽

Jörn Lötsch

Keyword(s):

Data Science ◽

Clustering Algorithms ◽

Cluster Structure ◽

Projection Methods ◽

Analysis Method ◽

Clustering Methods ◽

Projection Algorithms ◽

Science Data ◽

Data Density ◽

Data Points

In the context of data science, data projection and clustering are common procedures. The chosen analysis method is crucial to avoid faulty pattern recognition. It is therefore necessary to know the properties and especially the limitations of projection and clustering algorithms. This report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS). The FCPS contains 10 datasets with the names “Atom”, “Chainlink”, “EngyTime”, “Golfball”, “Hepta”, “Lsun”, “Target”, “Tetra”, “TwoDiamonds”, and “WingNut”. Common clustering methods occasionally identified non-existent clusters or assigned data points to the wrong clusters in the FCPS suite. Likewise, common data projection methods could only partially reproduce the data structure correctly on a two-dimensional plane. In conclusion, the FCPS dataset collection addresses general challenges for clustering and projection algorithms such as lack of linear separability, different or small inner class spacing, classes defined by data density rather than data spacing, no cluster structure at all, outliers, or classes that are in contact. The suite is designed to address specific problems of structure discovery in high-dimensional spaces.
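The "lack of linear separability" challenge named in the abstract can be illustrated with a small sketch. The code below builds a Chainlink-like toy construction (two interlocked unit rings in 3D) and assigns every point to the nearest class centroid. The ring geometry, point counts, and function names are assumptions made for illustration; this is not the actual FCPS data and not the authors' evaluation procedure.

```python
import math

def ring(n, center, plane):
    """Return n evenly spaced 3D points on a unit circle in the given plane."""
    pts = []
    for k in range(n):
        t = 2 * math.pi * (k + 0.5) / n  # half-step offset avoids boundary ties
        c, s = math.cos(t), math.sin(t)
        if plane == "xy":
            pts.append((center[0] + c, center[1] + s, center[2]))
        else:  # "xz" plane
            pts.append((center[0] + c, center[1], center[2] + s))
    return pts

def centroid(pts):
    """Component-wise mean of a list of 3D points."""
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(3))

def nearest_centroid_accuracy(n_per_ring=60):
    """Count points that lie closest to the centroid of their own ring."""
    ring_a = ring(n_per_ring, (0.0, 0.0, 0.0), "xy")
    ring_b = ring(n_per_ring, (1.0, 0.0, 0.0), "xz")  # passes through ring A
    cents = [centroid(ring_a), centroid(ring_b)]
    correct = 0
    for label, pts in enumerate((ring_a, ring_b)):
        for p in pts:
            dists = [math.dist(p, c) for c in cents]
            if dists.index(min(dists)) == label:
                correct += 1
    return correct, 2 * n_per_ring

correct, total = nearest_centroid_accuracy()
print(f"{correct} of {total} points assigned to their own ring")
```

Even with the true centroids supplied, a third of the points end up in the wrong ring, because the boundary between the two centroids is a plane and each ring crosses it.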

ClusterEnG: An interactive educational web resource for clustering big data

10.1101/120915 ◽

2017 ◽

Author(s):

Mohith Manjunath ◽

Yi Zhang ◽

Steve H. Yeo ◽

Omar Sobh ◽

Nathan Russell ◽

...

Keyword(s):

Big Data ◽

State Of The Art ◽

Clustering Algorithms ◽

Clustering Methods ◽

Web Resource ◽

Interactive Visualizations ◽

Data Points ◽

Similarities And Differences ◽

Intuitive Manner ◽

The Web

Summary: Clustering is one of the most common techniques used in data analysis to discover hidden structures by grouping together data points that are similar in some measure into clusters. Although there are many programs available for performing clustering, a single web resource that provides both state-of-the-art clustering methods and interactive visualizations is lacking. ClusterEnG (acronym for Clustering Engine for Genomics) provides an interface for clustering big data and interactive visualizations including 3D views, cluster selection and zoom features. ClusterEnG also aims at educating the user about the similarities and differences between various clustering algorithms and provides clustering tutorials that demonstrate potential pitfalls of each algorithm. The web resource will be particularly useful to scientists who are not conversant with computing but want to understand the structure of their data in an intuitive manner. Availability: ClusterEnG is part of a bigger project called KnowEnG (Knowledge Engine for Genomics) and is available at http://education.knoweng.org/

A Novel Complex Networks Clustering Algorithm Based on the Core Influence of Nodes

The Scientific World JOURNAL ◽

10.1155/2014/801854 ◽

2014 ◽

Vol 2014 ◽

pp. 1-7 ◽

Cited By ~ 3

Author(s):

Chao Tong ◽

Jianwei Niu ◽

Bin Dai ◽

Zhongyu Xie

Keyword(s):

Complex Networks ◽

Clustering Algorithm ◽

Cluster Formation ◽

Clustering Algorithms ◽

Cluster Structure ◽

Network Clustering ◽

Clustering Methods ◽

Positive Role ◽

The Core ◽

Final Cluster

In complex networks, cluster structure, identified by the heterogeneity of nodes, has become a common and important topological property. Network clustering methods are thus significant for the study of complex networks. Currently, many typical clustering algorithms suffer from weaknesses such as inaccuracy and slow convergence. In this paper, we propose a clustering algorithm based on calculating the core influence of nodes. The clustering process simulates the process of cluster formation in sociology. The algorithm detects the nodes with core influence through their betweenness centrality and builds each cluster’s core structure using discriminant functions. Next, the algorithm obtains the final cluster structure by assigning the rest of the nodes in the network with an optimization method. Experiments on different datasets show that the clustering accuracy of this algorithm is superior to that of the classical Fast-Newman clustering algorithm. It also clusters faster and plays a positive role in precisely revealing the real cluster structure of complex networks.
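The abstract relies on betweenness centrality to find nodes with core influence. As a minimal sketch of that one ingredient, the code below computes betweenness centrality on an unweighted, undirected graph with Brandes' algorithm. The example graph and all names are illustrative assumptions; the paper's discriminant functions and optimization step are not reproduced here.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to a list of its neighbours.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical example: two triangles joined by the bridge c-d.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
scores = betweenness(graph)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

In this toy graph the two bridge endpoints carry every shortest path between the triangles, so they receive the highest scores, which is the kind of node a core-influence criterion would pick out.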

Hyperspectral Image Clustering with Spatially-Regularized Ultrametrics

Remote Sensing ◽

10.3390/rs13050955 ◽

2021 ◽

Vol 13 (5) ◽

pp. 955

Author(s):

Shukun Zhang ◽

James M. Murphy

Keyword(s):

Spectral Clustering ◽

Hyperspectral Image ◽

State Of The Art ◽

Image Clustering ◽

Clustering Methods ◽

Performance Guarantees ◽

Data Density ◽

Spatial Geometry ◽

Data Points ◽

Almost All

We propose a method for the unsupervised clustering of hyperspectral images based on spatially regularized spectral clustering with ultrametric path distances. The proposed method efficiently combines data density and spectral-spatial geometry to distinguish between material classes in the data, without the need for training labels. The proposed method is efficient, with quasilinear scaling in the number of data points, and enjoys robust theoretical performance guarantees. Extensive experiments on synthetic and real HSI data demonstrate its strong performance compared to benchmark and state-of-the-art methods. Indeed, the proposed method not only achieves excellent labeling accuracy, but also efficiently estimates the number of clusters. Thus, unlike almost all existing hyperspectral clustering methods, the proposed algorithm is essentially parameter-free.
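To make the idea of an ultrametric path distance concrete, the sketch below computes a minimax path distance: the distance between two points is the smallest possible value of the largest single hop along any path connecting them. This is one standard way to obtain an ultrametric from pairwise distances and is assumed here for illustration; the paper's exact construction, its spatial regularization, and its quasilinear implementation are not reproduced (the loop below is cubic).

```python
def minimax_path_distances(points):
    """All-pairs minimax path distances for 1D points (Floyd-Warshall style)."""
    n = len(points)
    d = [[abs(points[i] - points[j]) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Cost of a path through k is its largest hop.
                via_k = max(d[i][k], d[k][j])
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d

# Hypothetical data: a dense chain 0-1-2 and an isolated point at 10.
pts = [0.0, 1.0, 2.0, 10.0]
dist = minimax_path_distances(pts)
print(dist[0][2])  # chain members are close: largest hop is 1.0
print(dist[0][3])  # the isolated point stays far: largest hop is 8.0
```

Points connected through a dense chain become close under this distance regardless of their Euclidean separation, which is how such a metric can take data density into account.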

Multiple Partitions Aligned Clustering

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/375 ◽

2019 ◽

Cited By ~ 1

Author(s):

Zhao Kang ◽

Zipeng Guo ◽

Shudong Huang ◽

Siying Wang ◽

Wenyu Chen ◽

...

Keyword(s):

State Of The Art ◽

Cluster Structure ◽

Consensus Clustering ◽

Consensus Cluster ◽

Clustering Methods ◽

Unified Framework ◽

Significant Information ◽

Heterogeneous Information ◽

Data Points ◽

Indicator Matrix

Multi-view clustering is an important yet challenging task due to the difficulty of integrating the information from multiple representations. Most existing multi-view clustering methods explore the heterogeneous information in the space where the data points lie. Such common practice may cause significant information loss because of unavoidable noise or inconsistency among views. Since different views admit the same cluster structure, the natural space should be all partitions. Orthogonal to existing techniques, in this paper, we propose to leverage the multi-view information by fusing partitions. Specifically, we align each partition to form a consensus cluster indicator matrix through a distinct rotation matrix. Moreover, a weight is assigned for each view to account for the clustering capacity differences of views. Finally, the basic partitions, weights, and consensus clustering are jointly learned in a unified framework. We demonstrate the effectiveness of our approach on several real datasets, where significant improvement is found over other state-of-the-art multi-view clustering methods.
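The idea of aligning partitions before fusing them can be sketched in a discrete form. The code below is a simplified stand-in, not the paper's method: instead of learning rotation matrices and view weights jointly, it aligns hard cluster labels by brute-force label permutation and then takes a weighted vote with fixed weights. All names, weights, and example labels are assumptions.

```python
from itertools import permutations

def align(labels, reference, k):
    """Relabel `labels` with the permutation that best matches `reference`."""
    best = max(
        permutations(range(k)),
        key=lambda p: sum(p[l] == r for l, r in zip(labels, reference)),
    )
    return [best[l] for l in labels]

def consensus(views, weights, k):
    """Weighted vote over partitions aligned to the first view."""
    reference = views[0]
    aligned = [align(v, reference, k) for v in views]
    result = []
    for i in range(len(reference)):
        votes = [0.0] * k
        for labels, w in zip(aligned, weights):
            votes[labels[i]] += w
        result.append(votes.index(max(votes)))
    return result

# Hypothetical views: the second is the first with renamed labels,
# the third disagrees on one data point.
views = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 2, 2, 0, 0],
    [0, 0, 1, 2, 2, 2],
]
print(consensus(views, weights=[1.0, 1.0, 1.0], k=3))
```

The second view looks completely different in label space but encodes the same partition, which is why alignment has to happen before any fusion step.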

ClusterEnG: an interactive educational web resource for clustering and visualizing high-dimensional data

PeerJ Computer Science ◽

10.7717/peerj-cs.155 ◽

2018 ◽

Vol 4 ◽

pp. e155 ◽

Cited By ~ 3

Author(s):

Mohith Manjunath ◽

Yi Zhang ◽

Yeonsung Kim ◽

Steve H. Yeo ◽

Omar Sobh ◽

...

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Clustering Methods ◽

Web Interface ◽

Web Resource ◽

Interactive Visualizations ◽

Data Points ◽

Clustering Data ◽

Clustering Validation ◽

Intuitive Manner

Background: Clustering is one of the most common techniques in data analysis and seeks to group together data points that are similar in some measure. Although there are many computer programs available for performing clustering, a single web resource that provides several state-of-the-art clustering methods, interactive visualizations and evaluation of clustering results is lacking. Methods: ClusterEnG (acronym for Clustering Engine for Genomics) provides a web interface for clustering data and interactive visualizations including 3D views, data selection and zoom features. Eighteen clustering validation measures are also presented to aid the user in selecting a suitable algorithm for their dataset. ClusterEnG also aims at educating the user about the similarities and differences between various clustering algorithms and provides tutorials that demonstrate potential pitfalls of each algorithm. Conclusions: The web resource will be particularly useful to scientists who are not conversant with computing but want to understand the structure of their data in an intuitive manner. The validation measures facilitate the process of choosing a suitable clustering algorithm among the available options. ClusterEnG is part of a bigger project called KnowEnG (Knowledge Engine for Genomics) and is available at http://education.knoweng.org/clustereng.
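The abstract does not name the eighteen validation measures. As an illustration of what such a measure computes, the sketch below implements the silhouette coefficient, a widely used internal validation index; whether ClusterEnG includes it is an assumption, and the 1D data and labelings are made up for the example.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1D points with hard cluster labels."""
    n = len(points)
    scores = []
    for i in range(n):
        # Group distances from point i to every other point by cluster.
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(abs(points[i] - points[j]))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                                   # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())     # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

pts = [0.0, 1.0, 10.0, 11.0]
good = silhouette(pts, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette(pts, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

A score near 1 indicates compact, well-separated clusters and a negative score indicates points that sit closer to another cluster than to their own, so comparing scores across algorithms is one way to choose among them.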

A novel bidirectional clustering algorithm based on local density

Scientific Reports ◽

10.1038/s41598-021-93244-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Baicheng Lyu ◽

Wenhua Wu ◽

Zhiqiang Hu

Keyword(s):

Clustering Algorithm ◽

Local Density ◽

Clustering Algorithms ◽

Cluster Number ◽

Denoising Method ◽

Number Of Clusters ◽

Data Points ◽

Cutoff Distance ◽

Large Clusters ◽

Small Clusters

With the wide application of cluster analysis, the number of clusters is gradually increasing, as is the difficulty of selecting indicators for judging the number of clusters. Also, small clusters are crucial to discovering the extreme characteristics of data samples, but current clustering algorithms focus mainly on analyzing large clusters. In this paper, a bidirectional clustering algorithm based on local density (BCALoD) is proposed. BCALoD establishes the connection between data points based on local density, can automatically determine the number of clusters, is more sensitive to small clusters, and keeps the number of parameters to be adjusted to a minimum. On the basis of the robustness of the cluster number to noise, a denoising method suitable for BCALoD is proposed. A different cutoff distance and cutoff density are assigned to each data cluster, which results in improved clustering performance. The clustering ability of BCALoD is verified on randomly generated datasets and city light satellite images.
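The notions of local density and cutoff distance used in the abstract can be illustrated with a minimal sketch. Here the local density of a point is taken to be the number of other points within a cutoff distance, which is a common definition in density-based clustering and an assumption about BCALoD; the algorithm's bidirectional linking and denoising steps are not reproduced.

```python
def local_density(points, cutoff):
    """Number of other points within `cutoff` of each 1D point."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and abs(p - q) < cutoff)
        for i, p in enumerate(points)
    ]

# Hypothetical data: a large cluster, a small cluster, and one outlier.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 20.0]
rho = local_density(pts, cutoff=0.5)
print(rho)  # [2, 2, 2, 1, 1, 0]

# A single global density threshold of 2 would keep only the large cluster.
kept = [p for p, r in zip(pts, rho) if r >= 2]
print(kept)  # [0.0, 0.1, 0.2]
```

The second print shows why a single global threshold works against small clusters, which is the motivation the abstract gives for assigning cutoffs per cluster.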

Global Surgery, Obstetric, and Anesthesia Indicator Definitions and Reporting: An Utstein Consensus Report

10.20944/preprints202104.0061.v1 ◽

2021 ◽

Author(s):

Justine Ina Davies ◽

Adrian W. Gelb ◽

Julian Gore-Booth ◽

Janet Martin ◽

Jannicke Mellin-Olsen ◽

...

Keyword(s):

Low Income ◽

Data Science ◽

Health Indicators ◽

Obstetric Care ◽

Global Surgery ◽

Low Income Countries ◽

Financial Risk Protection ◽

Current Time ◽

Timely Access ◽

Data Points

Background: Indicators to evaluate progress towards timely access to safe surgical, anaesthesia, and obstetric (SAO) care were proposed in 2015 by the Lancet Commission on Global Surgery. Despite being rapidly taken up by practitioners, datapoints from which to derive them were not defined, limiting comparability across time or settings. We convened global experts to evaluate and explicitly define, for the first time, the indicators to improve comparability and support achievement of 2030 goals to improve access to safe affordable surgical and anaesthesia care. Methods and findings: The Utstein process for developing and reporting guidelines through a consensus building process was followed. In-person discussions at a two-day meeting were followed by an iterative process conducted by email and virtual group meetings until consensus was reached. Participants consisted of experts in surgery, anaesthesia, and obstetric care, data science, and health indicators from high, middle, and low income countries. Considering each of the six indicators in turn, we refined overarching descriptions and agreed upon data points needed for construction of each indicator at current time (basic data points), and as each evolves over 2-5 year (intermediate) and >5 year (full) timeframes. We removed one of the original six indicators (one of two financial risk protection indicators was eliminated) and refined descriptions and defined data points required to construct the 5 remaining indicators: geospatial access, workforce, surgical volume, perioperative mortality, and catastrophic expenditure. Conclusions: To track global progress toward timely access to quality SAO care, these indicators, at the basic level, should be implemented universally. Intermediate and full evolutions will assist in developing national surgical plans and collecting data for research studies.

COVID-19 Evidence Accelerator: A parallel analysis to describe the use of Hydroxychloroquine with or without Azithromycin among hospitalized COVID-19 patients

PLoS ONE ◽

10.1371/journal.pone.0248128 ◽

2021 ◽

Vol 16 (3) ◽

pp. e0248128

Author(s):

Mark Stewart ◽

Carla Rodriguez-Watson ◽

Adem Albayrak ◽

Julius Asubonteng ◽

Andrew Belli ◽

...

Keyword(s):

Best Practices ◽

Adverse Events ◽

Data Science ◽

Proportional Hazards ◽

Cox Proportional Hazards ◽

Parallel Analysis ◽

Science Data ◽

Systems Research ◽

Cox Proportional Hazards Models ◽

Global Threat

Background: The COVID-19 pandemic remains a significant global threat. However, despite urgent need, there remains uncertainty surrounding best practices for pharmaceutical interventions to treat COVID-19. In particular, conflicting evidence has emerged surrounding the use of hydroxychloroquine and azithromycin, alone or in combination, for COVID-19. The COVID-19 Evidence Accelerator convened by the Reagan-Udall Foundation for the FDA, in collaboration with Friends of Cancer Research, assembled experts in health systems research, regulatory science, data science, and epidemiology to participate in a large parallel analysis of different data sets to further explore the effectiveness of these treatments. Methods: Electronic health record (EHR) and claims data were extracted from seven separate databases. Parallel analyses were undertaken on data extracted from each source. Each analysis examined time to mortality in hospitalized patients treated with hydroxychloroquine, azithromycin, and the two in combination as compared to patients not treated with either drug. Cox proportional hazards models were used, and propensity score methods were undertaken to adjust for confounding. Frequencies of adverse events in each treatment group were also examined. Results: Neither hydroxychloroquine nor azithromycin, alone or in combination, was significantly associated with time to mortality among hospitalized COVID-19 patients. No treatment groups appeared to have an elevated risk of adverse events. Conclusion: Administration of hydroxychloroquine, azithromycin, and their combination appeared to have no effect on time to mortality in hospitalized COVID-19 patients. Continued research is needed to clarify best practices surrounding treatment of COVID-19.

A Robust Noise Resistant Algorithm for POI Identification from Flickr Data

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/460 ◽

2017 ◽

Cited By ~ 3

Author(s):

Yiyang Yang ◽

Zhiguo Gong ◽

Qing Li ◽

Leong Hou U ◽

Ruichu Cai ◽

...

Keyword(s):

Social Media ◽

Joint Space ◽

Zero Crossing ◽

Social Media Data ◽

Local Maxima ◽

Gradient Ascent ◽

Data Density ◽

Density Values ◽

Data Points ◽

Media Data

Point-of-interest (POI) identification using social media data (e.g. Flickr, Microblog) has been one of the most popular research topics in recent years. However, such crowd-contributed collections contain large amounts of noise (POI-irrelevant data). The traditional solution to this problem is to set a global density threshold and remove a data point as noise if its density is lower than the threshold. However, the density values vary significantly among POIs. As a result, some POIs with relatively low density cannot be identified. To solve the problem, we propose a technique based on the local drastic changes of the data density. First, we define the local maxima of the density function as the urban POIs, and the gradient ascent algorithm is exploited to assign data points to different clusters. To remove noise, we incorporate the Laplacian zero-crossing points along the gradient ascent process as the boundaries of the POI. Points located outside the POI region are regarded as noise. Then the technique is extended into the geographical and textual joint space so that it can make use of the heterogeneous features of social media. The experimental results show the significance of the proposed approach in removing noise.
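Assigning points to clusters by climbing a density function to its local maxima can be sketched with mean shift, which performs gradient ascent on a Gaussian kernel density estimate with an adaptive step size. This is a generic stand-in chosen for illustration: the 1D data, bandwidth, and names are assumptions, and the paper's Laplacian zero-crossing boundaries and joint geographical-textual space are not reproduced.

```python
import math

def mean_shift_mode(x, data, bandwidth=1.0, steps=100):
    """Move x uphill on a Gaussian kernel density estimate of `data`."""
    for _ in range(steps):
        weights = [math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2)) for xi in data]
        # The weighted mean of the data is the mean shift update.
        x = sum(w * xi for w, xi in zip(weights, data)) / sum(weights)
    return x

# Hypothetical 1D "photo locations" around two places of interest.
data = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
modes = [mean_shift_mode(x, data) for x in data]

# Points that climb to the same density maximum share a cluster.
clusters = [round(m) for m in modes]
print(clusters)  # [0, 0, 0, 10, 10, 10]
```

Each point is labelled by the density maximum it converges to, so the number of clusters follows from the number of maxima rather than from a global density threshold.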

RFCell: A Gene Selection Approach for scRNA-seq Clustering Based on Permutation and Random Forest

Frontiers in Genetics ◽

10.3389/fgene.2021.665843 ◽

2021 ◽

Vol 12 ◽

Author(s):

Yuan Zhao ◽

Zhao-Yu Fang ◽

Cui-Xiang Lin ◽

Chao Deng ◽

Yun-Pei Xu ◽

...

Keyword(s):

Random Forest ◽

Single Cell ◽

Gene Selection ◽

Clustering Algorithms ◽

Selection Methods ◽

Clustering Methods ◽

Cell Type ◽

Cell Type Specificity ◽

Random Forest Classification ◽

Forest Classification

In recent years, the application of single cell RNA-seq (scRNA-seq) has become more and more popular in fields such as biology and medical research. Analyzing scRNA-seq data can discover complex cell populations and infer single-cell trajectories in cell development. Clustering is one of the most important methods to analyze scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection, which also reduces the dimensionality of scRNA-seq data. Studies have shown that gene selection for scRNA-seq data can improve clustering accuracy. Therefore, it is important to select genes with cell type specificity. Gene selection not only helps to reduce the dimensionality of scRNA-seq data, but can also improve cell type identification in combination with clustering methods. Here, we propose RFCell, a supervised gene selection method based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq data sets. Then, three classical clustering algorithms are used to cluster the cells using the gene sets obtained by these gene selection methods. We found that the gene selection performance of RFCell was better than that of the other gene selection methods.
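The abstract does not spell out how permutation and random forest classification are combined in RFCell. One generic technique built from these ingredients is permutation feature importance: shuffle one gene's values across cells and measure how much a trained classifier's accuracy drops. The sketch below shows that idea only. To stay dependency-free it swaps the random forest for a nearest-centroid classifier, and the expression matrix, labels, and function names are all hypothetical.

```python
import random

def fit_centroids(rows, labels):
    """Mean expression profile per class (stand-in for a random forest)."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, row):
    """Class whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(row, centroids[y])))

def accuracy(centroids, rows, labels):
    return sum(predict(centroids, r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(rows, labels, repeats=5, seed=0):
    """Average accuracy drop after shuffling each gene (column) in turn."""
    rng = random.Random(seed)
    model = fit_centroids(rows, labels)
    base = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(base - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / repeats)
    return importances

# Hypothetical data: gene 0 separates the two cell types, gene 1 is constant.
cells = [[0.0, 1.0] for _ in range(10)] + [[5.0, 1.0] for _ in range(10)]
cell_types = [0] * 10 + [1] * 10
print(permutation_importance(cells, cell_types))
```

Genes whose shuffling leaves accuracy unchanged carry no cell type information under this criterion and would be candidates for removal before clustering.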
