scholarly journals Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

2020 ◽  
Vol 28 (4) ◽  
pp. 531-561 ◽  
Author(s):  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.

2020 ◽  
Author(s):  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.


2020 ◽  
Author(s):  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.


2020 ◽  
Author(s):  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.


2020 ◽  
Author(s):  
Andrew Lensen ◽  
Bing Xue ◽  
Mengjie Zhang

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.


Author(s):  
Pradeep Kumar Kumar ◽  
Raju S. Bapi ◽  
P. Radha Krishna

With the growth in the number of web users and necessity for making information available on the web, the problem of web personalization has become very critical and popular. Developers are trying to customize a web site to the needs of specific users with the help of knowledge acquired from user navigational behavior. Since user page visits are intrinsically sequential in nature, efficient clustering algorithms for sequential data are needed. In this paper, we introduce a similarity preserving function called sequence and set similarity measure S3M that captures both the order of occurrence of page visits as well as the content of pages. We conducted pilot experiments comparing the results of PAM, a standard clustering algorithm, with two similarity measures: Cosine and S3M. The goodness of the clusters resulting from both the measures was computed using a cluster validation technique based on average levensthein distance. Results on pilot dataset established the effectiveness of S3M for sequential data. Based on these results, we proposed a new clustering algorithm, SeqPAM for clustering sequential data. We tested the new algorithm on two datasets namely, cti and msnbc datasets. We provided recommendations for web personalization based on the clusters obtained from SeqPAM for msnbc dataset.


2021 ◽  
Author(s):  
James Anibal ◽  
Alexandre Day ◽  
Erol Bahadiroglu ◽  
Liam O'Neill ◽  
Long Phan ◽  
...  

Data clustering plays a significant role in biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g. T cells), whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for immense datasets, achieving a considerable improvement in computational efficiency on large datasets compared to existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single cell classification in a scalable, tunable and rapid manner. We make HAL-x publicly available at: https://pypi.org/project/hal-x/


2008 ◽  
pp. 3285-3307
Author(s):  
Pradeep Kumar ◽  
Raju S. Bapi ◽  
P. Radha Krishna

With the growth in the number of Web users and necessity for making information available on the Web, the problem of Web personalization has become very critical and popular. Developers are trying to customize a Web site to the needs of specific users with the help of knowledge acquired from user navigational behavior. Since user page visits are intrinsically sequential in nature, efficient clustering algorithms for sequential data are needed. In this chapter, we introduce a similarity preserving function called sequence and set similarity measure S3M that captures both the order of occurrence of page visits as well as the content of pages. We conducted pilot experiments comparing the results of PAM, a standard clustering algorithm, with two similarity measures: Cosine and S3M. The goodness of the clusters resulting from both the measures was computed using a cluster validation technique based on average levensthein distance. Results on pilot dataset established the effectiveness of S3M for sequential data. Based on these results, we proposed a new clustering algorithm, SeqPAM for clustering sequential data. We tested the new algorithm on two datasets namely, cti and msnbc datasets. We provided recommendations for Web personalization based on the clusters obtained from SeqPAM for msnbc dataset.


2020 ◽  
Vol 11 (1) ◽  
pp. 45-60
Author(s):  
Billel Kenidra ◽  
Mohamed Benmohammed

The clustering process is used to identify cancer subtypes based on gene expression and DNA methylation datasets, since cancer subtype information is critically important for understanding tumor heterogeneity, detecting previously unknown clusters of biological samples, which are usually associated with unknown types of cancer will, in turn, gives way to prescribe more effective treatments for patients. This is because cancer has varying subtypes which often respond disparately to the same treatment. While the DNA methylation database is extremely large-scale datasets, running time still remains a major challenge. Actually, traditional clustering algorithms are too slow to handle biological high-dimensional datasets, they usually require large amounts of computational time. The proposed clustering algorithm extraordinarily overcomes all others in terms of running time, it is able to rapidly identify a set of biologically relevant clusters in large-scale DNA methylation datasets, its superiority over the others has been demonstrated regarding its relative speed.


2019 ◽  
Vol 8 (1) ◽  
pp. 1-17
Author(s):  
Thuan Tan Nguyen ◽  
Ban Van Doan ◽  
Chau Ngoc Truong ◽  
Trinh Thi Thuy Tran

The purpose of the clustering method is to provide some meaningful partitioning of the data set. In general, finding separate clusters with similar members is essential. A problem in clustering is how to determine the number of optimal clusters that best fits the data set. Most clustering algorithms generate a partition based on input parameters (for example, cluster number, minimum density) which results in limiting the number of clusters. Therefore, the article proposes an improved EMC clustering algorithm that is more flexible in handling and manipulating those clusters, where input parameter values are assumed to be different clusters for different partitions of a data set. In addition, based on the above partitioning results, this article proposes a new approach to processing and optimizing fuzzy queries to improve efficiency in the manipulation and processing of specific data such as (less time consuming, less resource consuming)


Sign in / Sign up

Export Citation Format

Share Document