Research and Application of Improved Clustering Algorithm in Retail Customer Classification

Symmetry ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 1789
Author(s):  
Chu Fang ◽  
Haiming Liu

Clustering is a major field in data mining and an important method of data partitioning or grouping. It has been applied in various ways to commerce, market analysis, biology, web classification, and so on. Clustering algorithms include the partitioning method, hierarchical clustering, and density-based, grid-based, model-based, and fuzzy clustering. The K-means algorithm is one of the essential clustering algorithms; it is a clustering algorithm based on the partitioning method. This study's aim was to improve the algorithm through research and, with regard to application, to use the algorithm for customer segmentation. Customer segmentation is an essential element of an enterprise's utilization of CRM. The first part of the paper elaborates the object of study, its background, and the goal the article aims to achieve; it also discusses the research approach and the overall content. The second part introduces basic knowledge on clustering and methods of clustering analysis, assessing different algorithms and identifying their advantages and disadvantages through comparison. The third part introduces the application of the algorithm, as the study applies clustering technology to customer segmentation. First, the customer value system is built through AHP; customer value is then quantified, and customers are divided into different classifications using clustering technology. Efficient CRM can thus be practiced according to the different customer classifications. Currently, there are some systems used to evaluate customer value, but none of them can be put into practice efficiently. To solve this problem, the concept of continuous symmetry is introduced; detecting the continuous symmetry of a given problem is very important.
It allows for the detection of an observable state whose components are nonlinear functions of the original unobservable state. Thus, we built an evaluation system for customer value, in line with the development of the enterprise, using data mining methods, based on the practical situation of the enterprise and on a series of practical evaluation indexes for customer value. The evaluation system can be used to quantify customer value, to segment customers, and to build a decision-support system for customer value management. The fourth part presents the core of the paper, mainly an analysis of the typical k-means algorithm; the paper proposes two algorithms to improve k-means. Improved Algorithm A can determine K automatically and can ensure, to some degree, that the global optimum is reached. Improved Algorithm B, which combines a sampling technique with a hierarchical agglomeration algorithm, is much more efficient than the k-means algorithm. In conclusion, the main findings of the study and further research directions are presented.
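The idea behind Improved Algorithm B, combining sampling with agglomeration to seed k-means, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the agglomeration here simply merges the closest pair of clusters on a random sample until K centers remain, and all function names are illustrative.

```python
import numpy as np

def agglomerative_seeds(sample, k):
    """Merge the closest pair of clusters on a small sample until k
    clusters remain, then return their means as initial centers."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        means = np.array([np.mean(c, axis=0) for c in clusters])
        d = np.linalg.norm(means[:, None] - means[None], axis=-1)
        np.fill_diagonal(d, np.inf)       # ignore self-distances
        i, j = np.unravel_index(np.argmin(d), d.shape)
        clusters[i].extend(clusters[j])   # merge the closest pair
        del clusters[j]
    return np.array([np.mean(c, axis=0) for c in clusters])

def sampled_agglomerative_kmeans(X, k, sample_size=20, iters=50, seed=0):
    """Seed k-means with centers agglomerated from a random sample."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    centers = agglomerative_seeds(sample, k)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        # keep the old center if a cluster ever becomes empty
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels, centers
```

Because agglomeration only touches the small sample, the seeding cost stays low while the initial centers already reflect the data's grouping.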

2014 ◽  
Vol 596 ◽  
pp. 951-959 ◽  
Author(s):  
Yu Peng Ma ◽  
Bo Ma ◽  
Tong Hai Jiang

With the rapid growth of electronic commerce (EC) customers, EC service providers are keen to analyze the online browsing behavior of customers on their web sites and learn their specific features. Clustering is a popular non-directed learning data mining technique for partitioning a dataset into a set of clusters. Although there are many clustering algorithms, none is superior for the task of customer segmentation. This suggests that a proper clustering algorithm should be designed for the EC environment. In this paper we address this situation and propose an improved k-means algorithm that effectively excludes noisy data and improves clustering accuracy. Experimental results obtained in a real EC environment are provided to demonstrate the effectiveness and feasibility of the proposed approach.
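The general idea of excluding noisy data from k-means can be illustrated with a two-pass sketch; the abstract does not specify the authors' exact method, so the distance-based outlier rule (z standard deviations above the mean) used here is an assumption.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd-style k-means with random initial centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels, centers

def denoised_kmeans(X, k, z=2.0):
    """Illustrative two-pass scheme: cluster once, flag points whose
    distance to their center exceeds mean + z * std, re-cluster the rest."""
    labels, centers = kmeans(X, k)
    dist = np.linalg.norm(X - centers[labels], axis=1)
    keep = dist <= dist.mean() + z * dist.std()
    clean_labels, clean_centers = kmeans(X[keep], k)
    return keep, clean_labels, clean_centers
```

The first pass only serves to estimate which points are atypically far from any center; the second pass then fits clusters on the retained points.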


2021 ◽  
Vol 8 (10) ◽  
pp. 43-50
Author(s):  
Truong et al.

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to select the attribute with fewer values and the leaf node with more objects, leading to undesirable clustering results. To overcome these shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.
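The rough-set roughness measure at the heart of MMR-style algorithms can be sketched as follows, assuming the standard definition roughness(X) = 1 − |lower(X)| / |upper(X)|; the attribute and splitting-value selection details of MMR and IMMR are not reproduced here.

```python
from collections import defaultdict

def partitions(rows, attr):
    """Equivalence classes (sets of row indices) induced by one
    categorical attribute."""
    p = defaultdict(set)
    for i, row in enumerate(rows):
        p[row[attr]].add(i)
    return list(p.values())

def mean_roughness(rows, a, b):
    """Mean roughness of attribute a with respect to attribute b:
    for each a-class X, lower(X) unions the b-classes fully inside X,
    upper(X) unions the b-classes that intersect X."""
    b_classes = partitions(rows, b)
    rough = []
    for X in partitions(rows, a):
        lower = sum(len(c) for c in b_classes if c <= X)
        upper = sum(len(c) for c in b_classes if c & X)
        rough.append(1 - lower / upper)
    return sum(rough) / len(rough)
```

An MMR-style algorithm would compute, for each attribute, the minimum of these mean roughness values over all other attributes, and split on the attribute achieving the overall minimum.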


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Lopamudra Dey ◽  
Sanjay Chakraborty

The significance and application of clustering are spread over various fields. Clustering is an unsupervised process in data mining, which is why proper evaluation of the results and measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indexes are used to solve different types of problems, and index selection depends on the kind of available data. This paper first proposes a Canonical PSO based K-means clustering algorithm, analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on a real-time air pollution database and on wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. The paper also describes the nature of the clusters and finally compares the performance of these clustering algorithms according to the validity assessment, identifying which algorithm is most suitable for forming proper compact clusters on these particular real-life datasets. It thus examines the behaviour of these clustering algorithms with respect to validation indexes and presents the evaluation results in mathematical and graphical forms.
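The intercluster and intracluster indices analysed above can be illustrated with a minimal sketch; simple centroid-based definitions are assumed here (mean point-to-centroid distance for compactness, minimum centroid-to-centroid distance for separation), which may differ from the paper's exact formulas.

```python
import numpy as np

def intra_inter(X, labels):
    """Compactness: mean distance of points to their own centroid.
    Separation: minimum distance between centroids.
    Lower intra and higher inter indicate better-formed clusters."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # map each label to its centroid row (ks is sorted by np.unique)
    intra = np.mean(np.linalg.norm(X - centroids[np.searchsorted(ks, labels)], axis=1))
    d = np.linalg.norm(centroids[:, None] - centroids[None], axis=-1)
    inter = d[np.triu_indices(len(ks), 1)].min()
    return intra, inter
```

Ratios of such quantities underlie many validity indices, so a clustering that shrinks intra while growing inter scores well on most of them.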


Author(s):  
Junjie Wu ◽  
Jian Chen ◽  
Hui Xiong

Cluster analysis (Jain & Dubes, 1988) provides insight into the data by dividing the objects into groups (clusters), such that objects in a cluster are more similar to each other than objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and Unweighted Pair Group Method with Arithmetic Mean (UPGMA), have been well-established. A recent research focus in clustering analysis is to understand the strengths and weaknesses of various clustering algorithms with respect to data factors. Indeed, researchers have identified some data characteristics that may strongly affect clustering analysis, including high dimensionality and sparseness, large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is needed to reveal whether and how data distributions can affect the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions: 1. What are the systematic differences between the distributions of the resultant clusters produced by different clustering algorithms? 2. How can the distribution of the "true" cluster sizes affect the performance of clustering algorithms? 3. How should an appropriate clustering algorithm be chosen in practice? The answers to these questions can guide us toward a better understanding and use of clustering methods. This is noteworthy, since 1) in theory, it is seldom recognized that there are strong relationships between clustering algorithms and cluster size distributions, and 2) in practice, choosing an appropriate clustering algorithm is still a challenging task, especially after the boom of algorithms in the data mining area. This chapter thus makes an initial attempt to fill this void.
To this end, we carefully select two widely used categories of clustering algorithms, i.e., K-means and Agglomerative Hierarchical Clustering (AHC), as the representative algorithms for illustration. In the chapter, we first show that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. Then we demonstrate that UPGMA, one of the robust AHC methods, acts in the opposite way to K-means; that is, UPGMA tends to generate clusters with high variation in cluster sizes. Indeed, the experimental results indicate that the variations of the resultant cluster sizes produced by K-means and UPGMA, measured by the Coefficient of Variation (CV), fall in specific intervals, namely [0.3, 1.0] and [1.0, 2.5] respectively. Finally, we put K-means and UPGMA together for a further comparison, and propose some rules for a better choice of clustering scheme from the data distribution point of view.
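The Coefficient of Variation used above to characterise cluster-size distributions is straightforward to compute; note that this sketch uses the population standard deviation, while the chapter may use the sample version.

```python
import numpy as np

def cluster_size_cv(labels):
    """CV = std / mean of the cluster sizes. The chapter reports K-means
    typically in [0.3, 1.0] and UPGMA in [1.0, 2.5] on this measure."""
    sizes = np.bincount(np.asarray(labels))
    sizes = sizes[sizes > 0]          # ignore unused label values
    return sizes.std() / sizes.mean()
```

A perfectly uniform clustering has CV = 0, and the more skewed the cluster sizes, the larger the CV, which is what makes it a convenient summary for comparing K-means against UPGMA.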


2013 ◽  
Vol 380-384 ◽  
pp. 1290-1293
Author(s):  
Qing Ju Guo ◽  
Wen Tian Ji ◽  
Sheng Zhong

Many research findings on clustering algorithms have been produced at home and abroad in recent years. Addressing the traditional partition-based clustering method, the K-means algorithm, this paper, after analyzing its advantages and disadvantages, combines it with an ontology-based data set to establish a semantic web model. It improves the existing clustering algorithm under various constraint conditions, with the aim of demonstrating that the improved algorithm has better efficiency and accuracy on the semantic web.


2014 ◽  
Vol 926-930 ◽  
pp. 3608-3611 ◽  
Author(s):  
Yi Fan Zhang ◽  
Yong Tao Qian ◽  
Tai Yu Liu ◽  
Shu Yan Wu

This paper first introduces data mining fundamentals and then focuses on clustering analysis algorithms: the classification of clustering algorithms and, for each category, a typical cluster analysis algorithm, giving a formal description of each algorithm along with a more detailed account of its advantages and disadvantages. It then introduces data mining algorithms built on the basis of cluster analysis. Using a cohesion-based clustering algorithm together with the DBSCAN algorithm, consumer-spending data are clustered in two-dimensional space, with 2,000 data points per area, and reasonable clustering results are obtained; the hierarchical clustering results yield valuable information, thereby combining the practical application of the algorithms with clustering analysis theory.
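The DBSCAN algorithm applied to the two-dimensional data above can be sketched minimally as follows; the eps and min_pts values are illustrative, not taken from the paper.

```python
import numpy as np

def dbscan(X, eps=1.5, min_pts=3):
    """Minimal DBSCAN sketch: label -1 marks noise; clusters are grown
    outward from core points (points with at least min_pts neighbours,
    themselves included, within radius eps)."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)
    neighbours = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbours])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        frontier = [i]                 # expansion proceeds from core points only
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if core[q]:
                        frontier.append(q)
        cluster += 1
    return labels
```

Because clusters grow only through core points, isolated points that fall inside no dense region keep the label -1 and are reported as noise, which is what makes DBSCAN attractive for spending data with outliers.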


2019 ◽  
Vol 8 (4) ◽  
pp. 6036-6040

Data mining is a most vital area of research and is pragmatically utilized in many different domains; it has become a highly demanding field because huge amounts of data have been collected in various applications. A database can be clustered in many ways depending on the clustering algorithm used, parameter settings, and other factors. Multiple clustering algorithms can be combined to obtain a final partitioning of the data that provides better clustering results. In this paper, an ensemble hybrid KMeans and DBSCAN (HDKA) algorithm is proposed to overcome the drawbacks of the DBSCAN and KMeans clustering algorithms. The proposed algorithm improves the selection of centroid points through a centroid selection strategy. For the experimental results we used two datasets, Colon and Leukemia, from the UCI machine learning repository.
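HDKA's centroid selection strategy is not detailed in the abstract; the following is a hedged sketch of one density-guided seeding scheme for k-means in the same spirit, where the function names and the eps parameter are illustrative assumptions.

```python
import numpy as np

def density_seeds(X, k, eps=1.0):
    """Rank points by the number of eps-neighbours, then greedily pick
    k seeds that are mutually farther than eps apart."""
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)
    density = (d <= eps).sum(axis=1)
    seeds = []
    for i in np.argsort(-density):     # densest points first
        if all(d[i, j] > eps for j in seeds):
            seeds.append(i)
        if len(seeds) == k:
            break
    return X[seeds]

def kmeans_from_seeds(X, centers, iters=50):
    """Standard k-means iterations starting from the given centers."""
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(len(centers))])
    return labels, centers
```

Seeding from dense, well-separated regions addresses the classic KMeans weakness of random initialization, which is the kind of drawback a KMeans/DBSCAN hybrid is designed to avoid.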


2011 ◽  
Vol 121-126 ◽  
pp. 4675-4679
Author(s):  
Ming Wei Leng ◽  
Xiao Yun Chen ◽  
Jian Jun Cheng ◽  
Long Jie Li

In many data mining domains, labeled data is very expensive to generate; how to make the best use of labeled data to guide the clustering of unlabeled data is the core problem of semi-supervised clustering. Most semi-supervised clustering algorithms require a certain amount of labeled data and need the values of some parameters to be set, and different values may produce different results. In view of this, a new algorithm, called semi-supervised clustering based on a small amount of labeled data, is presented; it can use the small labeled set to expand the labeled dataset by labeling the k-nearest neighbors of labeled points, and it needs only one parameter. We demonstrate our clustering algorithm on three UCI datasets; compared with SSDBSCAN [4] and KNN, the experimental results confirm that the accuracy of our clustering algorithm is close to that of the KNN classification algorithm.
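The label-expansion step (labeling the k-nearest neighbors of each labeled point) can be sketched as follows; k is the algorithm's single parameter, as the abstract emphasises, but the rest of the procedure here is an illustrative reading, not the authors' exact method.

```python
import numpy as np

def expand_labels(X, labels, k=3):
    """Each labeled point donates its label to its k nearest still-unlabeled
    neighbours; the value -1 marks an unlabeled object."""
    labels = np.asarray(labels).copy()
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)
    for i in np.flatnonzero(labels != -1):   # iterate over the original labeled set
        unlabelled = np.flatnonzero(labels == -1)
        if len(unlabelled) == 0:
            break
        nearest = unlabelled[np.argsort(d[i, unlabelled])[:k]]
        labels[nearest] = labels[i]
    return labels
```

The expanded labeled set can then seed any standard clustering or classification step, which is how a small amount of supervision propagates through the data.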


Author(s):  
Mamta Mittal ◽  
R. K. Sharma ◽  
V.P. Singh ◽  
Lalit Mohan Goyal

Clustering is one of the data mining techniques that investigates data resources for hidden patterns. Many clustering algorithms are available in the literature. This chapter emphasizes partitioning-based methods and is an attempt towards developing clustering algorithms that can efficiently detect clusters. Among partitioning-based methods, k-means and single pass clustering are popular clustering algorithms, but they have several limitations. To overcome the limitations of these algorithms, a Modified Single Pass Clustering (MSPC) algorithm is proposed in this work. It revolves around the proposition of a threshold similarity value. This is not a user-defined parameter; instead, it is a function of the data objects left to be clustered. In our experiments, this threshold similarity value is taken as the median of the paired distances of all data objects left to be clustered. To assess the performance of the MSPC algorithm, five experiments with the k-means, SPC, and MSPC algorithms have been carried out on artificial and real datasets.
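The MSPC threshold rule, the median of the paired distances of the objects left to be clustered, can be sketched as follows; the single-pass assignment details (seeding each cluster with the first remaining object) are an assumption for illustration, not the authors' exact procedure.

```python
import numpy as np

def mspc(X):
    """Single-pass clustering where the similarity threshold is recomputed
    at each step as the median pairwise distance of the remaining objects."""
    remaining = list(range(len(X)))
    clusters = []
    while remaining:
        if len(remaining) == 1:
            clusters.append([remaining[0]])
            break
        pts = X[remaining]
        d = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
        threshold = np.median(d[np.triu_indices(len(pts), 1)])
        # seed a cluster with the first remaining object and absorb
        # every remaining object within the threshold of it
        seed = remaining[0]
        members = [i for i in remaining
                   if np.linalg.norm(X[i] - X[seed]) <= threshold]
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters
```

Because the threshold is derived from the data still to be clustered, it adapts as clusters are peeled off, which is precisely what removes the user-defined parameter of plain single pass clustering.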

