The k-means Algorithm: A Comprehensive Survey and Performance Evaluation

Mohiuddin Ahmed; Raihan Seraj; Syed Mohammed Shamsul Islam

doi:10.3390/electronics9081295

Electronics ◽

10.3390/electronics9081295 ◽

2020 ◽

Vol 9 (8) ◽

pp. 1295 ◽

Cited By ~ 4

Author(s):

Mohiuddin Ahmed ◽

Raihan Seraj ◽

Syed Mohammed Shamsul Islam

Keyword(s):

Experimental Analysis ◽

Clustering Algorithm ◽

Fundamental Problem ◽

Clustering Algorithms ◽

Data Types ◽

Data Mining Algorithms ◽

Recent Developments ◽

Comprehensive Survey ◽

And Performance ◽

Mining Algorithms

The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids, which leads to unexpected convergence. Additionally, such a clustering algorithm requires the number of clusters to be defined beforehand, which is responsible for different cluster shapes and outlier effects. A fundamental problem of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithm, including their recent developments, are discussed, and their effectiveness is investigated based on the experimental analysis of a variety of datasets. The detailed experimental analysis, along with a thorough comparison among different k-means clustering algorithms, differentiates our work from other existing survey papers. Furthermore, it outlines a clear and thorough understanding of the k-means algorithm along with its different research directions.
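
The sensitivity to random centroid initialization mentioned in this abstract can be illustrated with a minimal sketch of Lloyd's algorithm in plain Python. This is not code from the paper: the function names (`kmeans`, `sq_dist`, `mean`), the toy data, and the restart-and-keep-best strategy are assumptions chosen for illustration. Restarting from several seeds is one common mitigation, not necessarily one the survey evaluates.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def kmeans(points, k, seed, max_iters=100):
    """Lloyd's algorithm; returns (centroids, sum of squared errors)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:  # assignments stopped changing
            break
        centroids = updated
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


# Hypothetical toy data: three loose groups of 2-D points.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (5.0, 5.0), (5.3, 4.8), (4.9, 5.4),
          (9.0, 0.0), (9.2, 0.5), (8.7, 0.3)]

# Restart from several seeds and keep the run with the lowest SSE.
runs = [kmeans(points, k=3, seed=s) for s in range(10)]
best_centroids, best_sse = min(runs, key=lambda run: run[1])
print("SSE per seed:", [round(sse, 3) for _, sse in runs])
print("best SSE:", round(best_sse, 3))
```

Printing the per-seed SSE makes it visible whether different initializations converge to different local optima on a given dataset. The number of clusters `k` still has to be supplied by the caller, which is the second limitation the abstract names.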

A Comprehensive Survey of Knowledge Graph-Based Recommender Systems: Technologies, Development, and Contributions

Information ◽

10.3390/info12060232 ◽

2021 ◽

Vol 12 (6) ◽

pp. 232

Author(s):

Janneth Chicaiza ◽

Priscila Valdiviezo-Diaz

Keyword(s):

Recommender Systems ◽

Survey Result ◽

Knowledge Graph ◽

Specific Recommendation ◽

Future Directions ◽

Recent Developments ◽

Comprehensive Survey ◽

System 2 ◽

And Performance ◽

Use Of Knowledge

In recent years, the use of recommender systems has become popular on the web. To improve recommendation performance, usage, and scalability, the research has evolved by producing several generations of recommender systems. There is much literature about it, although most proposals focus on traditional methods’ theories and applications. Recently, knowledge graph-based recommendations have attracted attention in academia and the industry because they can alleviate information sparsity and performance problems. We found only two studies that analyze the recommendation system’s role over graphs, but they focus on specific recommendation methods. This survey attempts to cover a broader analysis from a set of selected papers. In summary, the contributions of this paper are as follows: (1) we explore traditional and more recent developments of filtering methods for a recommender system, (2) we identify and analyze proposals related to knowledge graph-based recommender systems, (3) we present the most relevant contributions using an application domain, and (4) we outline future directions of research in the domain of recommender systems. As the main survey result, we found that the use of knowledge graphs for recommendations is an efficient way to leverage and connect a user’s and an item’s knowledge, thus providing more precise results for users.

Challenges to Use Recommender Systems to Enhance Meta-Cognitive Functioning in Online Learners

Data Mining ◽

10.4018/978-1-4666-2455-9.ch099 ◽

2013 ◽

pp. 1916-1935

Author(s):

Mingming Zhou ◽

Yabo Xu

Keyword(s):

Learning Environment ◽

Cognitive Skills ◽

Cognitive Activity ◽

Cognitive Strategies ◽

Monitoring And Control ◽

Online Learners ◽

Data Mining Algorithms ◽

E Learning ◽

And Performance ◽

Mining Algorithms

A wealth of research has shown that meta-cognition plays a crucial role in the promotion of effective school learning. In most e-learning environment designs, however, meta-cognitive strategies have generally been neglected, and therefore satisfactory uses of these strategies have rarely been realized. Most learners are not even aware of what they have been studying. If the learning system could automatically guide and intelligently recommend learning activities or strategies to facilitate students' monitoring and control of their learning, it would favor and improve their learning process and performance. Unfortunately, almost no e-learning systems to date have attempted to do so. In this chapter, we first describe the need for enhancing meta-cognitive skills in e-learning environments, followed by an outline of major challenges for meta-cognitive activity recommendations. We then propose adopting data mining algorithms (i.e., content-based and sequence-based recommendation techniques) to address the identified issues, illustrated with a toy example.

Data Fusion Using a Multi-Sensor Sparse-Based Clustering Algorithm

Remote Sensing ◽

10.3390/rs12234007 ◽

2020 ◽

Vol 12 (23) ◽

pp. 4007

Author(s):

Kasra Rafiezadeh Shahi ◽

Pedram Ghamisi ◽

Behnood Rasti ◽

Robert Jackisch ◽

Paul Scheunders ◽

...

Keyword(s):

Clustering Algorithm ◽

Spatial Information ◽

Clustering Algorithms ◽

Hyperspectral Data ◽

Sensor Data ◽

Data Sets ◽

Data Types ◽

Data Set ◽

Multiple Data Sets ◽

Imaging Sensors

The increasing amount of information acquired by imaging sensors in Earth Sciences results in the availability of a multitude of complementary data (e.g., spectral, spatial, elevation) for monitoring of the Earth’s surface. Many studies were devoted to investigating the usage of multi-sensor data sets in the performance of supervised learning-based approaches at various tasks (i.e., classification and regression) while unsupervised learning-based approaches have received less attention. In this paper, we propose a new approach to fuse multiple data sets from imaging sensors using a multi-sensor sparse-based clustering algorithm (Multi-SSC). A technique for the extraction of spatial features (i.e., morphological profiles (MPs) and invariant attribute profiles (IAPs)) is applied to high spatial-resolution data to derive the spatial and contextual information. This information is then fused with spectrally rich data such as multi- or hyperspectral data. In order to fuse multi-sensor data sets a hierarchical sparse subspace clustering approach is employed. More specifically, a lasso-based binary algorithm is used to fuse the spectral and spatial information prior to automatic clustering. The proposed framework ensures that the generated clustering map is smooth and preserves the spatial structures of the scene. In order to evaluate the generalization capability of the proposed approach, we investigate its performance not only on diverse scenes but also on different sensors and data types. The first two data sets are geological data sets, which consist of hyperspectral and RGB data. The third data set is the well-known benchmark Trento data set, including hyperspectral and LiDAR data. Experimental results indicate that this novel multi-sensor clustering algorithm can provide an accurate clustering map compared to the state-of-the-art sparse subspace-based clustering algorithms.

BioMedR: an R/CRAN package for integrated data analysis pipeline in biomedical study

Briefings in Bioinformatics ◽

10.1093/bib/bbz150 ◽

2019 ◽

Cited By ~ 2

Author(s):

Jie Dong ◽

Min-Feng Zhu ◽

Yong-Huan Yun ◽

Ai-Ping Lu ◽

Ting-Jun Hou ◽

...

Keyword(s):

Data Mining ◽

Clustering Algorithms ◽

R Package ◽

Integrated Analysis ◽

Analysis Pipeline ◽

Molecular Fingerprints ◽

Useful Knowledge ◽

Data Mining Algorithms ◽

Mining Methods ◽

Mining Algorithms

Background: With the increasing development of biotechnology and information technology, publicly available data in chemistry and biology are undergoing explosive growth. The wealth of information in these resources needs to be extracted and then transformed into useful knowledge by various data mining methods. However, a main computational challenge is how to effectively represent or encode the molecular objects under investigation, such as chemicals, proteins, DNAs and even complicated interactions, when data mining methods are employed. To further explore these complicated data, an integrated toolkit to represent different types of molecular objects and support various data mining algorithms is urgently needed. Results: We developed a freely available R/CRAN package, called BioMedR, for molecular representations of chemicals, proteins, DNAs and pairwise samples of their interactions. The current version of BioMedR can calculate 293 molecular descriptors and 13 kinds of molecular fingerprints for small molecules, 9920 protein descriptors based on protein sequences and six types of generalized scale-based descriptors for proteochemometric modeling, more than 6000 DNA descriptors from nucleotide sequences and six types of interaction descriptors using three different combining strategies. Moreover, the package implements five similarity calculation methods and four powerful clustering algorithms as well as several useful auxiliary tools, with the aim of building an integrated analysis pipeline for data acquisition, data checking, descriptor calculation and data modeling. Conclusion: BioMedR provides a comprehensive and uniform R package to link up different representations of molecular objects with each other and will benefit cheminformatics/bioinformatics and other biomedical users. It is available at: https://CRAN.R-project.org/package=BioMedR and https://github.com/wind22zhu/BioMedR/.

Dbscan Assisted by Hybrid Genetic K Means Algorithm

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f8061.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 1973-1979

Keyword(s):

Data Mining ◽

Time Complexity ◽

Clustering Algorithm ◽

Hybrid Approach ◽

Research Direction ◽

Computational Time ◽

Main Concern ◽

Complex Data ◽

Data Mining Algorithms ◽

Mining Algorithms

The performance of data mining algorithms becomes a main concern as data volumes grow. Clustering analysis is an active and contested research direction in data mining for complex data samples. DBSCAN is a density-based clustering algorithm with several advantages in numerous applications. However, DBSCAN has quadratic time complexity, which makes it impractical for realistic applications, particularly those with very large, complex data samples. Therefore, this paper proposes a hybrid approach that reduces the time complexity by exploring the core properties of DBSCAN in the initial stage using a genetic-based K-means partition algorithm. The experiments showed that the proposed hybrid approach obtains competitive results compared with the usual approach and drastically improves the computational time.
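
The quadratic cost mentioned in this abstract comes from the neighbourhood queries DBSCAN performs when no spatial index is used. The sketch below is an illustration only, not the paper's hybrid method: it finds DBSCAN core points by brute force and counts the distance evaluations. The function name `core_points`, the toy points, and the parameter values are assumptions.

```python
import math


def core_points(points, eps, min_pts):
    """Brute-force DBSCAN core-point detection.

    Returns (indices of core points, number of distance evaluations).
    A point is a core point if at least min_pts points, itself included,
    lie within distance eps of it.
    """
    evaluations = 0
    cores = []
    for i, p in enumerate(points):
        neighbours = 0
        for q in points:  # compare against every point: n checks per point
            evaluations += 1
            if math.dist(p, q) <= eps:
                neighbours += 1
        if neighbours >= min_pts:
            cores.append(i)
    return cores, evaluations


# Hypothetical toy data: a tight group of three points and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
cores, evaluations = core_points(points, eps=1.5, min_pts=3)
print(cores)        # [0, 1, 2]
print(evaluations)  # 16, i.e. n * n for n = 4
```

Doubling the number of points quadruples the number of distance evaluations in this brute-force form, which is why partitioning the data first, as the abstract describes, can reduce the running time.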

Educational Data Mining for Student Learning Pattern Analysis using Clustering Algorithms

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f1528.089620 ◽

2020 ◽

Vol 9 (6) ◽

pp. 481-488

Keyword(s):

Data Mining ◽

Student Performance ◽

Clustering Algorithms ◽

Educational Data Mining ◽

Data Mining Algorithm ◽

Learning Behavior ◽

Study Program ◽

Data Mining Algorithms ◽

Students First ◽

And Performance

The exponential increase in universities’ electronic data creates the need to derive useful information from these massive amounts of data. Progress in the data mining field makes it possible to use educational data to improve the quality of educational processes. This study therefore uses data mining methods to study the learning behavior and performance of university students. It focuses on two aspects of student performance: first, predicting students' learning behavior at the end of a complete year of the study program; second, predicting student performance with the help of the data model proposed by this study. Finally, it provides course material recommendations using the data mining algorithm. Three data mining algorithms were considered, namely K-Means, FCM, and KFCM, and a maximum accuracy of 90.22% was achieved by KFCM. The study indicates that, in terms of time and memory usage, the K-means algorithm gives better results. This creates an opportunity to identify students who may graduate with poor results or may not graduate at all, so that early intervention might be possible.

Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance

Proceedings of the 2002 SIAM International Conference on Data Mining ◽

10.1137/1.9781611972726.5 ◽

2002 ◽

Cited By ~ 14

Author(s):

Ruoming Jin ◽

Gagan Agrawal

Keyword(s):

Data Mining ◽

Shared Memory ◽

Data Mining Algorithms ◽

And Performance ◽

Mining Algorithms ◽

Programming Interface

The Influence on Clustering Results of Electricity Load Curves Using Different Distances

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.401-403.1440 ◽

2013 ◽

Vol 401-403 ◽

pp. 1440-1443 ◽

Cited By ~ 1

Author(s):

Tie Feng Zhang ◽

Fei Lv ◽

Rong Gu

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Pattern Extraction ◽

Clustering Techniques ◽

Comparison Results ◽

Power Load ◽

Load Pattern ◽

Electricity Load ◽

Mining Algorithms

Distance choice is an important issue in power load pattern extraction using clustering techniques, so it is necessary to determine how different distances in clustering algorithms influence the clustering results for load curves. In this paper, several distances are used in the k-means algorithm for clustering load curves, and their influences on the clustering results are analyzed to identify a suitable distance for the k-means algorithm. An example with load curves from 147 electricity customers shows that distances have different influences on the clustering results when the same clustering algorithm is used. The comparison results indicate that the choice of distance is an important issue in power load pattern extraction using clustering techniques and that a suitable distance may improve the accuracy of mining algorithms.
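
To make the effect of the distance choice concrete, the following sketch shows the assignment step of k-means picking a different centroid for the same curve depending on the distance used. The abstract does not say which distances the paper compares, so the three used here (Euclidean, Manhattan, Chebyshev), the four-sample "load curves", and all names are assumptions for illustration.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


def nearest(curve, centroids, distance):
    """Index of the centroid closest to `curve` under `distance`."""
    return min(range(len(centroids)), key=lambda j: distance(curve, centroids[j]))


# Hypothetical four-sample load curves (arbitrary units).
curve = (1.0, 1.0, 1.0, 1.0)
centroids = [
    (3.0, 3.0, 3.0, 3.0),  # uniformly higher load
    (6.0, 1.0, 1.0, 1.0),  # identical except for one sharp peak
]

for name, distance in [("euclidean", euclidean),
                       ("manhattan", manhattan),
                       ("chebyshev", chebyshev)]:
    print(name, "->", nearest(curve, centroids, distance))
# euclidean -> 0
# manhattan -> 1
# chebyshev -> 0
```

Manhattan distance tolerates a single large deviation better than a small deviation spread over every sample, so it assigns the curve to the peaked centroid, while the other two do not. Note that the usual mean-based centroid update is only guaranteed to decrease the clustering objective for squared Euclidean distance, so swapping the distance in a full k-means loop may also require a different update rule.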

Analysis And Detection of Diabetes Using Data Mining Techniques – Efficiency Comparison

International Journal of Scientific Research in Science and Technology ◽

10.32628/cseit217425 ◽

2021 ◽

pp. 73-79

Author(s):

G. Ramadevi ◽

Srujitha Yeruva ◽

P. Sravanthi ◽

P. Eknath Vamsi ◽

S. Jaya Prakash

Keyword(s):

Data Mining ◽

Clustering Algorithms ◽

Medical Data ◽

Healthcare Sector ◽

Data Mining Techniques ◽

Data Mining Algorithms ◽

Use Of Data ◽

Efficiency Comparison ◽

Using Data ◽

Mining Algorithms

In a digitized world, data is growing exponentially, and it is difficult to analyze the data and produce results. Data mining techniques play an important role in the healthcare sector (big data). By making use of data mining algorithms, it is possible to analyze, detect and predict the presence of disease, which helps doctors detect the disease early and supports decision making. The objective of the data mining techniques used is to design an automated tool that notifies doctors of a patient’s treatment history, disease and medical data. Data mining techniques are very useful in analyzing medical data to find meaningful and practical patterns. This project works on diabetes medical data; classification and clustering algorithms (OPTICS, Naive Bayes, and BIRCH) are implemented and their efficiency is examined.

Distributed Algorithms for Maximizing Lifetime in Clustered Wireless Sensor Networks Using Energy-Harvesting Relay Nod

Journal of Electronic Research and Application ◽

10.26689/jera.v2i4.510 ◽

2018 ◽

Vol 2 (4) ◽

Author(s):

Pengfei Zhang ◽

Hwee-Pink Tan ◽

Gaoxi Xiao

Keyword(s):

Energy Harvesting ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Wireless Sensor ◽

Distributed Clustering ◽

Clustering Techniques ◽

Extensive Simulation ◽

Joint Problem ◽

Recent Developments ◽

Simulation Results

Motivated by recent developments in Wireless Sensor Networks (WSNs), we present distributed clustering algorithms for maximizing the lifetime of WSNs, i.e., the duration until the first node dies. We study the joint problem of prolonging network lifetime by introducing clustering techniques and energy-harvesting (EH) nodes. First, we propose a distributed clustering algorithm for maximizing the lifetime of a clustered WSN that includes EH nodes serving as relay nodes for cluster heads (CHs). Second, graph-based and LP-based EH-CH matching algorithms are proposed, which serve as benchmark algorithms. Extensive simulation results show that the proposed algorithms can achieve optimal or suboptimal solutions efficiently.
