The k-means Algorithm: A Comprehensive Survey and Performance Evaluation

Electronics ◽  
2020 ◽  
Vol 9 (8) ◽  
pp. 1295 ◽  
Author(s):  
Mohiuddin Ahmed ◽  
Raihan Seraj ◽  
Syed Mohammed Shamsul Islam

The k-means clustering algorithm is considered one of the most powerful and popular data mining algorithms in the research community. However, despite its popularity, the algorithm has certain limitations, including problems associated with random initialization of the centroids which leads to unexpected convergence. Additionally, such a clustering algorithm requires the number of clusters to be defined beforehand, which is responsible for different cluster shapes and outlier effects. A fundamental problem of the k-means algorithm is its inability to handle various data types. This paper provides a structured and synoptic overview of research conducted on the k-means algorithm to overcome such shortcomings. Variants of the k-means algorithms including their recent developments are discussed, where their effectiveness is investigated based on the experimental analysis of a variety of datasets. The detailed experimental analysis along with a thorough comparison among different k-means clustering algorithms differentiates our work compared to other existing survey papers. Furthermore, it outlines a clear and thorough understanding of the k-means algorithm along with its different research directions.
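To make the initialization issue concrete, the following is a minimal sketch of the standard (Lloyd's) k-means iteration in pure Python. It is not code from the survey; the function names `kmeans` and `inertia`, the `seed` and `max_iter` parameters, and the choice to keep the old centroid for an empty cluster are all assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's k-means on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_labels == labels and new_centroids == centroids:
            break  # nothing changed, so the iteration has converged
        labels, centroids = new_labels, new_centroids
    return centroids, labels

def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
```

Running `kmeans` with several values of `seed` and comparing `inertia` is one simple way to observe how the starting centroids can change the result on less well-separated data. Note that `k` must be supplied by the caller, which is the second limitation the abstract mentions.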

Information ◽  
2021 ◽  
Vol 12 (6) ◽  
pp. 232
Author(s):  
Janneth Chicaiza ◽  
Priscila Valdiviezo-Diaz

In recent years, the use of recommender systems has become popular on the web. To improve recommendation performance, usage, and scalability, the research has evolved by producing several generations of recommender systems. There is much literature about it, although most proposals focus on traditional methods’ theories and applications. Recently, knowledge graph-based recommendations have attracted attention in academia and the industry because they can alleviate information sparsity and performance problems. We found only two studies that analyze the recommendation system’s role over graphs, but they focus on specific recommendation methods. This survey attempts to cover a broader analysis from a set of selected papers. In summary, the contributions of this paper are as follows: (1) we explore traditional and more recent developments of filtering methods for a recommender system, (2) we identify and analyze proposals related to knowledge graph-based recommender systems, (3) we present the most relevant contributions using an application domain, and (4) we outline future directions of research in the domain of recommender systems. As the main survey result, we found that the use of knowledge graphs for recommendations is an efficient way to leverage and connect a user’s and an item’s knowledge, thus providing more precise results for users.


Data Mining ◽  
2013 ◽  
pp. 1916-1935
Author(s):  
Mingming Zhou ◽  
Yabo Xu

A wealth of research has shown that meta-cognition plays a crucial role in the promotion of effective school learning. In most e-learning environment designs, however, meta-cognitive strategies have generally been neglected, and therefore satisfactory uses of these strategies have rarely been realized. Most learners are not even aware of what they have been studying. If the learning system could automatically guide and intelligently recommend learning activities or strategies to facilitate student monitoring and control of their learning, it would favor and improve their learning process and performance. Unfortunately, almost no e-learning systems to date have attempted to do so. In this chapter, we first describe the need for enhancing meta-cognitive skills in e-learning environments, followed by an outline of major challenges for meta-cognitive activity recommendations. We then propose to adopt data mining algorithms (i.e., content-based and sequence-based recommendation techniques) to address the identified issues, illustrated with a toy example.


2020 ◽  
Vol 12 (23) ◽  
pp. 4007
Author(s):  
Kasra Rafiezadeh Shahi ◽  
Pedram Ghamisi ◽  
Behnood Rasti ◽  
Robert Jackisch ◽  
Paul Scheunders ◽  
...  

The increasing amount of information acquired by imaging sensors in Earth Sciences results in the availability of a multitude of complementary data (e.g., spectral, spatial, elevation) for monitoring of the Earth’s surface. Many studies were devoted to investigating the usage of multi-sensor data sets in the performance of supervised learning-based approaches at various tasks (i.e., classification and regression) while unsupervised learning-based approaches have received less attention. In this paper, we propose a new approach to fuse multiple data sets from imaging sensors using a multi-sensor sparse-based clustering algorithm (Multi-SSC). A technique for the extraction of spatial features (i.e., morphological profiles (MPs) and invariant attribute profiles (IAPs)) is applied to high spatial-resolution data to derive the spatial and contextual information. This information is then fused with spectrally rich data such as multi- or hyperspectral data. In order to fuse multi-sensor data sets a hierarchical sparse subspace clustering approach is employed. More specifically, a lasso-based binary algorithm is used to fuse the spectral and spatial information prior to automatic clustering. The proposed framework ensures that the generated clustering map is smooth and preserves the spatial structures of the scene. In order to evaluate the generalization capability of the proposed approach, we investigate its performance not only on diverse scenes but also on different sensors and data types. The first two data sets are geological data sets, which consist of hyperspectral and RGB data. The third data set is the well-known benchmark Trento data set, including hyperspectral and LiDAR data. Experimental results indicate that this novel multi-sensor clustering algorithm can provide an accurate clustering map compared to the state-of-the-art sparse subspace-based clustering algorithms.


Author(s):  
Jie Dong ◽  
Min-Feng Zhu ◽  
Yong-Huan Yun ◽  
Ai-Ping Lu ◽  
Ting-Jun Hou ◽  
...  

Background: With the increasing development of biotechnology and information technology, publicly available data in chemistry and biology are undergoing explosive growth. The wealth of information in these resources needs to be extracted and then transformed into useful knowledge by various data mining methods. However, a main computational challenge is how to effectively represent or encode the molecular objects under investigation, such as chemicals, proteins, DNAs and even complicated interactions, when data mining methods are employed. To further explore these complicated data, an integrated toolkit to represent different types of molecular objects and support various data mining algorithms is urgently needed. Results: We developed a freely available R/CRAN package, called BioMedR, for molecular representations of chemicals, proteins, DNAs and pairwise samples of their interactions. The current version of BioMedR can calculate 293 molecular descriptors and 13 kinds of molecular fingerprints for small molecules, 9920 protein descriptors based on protein sequences and six types of generalized scale-based descriptors for proteochemometric modeling, more than 6000 DNA descriptors from nucleotide sequences, and six types of interaction descriptors using three different combining strategies. Moreover, the package implements five similarity calculation methods and four powerful clustering algorithms as well as several useful auxiliary tools, with the aim of building an integrated analysis pipeline for data acquisition, data checking, descriptor calculation and data modeling. Conclusion: BioMedR provides a comprehensive and uniform R package to link up different representations of molecular objects with each other and will benefit cheminformatics/bioinformatics and other biomedical users. It is available at: https://CRAN.R-project.org/package=BioMedR and https://github.com/wind22zhu/BioMedR/.


2020 ◽  
Vol 8 (6) ◽  
pp. 1973-1979

The performance of data mining algorithms becomes a main concern as data volumes grow. Cluster analysis is an active and much-debated research direction in data mining for complex data samples. DBSCAN is a density-based clustering algorithm with several advantages in numerous applications. However, DBSCAN has quadratic time complexity, which makes it impractical for realistic applications, particularly with huge, complex data samples. Therefore, this paper proposes a hybrid approach that reduces the time complexity by exploring the core properties of DBSCAN in the initial stage using a genetic-based K-means partition algorithm. The experiments showed that the proposed hybrid approach obtains competitive results when compared with the usual approach and drastically improves the computational time.
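As a rough illustration of where the quadratic cost comes from, here is a minimal sketch of textbook DBSCAN without any spatial index. It is not the hybrid method of the paper above; the names `region_query`, `dbscan`, `eps`, `min_pts` and `NOISE` are assumptions chosen for this example. Each point triggers at most one linear scan over all points, which gives roughly n × n distance computations.

```python
import math

NOISE = -1  # label used for points that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (linear scan, O(n))."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Naive DBSCAN; returns one cluster label per point (NOISE for outliers)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

Partitioning the data first, as the abstract describes, is one way to shrink the set each scan has to cover; a spatial index is another common option, though neither is shown here.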


The exponential increase in universities’ electronic data creates the need to derive useful information from these massive amounts of data. Progress in the data mining field makes it possible to use educational data to improve the quality of educational processes. This study thus uses data mining methods to study the learning behavior and performance of university students. It focuses on two aspects of student performance: first, predicting students' learning behavior at the end of a complete year of the study program; second, predicting student performance with the help of the data model proposed by this study. Finally, it provides course material recommendations using the data mining algorithm. Three data mining algorithms were considered, namely K-Means, FCM, and KFCM, and a maximum accuracy of 90.22% was achieved by KFCM. The study indicates that in terms of time and memory usage the K-means algorithm gives better results. This creates an opportunity for identifying students who may graduate with poor results or may not graduate at all, so that early intervention might be possible.


2013 ◽  
Vol 401-403 ◽  
pp. 1440-1443 ◽  
Author(s):  
Tie Feng Zhang ◽  
Fei Lv ◽  
Rong Gu

Distance choice is an important issue in power load pattern extraction using clustering techniques, so it is necessary to determine how different distances in clustering algorithms influence the clustering result for load curves. In this paper, several distances are used in the k-means algorithm for clustering load curves, and their influence on the clustering results is analyzed in order to identify a suitable distance for the k-means algorithm. An example with the load curves of 147 electricity customers shows that distances have different influences on clustering results using the same clustering algorithm. The comparison results indicate that the choice of distance is an important issue in power load pattern extraction using clustering techniques and that a suitable distance may improve the accuracy of mining algorithms.
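A small sketch can show why the distance matters in the assignment step of k-means. The code below is an illustration only, not the authors' method: the function names, the three distances chosen, and the two-value "curves" are assumptions, and only the assignment step is shown, since the usual mean update is tied to the squared Euclidean distance.

```python
def euclidean(p, q):
    """Straight-line distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))

def assign(curves, centroids, dist):
    """Index of the nearest centroid for each curve under the given distance."""
    return [
        min(range(len(centroids)), key=lambda j: dist(c, centroids[j]))
        for c in curves
    ]

# Hypothetical two-value curves, used only to show that the choice matters.
curves = [(0.0, 0.0)]
centroids = [(3.0, 3.0), (0.0, 5.0)]
by_euclidean = assign(curves, centroids, euclidean)  # nearest is centroid 0
by_manhattan = assign(curves, centroids, manhattan)  # nearest is centroid 1
```

Here the same curve is assigned to different centroids depending on the distance: the Euclidean distances are about 4.24 and 5, while the Manhattan distances are 6 and 5.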


Author(s):  
G. Ramadevi ◽  
Srujitha Yeruva ◽  
P. Sravanthi ◽  
P. Eknath Vamsi ◽  
S. Jaya Prakash

In a digitized world, data is growing exponentially, and it is difficult to analyze the data and report results. Data mining techniques play an important role in big data for the healthcare sector. By making use of data mining algorithms, it is possible to analyze, detect and predict the presence of disease, which helps doctors with early detection and decision making. The objective of the data mining techniques used is to design an automated tool that reports a patient’s treatment history, disease and medical data to doctors. Data mining techniques are very useful in analyzing medical data to find meaningful and practical patterns. This project works on diabetes medical data; classification and clustering algorithms (OPTICS, Naive Bayes, and BIRCH) are implemented and their efficiency is examined.


2018 ◽  
Vol 2 (4) ◽  
Author(s):  
Pengfei Zhang ◽  
Hwee-Pink Tan ◽  
Gaoxi Xiao

Motivated by recent developments in Wireless Sensor Networks (WSNs), we present distributed clustering algorithms for maximizing the lifetime of WSNs, i.e., the duration till the first node dies. We study the joint problem of prolonging network lifetime by introducing clustering techniques and energy-harvesting (EH) nodes. Firstly, we propose a distributed clustering algorithm for maximizing the lifetime of a clustered WSN, which includes EH nodes serving as relay nodes for cluster heads (CHs). Secondly, graph-based and LP-based EH-CH matching algorithms are proposed, which serve as benchmark algorithms. Extensive simulation results show that the proposed algorithms can achieve optimal or suboptimal solutions efficiently.

