Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms

2019 ◽  
Vol 1 (2) ◽  
pp. 164-177
Author(s):  
Bambang Krismono Triwijoyo ◽  
Kartarina Kartarina

Clustering is a useful technique that organizes a large number of non-sequential text documents into a small number of meaningful and coherent clusters. Effective and efficient organization of documents is needed to enable intuitive and informative tracking mechanisms. In this paper, we propose document clustering using cosine similarity and k-main. The experimental results show that the accuracy of our method is 84.3%.
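As a rough illustration of the similarity measure involved, the sketch below computes cosine similarity between two bag-of-words document vectors. This is a minimal stand-alone example; the function name and tokenization are our own assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents given as token lists."""
    a, b = Counter(doc_a), Counter(doc_b)
    # Dot product over the shared vocabulary only.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two of the three terms are shared, so the similarity is 2/3.
sim = cosine_similarity("the cat sat".split(), "the cat ran".split())
```

In a k-means-style loop, each document would be assigned to the cluster whose centroid it is most similar to under this measure.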

Author(s):  
Laith Mohammad Abualigah ◽  
Essam Said Hanandeh ◽  
Ahamad Tajudin Khader ◽  
Mohammed Abdallh Otair ◽  
Shishir Kumar Shandilya

Background: Considering the increasing volume of text document information on Internet pages, dealing with such a tremendous amount of knowledge becomes highly complex due to its sheer size. Text clustering is a common optimization problem used to organize a large amount of text information into a set of comparable and coherent clusters. Aims: This paper presents a novel local clustering technique, namely, β-hill climbing, to solve the text document clustering problem by modeling the β-hill climbing technique to partition similar documents into the same cluster. Methods: The β parameter is the primary innovation in the β-hill climbing technique. It was introduced in order to strike a balance between local and global search. Local search methods, such as the k-medoid and k-means techniques, have been successfully applied to the text document clustering problem. Results: Experiments were conducted on eight benchmark standard text datasets with different characteristics taken from the Laboratory of Computational Intelligence (LABIC). The results proved that the proposed β-hill climbing achieved better results than the original hill climbing technique in solving the text clustering problem. Conclusion: The performance of text clustering is improved by adding the β operator to hill climbing.
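The β operator described above can be sketched on a generic continuous objective as follows. All names, parameters, and the toy objective are illustrative assumptions, not the authors' code; for text clustering, the vector would instead encode an assignment of documents to clusters.

```python
import random

def beta_hill_climbing(objective, start, low, high,
                       beta=0.05, bw=0.1, iters=2000, seed=1):
    """Minimise `objective` over a real vector constrained to [low, high]."""
    rng = random.Random(seed)
    best = list(start)
    best_f = objective(best)
    for _ in range(iters):
        cand = list(best)
        # N-operator: a small local step on one randomly chosen coordinate.
        i = rng.randrange(len(cand))
        cand[i] = min(high, max(low, cand[i] + rng.uniform(-bw, bw)))
        # beta-operator: re-randomise each coordinate with probability beta,
        # balancing local search with occasional global exploration.
        for j in range(len(cand)):
            if rng.random() < beta:
                cand[j] = rng.uniform(low, high)
        f = objective(cand)
        if f < best_f:  # greedy acceptance, as in plain hill climbing
            best, best_f = cand, f
    return best, best_f

# Toy run on the sphere function, starting from (4, -3).
sol, val = beta_hill_climbing(lambda v: sum(t * t for t in v),
                              [4.0, -3.0], -5.0, 5.0)
```

With β = 0 this degenerates to ordinary hill climbing; larger β trades exploitation for exploration.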


2021 ◽  
pp. 2150151
Author(s):  
Dasong Sun

By clustering feature words, we can not only reduce the dimension of feature subsets but also eliminate feature redundancy. However, for a feature set with very large dimensions, it is difficult for the traditional k-medoids algorithm to accurately estimate the value of k. Moreover, the clustering results of the average linkage (AL) algorithm cannot be divided again, and the AL algorithm cannot be used directly for text classification. In order to overcome the limitations of AL and k-medoids, in this paper we combine the two algorithms so that they complement each other. In particular, to meet the purpose of text classification, we improve the AL algorithm and propose the [Formula: see text] testing statistic to obtain the approximate number of clusters. Finally, the central feature words are preserved and the other feature words are deleted. The experimental results show that the new algorithm largely eliminates feature redundancy. Compared with traditional TF-IDF algorithms, the text classification performance of the new algorithm is improved.
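A minimal sketch of the average-linkage step and the retention of central feature words (medoids) might look as follows. This is a toy illustration under our own assumptions, not the paper's combined algorithm, and it omits the testing statistic used to estimate the number of clusters.

```python
def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering with average linkage.
    `dist[i][j]` is the distance between feature words i and j."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = (sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                     / (len(clusters[a]) * len(clusters[b])))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters

def medoid(cluster, dist):
    """The feature with the smallest total distance to its cluster mates."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))

# Toy distance matrix over four feature words: {0, 1} close, {2, 3} close.
D = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
groups = average_linkage(D, 2)
kept = sorted(medoid(c, D) for c in groups)  # central word of each cluster
```

Only the medoids (`kept`) would survive as features; the rest are treated as redundant and deleted.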


Author(s):  
Harsha Patil ◽  
R. S. Thakur

As we know, use of the Internet is flourishing at full velocity and in all dimensions. The enormous availability of text documents in digital form (email, web pages, blog posts, news articles, e-books and other text files) on the Internet challenges technology to retrieve the appropriate documents in response to any search query. As a result, there has been an eruption of interest in mining these vast resources and classifying them properly, which invigorates researchers and developers to work on numerous approaches to document clustering. The aim of this chapter is to summarize the different document clustering algorithms used by researchers.


Author(s):  
Frank Rehm ◽  
Roland Winkler ◽  
Rudolf Kruse

A well-known issue with prototype-based clustering is the user's obligation to know the right number of clusters in a dataset in advance, or to determine it as part of the data analysis process. There are different approaches to cope with this non-trivial problem. This chapter follows the approach of addressing it as an integrated part of the clustering process. An extension to repulsive fuzzy c-means clustering is proposed that equips non-Euclidean prototypes with repulsive properties. Experimental results are presented that demonstrate the feasibility of the authors' technique.
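For context, a bare-bones fuzzy c-means loop on 1-D data is sketched below. The chapter's actual contribution, the repulsion term between prototypes, is deliberately omitted, and all names here are our own assumptions rather than the authors' code.

```python
def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    """Plain fuzzy c-means on 1-D data (no repulsion term)."""
    # Deterministic start: spread prototypes evenly over the data range.
    lo, hi = min(points), max(points)
    centers = [lo + (hi - lo) * i / (c - 1) for i in range(c)]
    for _ in range(iters):
        # Membership update: inverse relative (squared) distance to prototypes.
        u = []
        for x in points:
            d = [max(1e-12, (x - v) ** 2) for v in centers]
            u.append([1.0 / sum((di / dj) ** (1.0 / (m - 1)) for dj in d)
                      for di in d])
        # Prototype update: membership-weighted mean of the points.
        centers = [sum(u[k][i] ** m * points[k] for k in range(len(points)))
                   / sum(u[k][i] ** m for k in range(len(points)))
                   for i in range(c)]
    return centers

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
ctrs = sorted(fuzzy_c_means(pts))  # prototypes settle near the two groups
```

In the repulsive variant, surplus prototypes that would otherwise collapse onto the same cluster push each other away, which is what lets the method cope with an overestimated number of clusters.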


Author(s):  
Brinardi Leonardo ◽  
Seng Hansun

Plagiarism is an act that universities consider fraudulent: taking someone's ideas or writings without citing the references and claiming them as one's own. A plagiarism detection system generally implements a string-matching algorithm over text documents to search for common words between documents. Among the algorithms used for string matching are the Rabin-Karp and Jaro-Winkler Distance algorithms. The Rabin-Karp algorithm is well suited to the problem of matching multiple string patterns, while the Jaro-Winkler Distance algorithm has advantages in terms of time. A plagiarism detection application was developed and tested on different types of documents, i.e. doc, docx, pdf and txt. From the experimental results, we found that both algorithms can be used to perform plagiarism detection on those documents, but in terms of effectiveness, the Rabin-Karp algorithm is more effective and faster when detecting documents larger than 1000 KB.
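The Rabin-Karp idea mentioned above, rolling hashes with verification on hash hits, can be sketched as follows. This is a generic textbook implementation, not the application's code; the base and modulus are arbitrary choices.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_003):
    """Return every starting index of `pattern` in `text` via rolling hashes."""
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return []
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    p_hash = t_hash = 0
    for i in range(k):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - k + 1):
        # On a hash match, verify the substring to rule out collisions.
        if t_hash == p_hash and text[i:i + k] == pattern:
            hits.append(i)
        if i < n - k:  # slide the window one character to the right
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + k])) % mod
    return hits

matches = rabin_karp("the cat sat on the mat", "the")  # [0, 15]
```

Because the hash of each window is derived from the previous one in O(1), the same text scan can cheaply check many patterns, which is why Rabin-Karp suits multi-pattern plagiarism checks.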


2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Yongli Liu ◽  
Xiaoyang Zhang ◽  
Jingli Chen ◽  
Hao Chao

Because traditional fuzzy clustering validity indices need the number of clusters to be specified and are sensitive to noisy data, we propose a validity index for fuzzy clustering, named CSBM (compactness separateness bipartite modularity), based on bipartite modularity. CSBM enhances robustness by combining intraclass compactness and interclass separateness, and it can automatically determine the optimal number of clusters. In order to estimate the performance of CSBM, we carried out experiments on six real datasets and compared CSBM with six other prominent indices. Experimental results show that the CSBM index performs best in terms of robustness while accurately detecting the number of clusters.
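In the same spirit of combining intraclass compactness with interclass separateness, a deliberately simplified validity ratio on 1-D data is sketched below. This is not the CSBM formula, merely an illustration of how such an index prefers tight, well-separated clusterings (lower is better in this toy version).

```python
def validity(points, labels, centers):
    """Mean within-cluster distance divided by the smallest between-center
    distance; lower values indicate compact, well-separated clusters."""
    n = len(points)
    compact = sum(abs(points[i] - centers[labels[i]]) for i in range(n)) / n
    separate = min(abs(a - b) for i, a in enumerate(centers)
                   for b in centers[i + 1:])
    return compact / separate

# A tight, well-separated partition scores lower than a scrambled one.
good = validity([0.0, 0.2, 5.0, 5.2], [0, 0, 1, 1], [0.1, 5.1])
bad = validity([0.0, 0.2, 5.0, 5.2], [0, 1, 0, 1], [2.5, 2.6])
```

Scanning such a score over candidate cluster counts and picking the optimum is the usual way a validity index determines the number of clusters automatically.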


1992 ◽  
Vol 293 ◽  
Author(s):  
Laura E. Depero ◽  
Marcello Zocchi ◽  
Fulvio Parmigiani

On the basis of a previously described structural model for Yttria-Stabilized Zirconia, the conductivity of this material can be calculated. A fixed number of clusters is generated in a structure of 70×28 polyhedra to simulate different Y contents in the structure. The results are compared with the experimental results obtained by others and discussed.


2012 ◽  
Vol 532-533 ◽  
pp. 1373-1377 ◽  
Author(s):  
Ai Ping Deng ◽  
Ben Xiao ◽  
Hui Yong Yuan

To address the K-means algorithm's disadvantages of having to obtain the number of clusters in advance and its sensitivity to the selection of initial clustering centers, an improved K-means algorithm is proposed in which the cluster centers and the number of clusters change dynamically. The new algorithm determines the cluster centers by calculating the density of data points and shared nearest neighbor similarity, and it controls the clustering categories by using the average shared nearest neighbor self-similarity. The experimental results on the IRIS testing data set show that the algorithm can select the cluster centers and can efficiently distinguish between different types of clusters.
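The shared-nearest-neighbor similarity used for center selection can be illustrated on toy 1-D data as follows. This is our own minimal sketch of the SNN notion, not the paper's full algorithm.

```python
def knn_lists(points, k):
    """Indices of each point's k nearest neighbours (1-D toy data)."""
    idx = range(len(points))
    return [sorted((j for j in idx if j != i),
                   key=lambda j: abs(points[i] - points[j]))[:k]
            for i in idx]

def snn_similarity(nbrs, i, j):
    """Shared-nearest-neighbour similarity: neighbours i and j have in common."""
    return len(set(nbrs[i]) & set(nbrs[j]))

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
nbrs = knn_lists(pts, 2)
same_group = snn_similarity(nbrs, 0, 1)   # 1: dense neighbours overlap
diff_group = snn_similarity(nbrs, 0, 3)   # 0: no shared neighbours
```

Points with many shared neighbors sit inside dense regions, which is why combining point density with SNN similarity is a reasonable basis for picking cluster centers without fixing their number in advance.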

