An Incremental Document Clustering Algorithm Based on a Hierarchical Agglomerative Approach

Author(s):  
Kil Hong Joo ◽  
SooJung Lee
2021 ◽  
Vol 12 (4) ◽  
pp. 169-185
Author(s):  
Saida Ishak Boushaki ◽  
Omar Bendjeghaba ◽  
Nadjet Kamel

Clustering is an important unsupervised analysis technique for big data mining. It finds its application in several domains including biomedical documents of the MEDLINE database. Document clustering algorithms based on metaheuristics is an active research area. However, these algorithms suffer from the problems of getting trapped in local optima, need many parameters to adjust, and the documents should be indexed by a high dimensionality matrix using the traditional vector space model. In order to overcome these limitations, in this paper a new documents clustering algorithm (ASOS-LSI) with no parameters is proposed. It is based on the recent symbiotic organisms search metaheuristic (SOS) and enhanced by an acceleration technique. Furthermore, the documents are represented by semantic indexing based on the famous latent semantic indexing (LSI). Conducted experiments on well-known biomedical documents datasets show the significant superiority of ASOS-LSI over five famous algorithms in terms of compactness, f-measure, purity, misclassified documents, entropy, and runtime.


Author(s):  
U. K. Sridevi ◽  
N. Nagaveni

Clustering is an important topic to find relevant content from a document collection and it also reduces the search space. The current clustering research emphasizes the development of a more efficient clustering method without considering the domain knowledge and user’s need. In recent years the semantics of documents have been utilized in document clustering. The discussed work focuses on the clustering model where ontology approach is applied. The major challenge is to use the background knowledge in the similarity measure. This paper presents an ontology based annotation of documents and clustering system. The semi-automatic document annotation and concept weighting scheme is used to create an ontology based knowledge base. The Particle Swarm Optimization (PSO) clustering algorithm can be applied to obtain the clustering solution. The accuracy of clustering has been computed before and after combining ontology with Vector Space Model (VSM). The proposed ontology based framework gives improved performance and better clustering compared to the traditional vector space model. The result using ontology was significant and promising.


2016 ◽  
Vol 10 (04) ◽  
pp. 527-555
Author(s):  
Lubomir Stanchev

In this article, we examine an algorithm for document clustering using a similarity graph. The graph stores words and common phrases from the English language as nodes and it can be used to compute the degree of semantic similarity between any two phrases. One application of the similarity graph is semantic document clustering, that is, grouping documents based on the meaning of the words in them. Since our algorithm for semantic document clustering relies on multiple parameters, we examine how fine-tuning these values affects the quality of the result. Specifically, we use the Reuters-21578 benchmark, which contains [Formula: see text] newswire stories that are grouped in 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using a similarity metric that is based on keywords matching and one that uses the similarity graph. We evaluate the results of the clustering algorithms using multiple metrics, such as precision, recall, f-score, entropy, and purity.


Sign in / Sign up

Export Citation Format

Share Document