Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment

Svetlana S. Bodrunova; Andrey V. Orekhov; Ivan S. Blekanov; Nikolay S. Lyudkevich; Nikita A. Tarasov

doi:10.3390/fi12090144

Topic Detection Based on Sentence Embeddings and Agglomerative Clustering with Markov Moment

Future Internet ◽

10.3390/fi12090144 ◽

2020 ◽

Vol 12 (9) ◽

pp. 144 ◽

Cited By ~ 1

Author(s):

Svetlana S. Bodrunova ◽

Andrey V. Orekhov ◽

Ivan S. Blekanov ◽

Nikolay S. Lyudkevich ◽

Nikita A. Tarasov

Keyword(s):

Text Classification ◽

Optimal Number ◽

Agglomerative Clustering ◽

Number Of Clusters ◽

Universal Sentence ◽

Novel Approach ◽

Clustering Quality ◽

Set Up ◽

Markov Moment ◽

Optimal Number Of Clusters

This paper addresses the problem of optimal text classification in the automated detection of text typology. In conventional approaches to topicality-based text classification (including topic modeling), the number of clusters must be set by the scholar, and both the optimal number of clusters and the quality of the model that measures the proximity of texts to each other remain unresolved questions. We propose a novel approach to automatically determining the optimal number of clusters that also incorporates an assessment of the word proximity of texts, combined with a text encoding model based on sentence embeddings. Our approach combines Universal Sentence Encoder (USE) data pre-processing, agglomerative hierarchical clustering by Ward’s method, and the Markov stopping moment for optimal clustering. The preferred number of clusters is determined based on the “e-2” hypothesis. We set up an experiment on two datasets of real-world labeled data: News20 and BBC. We test the proposed model against more traditional text representation methods, such as bag-of-words and word2vec, and show that it yields much higher quality than the baseline DBSCAN and OPTICS models with different encoding methods. We use three quality metrics to demonstrate that clustering quality does not drop when the number of clusters grows. Thus, we get close to the convergence of text clustering and text classification.
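
The clustering step described in this abstract can be illustrated with a minimal sketch. It runs Ward-linkage agglomerative clustering on plain numeric vectors and stops merging once the cost of the next merge jumps sharply. The jump-ratio rule, the `jump` parameter, and the function names are assumptions made for illustration only: they stand in for the paper's Markov stopping moment and are not that method, and the inputs here are toy vectors rather than USE embeddings.

```python
from itertools import combinations


def centroid(points):
    """Mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_with_stop(points, jump=5.0):
    """Greedy Ward merging with a hypothetical stand-in stopping rule:
    stop when the next merge costs more than `jump` times the previous one."""
    clusters = [[p] for p in points]
    prev_cost = None
    while len(clusters) > 1:
        # Pick the pair of clusters whose merge is cheapest under Ward's criterion.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        cost = ward_cost(clusters[i], clusters[j])
        if prev_cost is not None and prev_cost > 0 and cost > jump * prev_cost:
            break  # sharp jump: treat the current clusters as final
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
        prev_cost = cost
    return clusters


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for cluster in ward_with_stop(toy):
        print(cluster)
```

The number of clusters is never passed in; it falls out of where the merging stops, which is the property the abstract emphasizes. A real pipeline would replace the toy tuples with sentence-embedding vectors and the jump rule with the stopping criterion defined in the paper.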

Categorization of Mouse Ultrasonic Vocalizations Using Machine Learning Techniques

Acoustics ◽

10.3390/acoustics1040050 ◽

2019 ◽

Vol 1 (4) ◽

pp. 837-846

Author(s):

Spyros Kouzoupis ◽

Andreas Neocleous ◽

Irene Athanassakis

Keyword(s):

Clustering Algorithms ◽

Optimal Number ◽

Ultrasonic Vocalizations ◽

Machine Learning Techniques ◽

Agglomerative Clustering ◽

Number Of Clusters ◽

Unsupervised Analysis ◽

Learning Techniques ◽

Pitch Contours ◽

Optimal Number Of Clusters

This study examines the ultrasonic vocalizations of several adult male BALB/c mice in the presence of a female. A total of 179 distinct ultrasonic syllables, referred to as “phonemes”, are isolated, and k-means and agglomerative clustering algorithms are applied to the resulting dataset to group the vocalizations into clusters based on features extracted from their pitch contours. The elbow method was used to find the optimal number of clusters, and nine distinct categories were obtained. The k-means results are presented as a matching matrix, while the agglomerative clustering results are presented as a dendrogram. The results of both methods are in line with the manual annotations made by the authors, as well as with those presented in the literature. The two methods of unsupervised analysis, applied to 14-element feature vectors, provide evidence that vocalizations can be grouped into nine clusters, which supports the claim that there is a distinct repertoire of “syllables” or “phonemes”.
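
The elbow method mentioned in this abstract can be sketched roughly as follows. The code runs a small k-means for each candidate k, records the within-cluster sum of squares (inertia), and returns the first k after which adding a cluster no longer helps much. The deterministic farthest-first seeding, the `min_gain` threshold, and all function names are assumptions chosen to keep the example reproducible; the study does not specify how its elbow was located, and the toy points are not its 14-element pitch-contour features.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centres chosen so far."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    return centres


def kmeans_inertia(points, k, iters=50):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centres = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            groups[nearest].append(p)
        # An empty group keeps its previous centre.
        new = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
               for g, c in zip(groups, centres)]
        if new == centres:
            break
        centres = new
    return sum(min(sq_dist(p, c) for c in centres) for p in points)


def elbow(points, k_max, min_gain=0.1):
    """Return the first k where going to k + 1 clusters reduces inertia by
    less than `min_gain` times the inertia at k = 1 (an assumed heuristic)."""
    inertias = [kmeans_inertia(points, k) for k in range(1, k_max + 1)]
    for k in range(1, k_max):
        if inertias[k - 1] - inertias[k] < min_gain * inertias[0]:
            return k
    return k_max
```

In practice the elbow is often read off a plot of inertia against k; the threshold rule above is only one simple way to automate that reading.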

Method for determining optimal number of clusters in K-means clustering algorithm

Journal of Computer Applications ◽

10.3724/sp.j.1087.2010.01995 ◽

2010 ◽

Vol 30 (8) ◽

pp. 1995-1998 ◽

Cited By ~ 18

Author(s):

Shi-bing ZHOU ◽

Zhen-yuan XU ◽

Xu-qing TANG

Keyword(s):

Clustering Algorithm ◽

Optimal Number ◽

Number Of Clusters ◽

Optimal Number Of Clusters

Clustering Count-based RNA Methylation Data Using a Nonparametric Generative Model

Current Bioinformatics ◽

10.2174/1574893613666180601080008 ◽

2018 ◽

Vol 14 (1) ◽

pp. 11-23 ◽

Cited By ~ 3

Author(s):

Lin Zhang ◽

Yanling He ◽

Huaizhi Wang ◽

Hui Liu ◽

Yufei Huang ◽

...

Keyword(s):

Clustering Analysis ◽

Methylation Level ◽

Optimal Number ◽

Generative Model ◽

Methylation Data ◽

Sequencing Data ◽

Number Of Clusters ◽

Rna Methylation ◽

Clustering Effect ◽

Optimal Number Of Clusters

Background: The RNA methylome has emerged as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuit of the epitranscriptome remains uncharted, a clustering effect in methylation status among different RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect epitranscriptomic regulation. Count-based RNA methylation sequencing data have unique features, such as low read coverage, which call for novel clustering approaches. Objective: Besides handling low read coverage, clustering analysis of count-based RNA methylation sequencing data must also preserve the integer nature of the measurements. Method: We propose a nonparametric generative model, together with its Gibbs sampling solution, for clustering analysis. The approach implements a beta-binomial mixture model to capture the clustering effect in methylation level from the original count-based measurements rather than from an estimated continuous methylation level. In addition, it adopts a nonparametric Dirichlet process to determine an optimal number of clusters automatically, avoiding the common model selection problem in clustering analysis. Results: When tested on a simulated system, the method demonstrated improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. On a real dataset, it also revealed two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, two known regulatory components of the RNA m6A methyltransferase complex. Conclusion: The proposed DPBBM method not only properly handles count-based measurements of RNA methylation data from sites of very low read coverage, but also learns an optimal number of clusters adaptively from the data analyzed.
Availability: The source code and documents of the DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
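
To make the beta-binomial mixture idea concrete, the sketch below computes the beta-binomial log-probability of observing k methylated reads out of n, and the posterior responsibility of each mixture component for a single site. It is an illustration under stated assumptions, not the DPBBM package's API: the function names are hypothetical, the number of components and their parameters are fixed by hand, and the Dirichlet process prior and Gibbs sampler from the abstract are omitted.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_logpmf(k, n, alpha, beta):
    """Log-probability of k methylated reads out of n total reads when the
    methylation level follows a Beta(alpha, beta) distribution."""
    return log_choose(n, k) + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def responsibilities(k, n, components):
    """Posterior probability of each mixture component for one site.
    `components` is a list of (weight, alpha, beta) tuples."""
    log_terms = [log(w) + beta_binomial_logpmf(k, n, a, b) for w, a, b in components]
    top = max(log_terms)  # subtract the maximum for numerical stability
    unnorm = [exp(t - top) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]


if __name__ == "__main__":
    # Two hand-picked components: mostly methylated vs. mostly unmethylated.
    mix = [(0.5, 8.0, 2.0), (0.5, 2.0, 8.0)]
    print(responsibilities(9, 10, mix))
```

Working on the raw counts (k, n) rather than the ratio k / n is what lets low-coverage sites contribute appropriately: a site with 1 of 2 reads methylated yields much flatter responsibilities than a site with 50 of 100.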

Estimating the Optimal Number of Clusters Via Internal Validity Index

Neural Processing Letters ◽

10.1007/s11063-021-10427-8 ◽

2021 ◽

Author(s):

Shibing Zhou ◽

Fei Liu ◽

Wei Song

Keyword(s):

Internal Validity ◽

Optimal Number ◽

Validity Index ◽

Number Of Clusters ◽

Optimal Number Of Clusters

Improved Self-Adaptive ACS Algorithm to Determine the Optimal Number of Clusters

International Journal on Advanced Science Engineering and Information Technology ◽

10.18517/ijaseit.11.3.11723 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1092

Author(s):

Ayad Mohammed Jabbar ◽

Ku Ruhana Ku-Mahamud ◽

Rafid Sagban

Keyword(s):

Optimal Number ◽

Number Of Clusters ◽

Acs Algorithm ◽

Optimal Number Of Clusters ◽

Self Adaptive

Fast Search Algorithm for Determining the Optimal Number of Clusters using Cluster Validity Index

The Journal of the Korea Contents Association ◽

10.5392/jkca.2009.9.9.080 ◽

2009 ◽

Vol 9 (9) ◽

pp. 80-89 ◽

Cited By ~ 1

Author(s):

Sang-Wook Lee

Keyword(s):

Search Algorithm ◽

Optimal Number ◽

Cluster Validity ◽

Cluster Validity Index ◽

Validity Index ◽

Number Of Clusters ◽

Fast Search ◽

Fast Search Algorithm ◽

Optimal Number Of Clusters

A heuristic method for finding the optimal number of clusters with application in medical data

2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society ◽

10.1109/iembs.2008.4650258 ◽

2008 ◽

Cited By ~ 3

Author(s):

Hamidreza Bayati ◽

Heydar Davoudi ◽

Emad Fatemizadeh

Keyword(s):

Heuristic Method ◽

Optimal Number ◽

Medical Data ◽

Number Of Clusters ◽

Optimal Number Of Clusters

Investigating cluster validation metrics for optimal number of clusters determination

Intelligent Decision Technologies ◽

10.3233/idt-210187 ◽

2021 ◽

pp. 1-16

Author(s):

Aikaterini Karanikola ◽

Charalampos M. Liapis ◽

Sotiris Kotsiantis

Keyword(s):

Real World ◽

Optimal Number ◽

Cluster Validation ◽

Clustering Methods ◽

Number Of Clusters ◽

Validity Indices ◽

Selection Of ◽

Specific Distance ◽

Optimal Number Of Clusters

In short, clustering is the process of partitioning a given set of objects into groups containing highly related instances. This relation is determined by a specific distance metric with which the intra-cluster similarity is estimated. Finding an optimal number of such partitions is usually the key step in the entire process, yet a rather difficult one. Selecting an unsuitable number of clusters might lead to incorrect conclusions and, consequently, to wrong decisions; moreover, the term “optimal” is itself quite ambiguous. Furthermore, various inherent characteristics of the datasets, such as clusters that overlap or clusters containing subclusters, will most often increase the difficulty of the task. Thus, the methods used to detect similarities and the parameter selection of the partition algorithm have a major impact on the quality of the groups and the identification of their optimal number. Given that each dataset constitutes a rather distinct case, validity indices are indicators introduced to address the problem of selecting such an optimal number of clusters. In this work, an extensive set of well-known validity indices, based on the approach of the so-called relative criteria, are examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one on real-world data and one on artificially generated data. To ensure a certain degree of difficulty, both real-world and generated data were selected to exhibit variations and inhomogeneity. Each index is deployed under 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.
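
As an example of how a relative validity index is used to choose the number of clusters, the sketch below computes the mean silhouette width for several candidate partitions and keeps the one that scores highest. The silhouette is a well-known index of this kind, but the abstract does not list the 26 measures examined, so its choice here is an assumption, as are the function names and the use of Euclidean distance.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette width: for each point, (b - a) / max(a, b), where a is
    the mean distance to its own cluster and b the mean distance to the
    nearest other cluster."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pick_best(points, candidates):
    """Return the key of the candidate labelling with the highest silhouette.
    `candidates` maps a name (for example the number of clusters) to labels."""
    return max(candidates, key=lambda name: silhouette(points, candidates[name]))
```

A typical use is to cluster the same data for several values of k, pass the resulting label lists as `candidates`, and take the returned key as the suggested number of clusters. Different indices can disagree on the same data, which is the motivation for the comparative study described above.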

Finding the Optimal Number of Clusters for Word Sense Disambiguation

Text, Speech and Dialogue - Lecture Notes in Computer Science ◽

10.1007/978-3-642-23538-2_49 ◽

2011 ◽

pp. 388-394

Author(s):

Bartosz Broda ◽

Paweł Kędzia

Keyword(s):

Word Sense Disambiguation ◽

Optimal Number ◽

Word Sense ◽

Number Of Clusters ◽

Sense Disambiguation ◽

Optimal Number Of Clusters

A Novel Method for Identifying Optimal Number of Clusters with Marginal Differential Entropy

Web-Age Information Management - Lecture Notes in Computer Science ◽

10.1007/978-3-642-39527-7_36 ◽

2013 ◽

pp. 371-382 ◽

Cited By ~ 1

Author(s):

Bo Shu ◽

Wei Chen ◽

Zhendong Niu ◽

Changmin Zhang ◽

Xiaotian Jiang

Keyword(s):

Optimal Number ◽

Differential Entropy ◽

Number Of Clusters ◽

Novel Method ◽

Optimal Number Of Clusters
