Sentence Embedding Based Semantic Clustering Approach for Discussion Thread Summarization

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Atif Khan ◽  
Qaiser Shah ◽  
M. Irfan Uddin ◽  
Fasee Ullah ◽  
Abdullah Alharbi ◽  
...  

Huge volumes of data on the web come from discussion forums, which contain millions of threads. Discussion threads are a valuable source of knowledge for Internet users, as they contain information about numerous topics. A discussion thread on a single topic can comprise a huge number of reply posts, which makes it hard for forum users to scan all the replies and determine the most relevant ones in the thread. At the same time, it is also hard for forum users to manually summarize the bulk of reply posts in order to get the gist of the discussion. Thus, automatically extracting the most relevant replies from a discussion thread and combining them to form a summary is a challenging task. Motivated by this, this study proposes a sentence embedding based clustering approach for discussion thread summarization. The proposed approach works as follows: first, a word2vec model is employed to represent the reply sentences in the discussion thread as sentence embeddings/sentence vectors. Next, the K-medoid clustering algorithm is applied to group semantically similar reply sentences in order to reduce overlapping reply sentences. Finally, various text quality features are utilized to rank the reply sentences within each cluster, and the highest-ranked reply sentences are picked from all clusters to form the thread summary. Two standard forum datasets are used to assess the effectiveness of the suggested approach. Empirical results confirm that the proposed sentence embedding based clustering approach performs better than other summarization methods in terms of mean precision, recall, and F-measure.
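The embed-cluster-rank pipeline described above can be sketched in a few dozen lines. The toy 2-d word vectors, the length-based ranking feature, and the tiny reply set below are illustrative stand-ins, not the paper's trained word2vec model or its full feature set:

```python
# Sketch: sentence embedding -> K-medoid clustering -> pick top reply per cluster.
import math

WORD_VECS = {  # hypothetical 2-d "word2vec" vectors, hand-made for the demo
    "battery": (1.0, 0.1), "charge": (0.9, 0.2), "screen": (0.1, 1.0),
    "display": (0.2, 0.9), "life": (0.8, 0.3), "bright": (0.1, 0.8),
}

def embed(sentence):
    """Sentence embedding = average of the known word vectors."""
    vecs = [WORD_VECS[w] for w in sentence.lower().split() if w in WORD_VECS]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def dist(a, b):
    return math.dist(a, b)

def k_medoids(points, k, iters=10):
    """Plain PAM-style K-medoid clustering with deterministic init."""
    medoids = list(range(k))  # first k points as initial medoids
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            clusters[min(medoids, key=lambda m: dist(p, points[m]))].append(i)
        new = []
        for m, members in clusters.items():
            if not members:
                new.append(m)
                continue
            # new medoid = member minimizing total distance to the others
            new.append(min(members, key=lambda i: sum(
                dist(points[i], points[j]) for j in members)))
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return clusters

replies = ["battery life charge", "charge battery",
           "screen bright display", "display screen"]
points = [embed(r) for r in replies]
clusters = k_medoids(points, k=2)
# Crude stand-in for the paper's quality features: keep the longest reply per cluster.
summary = [max((replies[i] for i in members), key=len)
           for members in clusters.values()]
print(summary)
```

Clustering first and then ranking inside each cluster is what keeps near-duplicate replies from crowding the summary: only one representative per semantic group survives.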

Author(s):  
R. R. Gharieb ◽  
G. Gendy ◽  
H. Selim

In this paper, the standard hard C-means (HCM) clustering approach to image segmentation is modified by incorporating a weighted membership Kullback–Leibler (KL) divergence and local data information into the HCM objective function. The membership KL divergence, used for fuzzification, measures the proximity between each cluster membership function of a pixel and the locally smoothed value of that membership in the pixel's vicinity. The fuzzification weight is a function of the pixel-to-cluster-center distances. The pixel-to-cluster-center distance used is composed of the original pixel data distance plus a fraction of the distance generated from the locally smoothed pixel data. It is shown that the obtained membership function of a pixel is proportional to the locally smoothed membership function of this pixel multiplied by an exponentially distributed function of the negative pixel distance relative to the minimum distance provided by the cluster center nearest to the pixel. Therefore, by incorporating the locally smoothed membership and data information in addition to the relative distance, which is more tolerant to additive noise than the absolute distance, the proposed algorithm has a threefold noise-handling process. The presented algorithm, named local data and membership KL divergence based fuzzy C-means (LDMKLFCM), is tested on synthetic and real-world noisy images, and its results are compared with those of several FCM-based clustering algorithms.
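For contrast with LDMKLFCM, here is the standard fuzzy C-means baseline it modifies, on 1-D "pixel" intensities. The KL-divergence fuzzifier and the locally smoothed terms of the paper are not reproduced; this is the plain m-exponent FCM with made-up data:

```python
# Baseline fuzzy C-means (not LDMKLFCM): alternate membership and center updates.
def fcm(data, c=2, m=2.0, iters=50):
    centers = [min(data), max(data)][:c]  # deterministic init for the demo
    U = []
    for _ in range(iters):
        # membership update: u_ik proportional to d_ik^(-2/(m-1))
        U = []
        for x in data:
            d = [abs(x - v) + 1e-9 for v in centers]
            inv = [dk ** (-2.0 / (m - 1)) for dk in d]
            s = sum(inv)
            U.append([v / s for v in inv])
        # center update: mean of the data weighted by u^m
        centers = [
            sum(U[i][k] ** m * data[i] for i in range(len(data)))
            / sum(U[i][k] ** m for i in range(len(data)))
            for k in range(c)
        ]
    return centers, U

pixels = [10, 12, 11, 90, 95, 92]
centers, U = fcm(pixels)
labels = [max(range(2), key=lambda k: u[k]) for u in U]
print([round(v) for v in centers], labels)
```

The paper's modification replaces the `u ** m` fuzzifier with a weighted KL divergence between each membership and its locally smoothed version, which is what gives the algorithm its noise tolerance on images.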


Author(s):  
Manmohan Singh ◽  
Rajendra Pamula ◽  
Alok Kumar

There are various applications of clustering in the fields of machine learning, data mining, data compression, and pattern recognition. Existing techniques such as Lloyd's algorithm (sometimes called k-means) suffer from convergence to a local optimum with no approximation guarantee. To overcome these shortcomings, this paper offers an efficient k-means clustering approach for stream data mining. The coreset is a popular and fundamental concept for k-means clustering on stream data. In each step, the reduction determines a coreset of the inputs and bounds its error in terms of the number of input points P; by the nested property of coresets, the errors of successive steps compound, so even a small reduction in the per-step error makes the final coreset substantially more accurate. This motivated the authors to propose a new coreset-reduction algorithm. The proposed algorithm was executed on the Covertype, Spambase, Census 1990, Bigcross, and Tower datasets. Our algorithm outperforms competitive algorithms such as StreamKM++, BICO (BIRCH meets Coresets for k-means clustering), and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies).
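The merge-and-reduce structure behind streaming coresets can be sketched as follows. The bucket size, the farthest-first reduction rule, and the 1-D data are simplified placeholders in the spirit of StreamKM++/BICO, not any of the papers' actual reduction steps:

```python
# Toy merge-and-reduce coreset stream over 1-D points.
def reduce_to_coreset(weighted_points, k):
    """Pick k representatives farthest-first; fold each point's weight
    into its nearest representative."""
    pts = [p for p, w in weighted_points]
    reps = [pts[0]]
    while len(reps) < k:
        reps.append(max(pts, key=lambda p: min(abs(p - r) for r in reps)))
    acc = {r: 0 for r in reps}
    for p, w in weighted_points:
        acc[min(reps, key=lambda r: abs(r - p))] += w
    return sorted(acc.items())

def stream_coreset(stream, bucket=4, k=2):
    buckets = []  # simplified: merge any two full buckets immediately
    for i in range(0, len(stream), bucket):
        buckets.append(reduce_to_coreset([(p, 1) for p in stream[i:i + bucket]], k))
        while len(buckets) >= 2:
            a, b = buckets.pop(), buckets.pop()
            buckets.append(reduce_to_coreset(a + b, k))
    return buckets[0]

coreset = stream_coreset([1, 2, 1, 2, 50, 51, 49, 50], k=2)
print(coreset)  # two weighted representatives; weights sum to the stream length
```

Each reduction introduces some error, and merged coresets are coresets of coresets, which is exactly why the abstract stresses that shrinking the per-step error pays off multiplicatively in the final summary.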


Author(s):  
Muhamad Alias Md. Jedi ◽  
Robiah Adnan

TCLUST is a statistical clustering method based on a modification of the trimmed k-means clustering algorithm. It is called a “crisp” clustering approach because each observation can be either eliminated (trimmed) or assigned to a group. TCLUST strengthens the group assignment by putting constraints on the cluster scatter matrices. The emphasis in this paper is on restricting the eigenvalues λ of the scatter matrix. The idea of imposing constraints is to maximize the log-likelihood function of the spurious-outlier model. A review of different robust clustering approaches is presented as a comparison to the TCLUST method. This paper discusses the nature of the TCLUST algorithm, how to determine the number of clusters or groups properly, and how to measure the strength of group assignments. Finally, the R package for TCLUST implements these types of scatter restriction, making the algorithm more flexible in choosing the number of clusters and the trimming proportion.
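The "crisp" trimming idea that TCLUST builds on can be shown with a bare-bones trimmed k-means in 1-D. TCLUST additionally constrains the eigenvalues of the cluster scatter matrices, which this sketch omits; the data and trimming proportion are invented:

```python
# Trimmed k-means: at each step, discard the alpha fraction of points
# farthest from every center, then update centers from the kept points only.
def trimmed_kmeans(data, k=2, alpha=0.2, iters=20):
    centers = sorted(data)[::max(len(data) // k, 1)][:k]  # spread-out init
    keep = len(data) - int(alpha * len(data))  # points retained per step
    for _ in range(iters):
        ranked = sorted(data, key=lambda x: min(abs(x - c) for c in centers))
        kept = ranked[:keep]  # the trimmed tail never votes on centers
        groups = [[] for _ in centers]
        for x in kept:
            groups[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 100.0]  # 100.0 is a gross outlier
centers = trimmed_kmeans(data, k=2, alpha=0.2)
print(centers)
```

Because the outlier is trimmed rather than absorbed, the centers land on the two genuine groups; untrimmed k-means would drag one center toward 100.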


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Lopamudra Dey ◽  
Sanjay Chakraborty

The significance and applications of the “clustering” technique span various fields. Clustering is an unsupervised process in data mining, which is why properly evaluating the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as a cluster validity measure. Different types of indices are used to solve different types of problems, and index selection depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm, analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on real-time air pollution, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. The paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also identifies which algorithm is most desirable among them for producing properly compact clusters on these particular real-life datasets. It examines the behaviour of these clustering algorithms with respect to the validation indices and presents the evaluation results in mathematical and graphical form.
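The intracluster/intercluster measures the paper evaluates reduce to two numbers: within-cluster spread (lower is better) and between-centroid separation (higher is better). A minimal sketch on made-up 1-D clusterings, using mean absolute deviation as the spread measure:

```python
# Intra-cluster spread vs inter-cluster separation for a clustering.
def validity(clusters):
    cents = [sum(c) / len(c) for c in clusters]
    # worst within-cluster spread (mean absolute deviation from centroid)
    intra = max(sum(abs(x - cents[i]) for x in c) / len(c)
                for i, c in enumerate(clusters))
    # closest pair of centroids
    inter = min(abs(a - b) for i, a in enumerate(cents)
                for b in cents[i + 1:])
    return intra, inter

good = [[1, 2, 3], [10, 11, 12]]   # compact, well separated
bad = [[1, 2, 10], [3, 11, 12]]    # same points, mixed-up assignment
gi, ge = validity(good)
bi, be = validity(bad)
print(gi, ge, bi, be)
```

A compact, well-separated clustering scores low intra and high inter, which is the signal the validity indices in the paper (in richer forms such as Dunn or Davies-Bouldin style ratios) are built on.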


2021 ◽  
pp. 2141001
Author(s):  
Sanqiang Wei ◽  
Hongxia Hou ◽  
Hua Sun ◽  
Wei Li ◽  
Wenxia Song

The plots in certain literary works are very complicated and hinder readers from understanding them. Tools should therefore be proposed that support readers' comprehension of complex literary works by providing the most important information to them. A human reader must capture multiple levels of abstraction and meaning to formulate an understanding of a document. Hence, in this paper, an Improved K-means clustering algorithm (IKCA) has been proposed for literary word classification. For text data, the words that can express exact semantics within a class are generally better features. This paper uses the proposed technique to capture numerous cluster centroids for every class and then selects the high-frequency words in the centroids as the text features for classification. Furthermore, neural networks have been used to classify text documents and K-means to cluster them; the model combines unsupervised and supervised techniques to identify the similarity between documents. The numerical results show that the suggested model improves quality compared with the existing algorithm and the K-means algorithm: accuracy comparison of ALA and IKCA (95.2%), clustering time of less than 2 hours, success rate of 97.4%, and performance ratio of 98.1%.
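The feature-selection idea behind IKCA, keeping high-frequency words around each class's cluster centroids as classification features, can be sketched with per-class word counts standing in for the per-class cluster centroids (the toy corpus and single-cluster-per-class simplification are mine, not the paper's):

```python
# Select top-frequency words per class as features, then classify by overlap.
from collections import Counter

def class_features(docs_by_class, top=2):
    feats = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d.split())
        feats[label] = [w for w, _ in counts.most_common(top)]
    return feats

def classify(doc, feats):
    words = set(doc.split())
    # label whose feature words overlap the document most
    return max(feats, key=lambda label: len(words & set(feats[label])))

corpus = {
    "sports": ["goal match team goal", "team match win"],
    "tech": ["code bug code release", "bug fix code"],
}
feats = class_features(corpus)
print(feats, classify("match goal today", feats))
```

In the paper, K-means first splits each class into several clusters so that the selected words cover distinct sub-topics; here one count table per class plays that role.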


Author(s):  
Xiaolong Gong ◽  
Linpeng Huang ◽  
Fuwei Wang

Real web datasets are often associated with multiple views, such as long and short commentaries, users' preferences, and so on. However, with the rapid growth of user-generated text, each view of a dataset has a large feature space, which creates a computational challenge during the matrix decomposition process. In this paper, we propose a novel multi-view clustering algorithm based on non-negative matrix factorization that uses a feature sampling strategy to reduce the complexity of the iteration process. In particular, our method exploits unsupervised semantic information in the learning process to capture intrinsic similarity through a graph regularization. Moreover, we use the Hilbert-Schmidt Independence Criterion (HSIC) to explore the unsupervised semantic diversity information among the multi-view contents of one web item. The overall objective is to minimize the loss function of multi-view non-negative matrix factorization combined with an intra-semantic similarity graph regularizer and an inter-semantic diversity term. Compared with some state-of-the-art methods, we demonstrate the effectiveness of our proposed method on a large real-world dataset, Doucom, and three other smaller datasets.
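The core building block here, non-negative matrix factorization, can be shown in its single-view form with Lee-Seung multiplicative updates. The paper's method extends this loss with a per-view graph regularizer and an HSIC diversity term across views, both omitted in this sketch; the data matrix and deterministic init are invented:

```python
# Minimal single-view NMF: V (n x m) ~ W (n x r) @ H (r x m), all non-negative.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def T(A):
    return [list(r) for r in zip(*A)]

def nmf(V, r=2, iters=500, eps=1e-9):
    n, m = len(V), len(V[0])
    # deterministic positive init (pseudo-random pattern, no RNG needed)
    W = [[((7 * i + 3 * j) % 5 + 1) / 5 for j in range(r)] for i in range(n)]
    H = [[((7 * i + 3 * j) % 5 + 1) / 5 for j in range(m)] for i in range(r)]
    for _ in range(iters):
        WtV, WtWH = matmul(T(W), V), matmul(matmul(T(W), W), H)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(m)]
             for i in range(r)]
        VHt, WHHt = matmul(V, T(H)), matmul(W, matmul(H, T(H)))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(r)]
             for i in range(n)]
    return W, H

V = [[2, 1, 0], [4, 2, 0], [0, 1, 3], [2, 2, 3]]  # exactly rank 2
W, H = nmf(V)
R = matmul(W, H)
err = sum((V[i][j] - R[i][j]) ** 2 for i in range(4) for j in range(3))
print(round(err, 4))
```

The multiplicative form keeps W and H non-negative throughout, which is what makes the factors interpretable as cluster indicators; the multi-view version of the paper simply sums one such loss per view plus its regularizers.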


2019 ◽  
Vol 16 (4) ◽  
pp. 563-593 ◽  
Author(s):  
Aman Bhatnagar ◽  
Prem Vrat ◽  
Ravi Shankar

Purpose
The purpose of this paper is to determine compatibility groups of different fruits and vegetables that can be stored and transported together, based upon their requirements for temperature, relative humidity, odour and ethylene production. Pre-cooling, which is necessary to prepare the commodity for subsequent shipping and safe storage, is also discussed.

Design/methodology/approach
The methodology attempts to form clusters/groups for storing together 43 identified fruits and vegetables based on four important parameters: temperature, relative humidity, odour and ethylene production. An agglomerative hierarchical clustering algorithm is used to build a cluster hierarchy that is commonly displayed as a tree diagram called a dendrogram. This is further analyzed using K-means clustering to find clusters of comparable spatial extent. The results obtained from the analytics are compared with the available data on grouping fruits and vegetables.

Findings
This study investigates the usefulness and efficacy of the proposed clustering approach for the storage and transportation of different fruits and vegetables, which will eventually save the huge investment made in infrastructure components and energy consumption. This will enable investors to use the space more effectively while also reducing food wastage.

Research limitations/implications
Owing to limited research and development (R&D) data on the storage parameters of different fruits and vegetables with respect to temperature, relative humidity, ethylene production/sensitivity, odour and pre-cooling, information from various available sources has been utilized. India needs to develop its own crop-specific R&D data, since its soil, water and environmental conditions differ from those of other countries. Because of the limited availability of research data, various multi-criteria approaches used in other areas have been applied in this paper. Future studies might consider other relevant variables depending on R&D and data availability.

Practical implications
With the increase in population, the demand for food is also increasing. To meet such growing demand and provide quality, nutritional food, it is important to have a clear methodology for compatibility grouping when utilizing the available storage space for multi-commodity produce and during transportation. The methodology shall enable practitioners to understand the importance of temperature, humidity, odour and ethylene sensitivity for the storage and transportation of perishables.

Social implications
This approach shall be useful for decision making by farmers, Farmer Producer Organizations, cold-storage owners, practicing managers, policy makers and researchers in the areas of cold-chain management, and will provide an opportunity to use the available space in cold storage for storing different fruits and vegetables, thereby facilitating optimum use of infrastructure and resources. This will enable investors to utilize the space more effectively and also reduce food wastage. It shall also help organizations manage their logistics activities to gain competitive advantage.

Originality/value
The proposed model would help decision makers resolve issues related to the selection of different perishable commodities for storing together. From the secondary research, few papers have been found in which such a multi-criteria clustering approach has been applied to the storage of fruits and vegetables incorporating four important parameters relevant to storage and transportation.
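The agglomerative step of the methodology can be sketched on storage-requirement vectors. The four commodities and their (temperature °C, relative humidity %) values below are invented for illustration; the paper groups 43 commodities on four parameters:

```python
# Agglomerative clustering with centroid linkage: repeatedly merge the
# closest pair of clusters until the target number of groups remains.
def agglomerate(items, target=2):
    clusters = [[name] for name in items]
    def centroid(c):
        vecs = [items[n] for n in c]
        return tuple(sum(v) / len(vecs) for v in zip(*vecs))
    def d(a, b):
        ca, cb = centroid(a), centroid(b)
        return sum((x - y) ** 2 for x, y in zip(ca, cb)) ** 0.5
    while len(clusters) > target:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: d(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)  # merge closest pair
    return [sorted(c) for c in clusters]

# hypothetical (temperature C, relative humidity %) storage requirements
items = {"apple": (0, 92), "pear": (0, 93),
         "tomato": (13, 90), "cucumber": (12, 92)}
groups = agglomerate(items)
print(groups)
```

Recording the merge order (rather than stopping at a target count) yields the dendrogram the paper displays; cutting it at different heights gives candidate groupings that K-means then refines.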


Author(s):  
Pankaj Kailas Bhole ◽  
A. J. Agrawal

Text summarization is an old challenge in text mining but is in dire need of researchers' attention in the areas of computational intelligence, machine learning and natural language processing. We extract a set of features from each sentence that helps identify its importance in the document. Reading the full text every time is time consuming, and a clustering approach is useful for deciding which types of data are present in a document. In this paper we introduce the concept of k-means clustering for natural language processing of text for word matching, and, in order to extract meaningful information from a large set of offline documents, a data mining document clustering algorithm is adopted.
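The per-sentence feature scoring mentioned above is the heart of extractive summarization. A minimal sketch using two common features, average term frequency and sentence position, with illustrative equal weights (the paper's actual feature set is not specified beyond this):

```python
# Score sentences by avg term frequency + position, then rank for extraction.
from collections import Counter

def score_sentences(doc):
    sents = [s.strip() for s in doc.split(".") if s.strip()]
    tf = Counter(w.lower() for s in sents for w in s.split())
    scores = []
    for i, s in enumerate(sents):
        words = s.lower().split()
        freq = sum(tf[w] for w in words) / len(words)  # avg term frequency
        position = 1.0 / (i + 1)                       # earlier = better
        scores.append((freq + position, s))
    return [s for _, s in sorted(scores, reverse=True)]

doc = ("Clustering groups similar data. Clustering helps summarization. "
       "The weather is nice")
ranked = score_sentences(doc)
print(ranked[0])
```

A summary is then the top few ranked sentences; clustering the sentences first, as the paper proposes, prevents the top slots from being filled with near-duplicates.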


Author(s):  
Mohamad Farhan Mohamad Mohsin ◽  
Mohd Noor Abdul Hamid ◽  
Nurakmal Ahmad Mustaffa ◽  
Razamin Ramli ◽  
Kamarudin Abdullah

CSR UUMWiFi is a CSR project under Universiti Utara Malaysia (UUM) that provides unlimited free internet connection for the Changlun community. Launched in 2015, the service has accumulated a huge number of users with diverse backgrounds and interests. This paper aims to uncover interesting service-user behavior by mining the usage data. To achieve that, the access logs for 3 months, covering 24,000 online users, were downloaded from the Wi-Fi network server, pre-processed and analyzed. The findings reveal that there are many loyal users who have been using this service on a daily basis since 2015, and that the community spends 20-60 minutes per session. Besides that, social media and leisure-based applications such as YouTube, Facebook, Instagram, chatting applications, and miscellaneous web applications were among the top applications accessed by the Changlun community, contributing to huge data usage. It was also found that only a few users used the CSR UUMWiFi for academic or business purposes. The identified patterns benefit the management team in providing a better quality of service for the community in future and in setting up new policies for the service.


First Monday ◽  
2019 ◽  
Author(s):  
Davi Oliveira Serrano De Andrade ◽  
Anderson Almeida Firmino ◽  
Cláudio de Souza Baptista ◽  
Hugo Feitosa De Figueirêdo

An event can be defined as a happening that gathers people with a common goal over a period of time and in a certain place. This paper presents a new method to retrieve social events through annotations in spatio-temporal photo collections, known as STEve-PR (Spatio-Temporal EVEnt Photo Retrieval). The proposed technique uses a clustering algorithm to gather similar photos by considering the location, date and time of the photos, so that photos belonging to the same event fall into the same cluster. STEve-PR uses the spatial clusters thus created to propagate event annotations between photos in the same cluster, and employs TF-IDF similarity between tags to find the spatial cluster with the highest similarity for photos without a geographical location. We evaluated our approach on a public database.
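The two mechanisms described, spatio-temporal grouping and tag propagation within a cluster, can be sketched as follows. The distance/time thresholds, the single-pass grouping rule, and the toy photos are invented for illustration, not STEve-PR's actual clustering algorithm:

```python
# Group photos by rough spatial and temporal proximity, then propagate
# event tags to untagged photos in the same cluster.
def cluster_photos(photos, max_km=1.0, max_hours=6):
    clusters = []
    for p in photos:
        for c in clusters:
            q = c[0]  # compare against the cluster's first photo
            near = (abs(p["lat"] - q["lat"]) + abs(p["lon"] - q["lon"])
                    < max_km / 111)  # ~111 km per degree, crude
            close = abs(p["hour"] - q["hour"]) <= max_hours
            if near and close:
                c.append(p)
                break
        else:
            clusters.append([p])
    for c in clusters:  # propagate event tags to untagged members
        tags = {t for p in c for t in p.get("tags", [])}
        for p in c:
            p.setdefault("tags", sorted(tags))
    return clusters

photos = [
    {"lat": 40.0, "lon": -3.0, "hour": 10, "tags": ["concert"]},
    {"lat": 40.0001, "lon": -3.0001, "hour": 11},   # untagged, same event
    {"lat": 48.8, "lon": 2.3, "hour": 10, "tags": ["expo"]},
]
clusters = cluster_photos(photos)
print(len(clusters), photos[1]["tags"])
```

For photos with no coordinates at all, the paper instead matches their tags against each cluster's tag set via TF-IDF similarity; that lookup would replace the spatial test above.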

