Weighted Mutual Information for Aggregated Kernel Clustering

Nezamoddin N. Kachouie; Meshal Shutaywi

doi:10.3390/e22030351

Weighted Mutual Information for Aggregated Kernel Clustering

Entropy ◽

10.3390/e22030351 ◽

2020 ◽

Vol 22 (3) ◽

pp. 351

Author(s):

Nezamoddin N. Kachouie ◽

Meshal Shutaywi

Keyword(s):

Mutual Information ◽

Kernel Function ◽

Dimensional Space ◽

Data Sets ◽

Clustering Methods ◽

Main Challenge ◽

Kernel Clustering ◽

Clustering Data ◽

Project Data ◽

The Right

Background: A common task in machine learning is clustering data into groups based on similarity. Clustering methods can be divided into two groups: linear and nonlinear. A commonly used linear clustering method is K-means. Its extension, kernel K-means, is a nonlinear technique that uses a kernel function to project the data into a higher-dimensional space, where the projected data are then clustered. Different kernels perform differently on different data sets. Methods: A kernel function may suit one application yet project the data poorly for another, so choosing the right kernel for an arbitrary data set is a challenging task. One way to address this is to aggregate the clustering results so that the final result is impartial to the choice of kernel function. The main challenge is then how to aggregate the clustering results; a potential solution is to combine them using a weight function. In this work, we introduce Weighted Mutual Information (WMI), which calculates a weight for each clustering method based on its performance and uses these weights to combine the results. The performance of each method is evaluated on a training set with known labels. Results: We applied the proposed Weighted Mutual Information to four data sets that cannot be linearly separated. We also tested the method under different noise conditions. Conclusions: Our results show that the proposed Weighted Mutual Information method is impartial, does not rely on a single kernel, and performs better than each individual kernel, especially in high noise.
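The aggregation idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the normalization (mutual information divided by the geometric mean of the entropies), the majority-label alignment step, and the function names `nmi`, `align`, and `weighted_vote` are all assumptions made for illustration. Each clustering is scored against the known training labels, the scores are normalized into weights, and the aligned labels are combined by weighted vote.

```python
import math
from collections import Counter, defaultdict

def entropy(a):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(a)
    return -sum((c / n) * math.log(c / n) for c in Counter(a).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def nmi(a, b):
    """Assumed normalization: MI / sqrt(H(a) * H(b)); 0 if either entropy is 0."""
    denom = math.sqrt(entropy(a) * entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 0.0

def align(pred, train_idx, train_labels):
    """Map each cluster id to the majority known label among its training points."""
    votes = defaultdict(Counter)
    for i, y in zip(train_idx, train_labels):
        votes[pred[i]][y] += 1
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    return [mapping.get(c) for c in pred]  # None if a cluster has no training point

def weighted_vote(clusterings, train_idx, train_labels):
    """Weight each clustering by its training-set NMI, then vote per point."""
    weights = [nmi([pred[i] for i in train_idx], train_labels)
               for pred in clusterings]
    total = sum(weights)
    if total > 0:
        weights = [w / total for w in weights]
    else:  # no clustering carries information: fall back to equal weights
        weights = [1 / len(clusterings)] * len(clusterings)
    aligned = [align(pred, train_idx, train_labels) for pred in clusterings]
    combined = []
    for j in range(len(clusterings[0])):
        score = Counter()
        for w, labels in zip(weights, aligned):
            if labels[j] is not None:
                score[labels[j]] += w
        combined.append(score.most_common(1)[0][0] if score else None)
    return weights, combined
```

In this sketch, a clustering that agrees with the training labels up to a relabeling receives a weight near 1, while one that is independent of them receives a weight near 0 and has no influence on the vote.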


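Analisis Algoritma K-Means Clustering Untuk Menentukan Strategi Promosi Penjualan Sepeda Motor Studi Kasus PT. Alfa Scorpii [Analysis of the K-Means Clustering Algorithm to Determine a Motorcycle Sales Promotion Strategy: Case Study of PT. Alfa Scorpii]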

JUTI UNISI ◽

10.32520/juti.v4i1.1087 ◽

2020 ◽

Vol 4 (1) ◽

pp. 1-8

Author(s):

Abdul Muni

Keyword(s):

Data Mining ◽

Large Data ◽

Sales Promotion ◽

Large Data Sets ◽

Data Sets ◽

Promotion Strategy ◽

Corporate Earnings ◽

Clustering Data ◽

The Right ◽

The Cost

PT. Alfa Scorpii is a private-sector company engaged in motorcycle sales. Its data are underused: sales reports serve only as reports. A promotion strategy aims to increase the company's income and is directly tied to cost. Data mining extracts knowledge from large data sets, a process also known as knowledge discovery or pattern recognition. Of the many data mining methods, this study uses the K-Means clustering algorithm. Clustering the data allows the marketing department to direct its motorcycle sales promotion strategy at the right new customers and improve corporate earnings.


Stock Data Clustering of Food and Beverage Company

IJCCS (Indonesian Journal of Computing and Cybernetics Systems) ◽

10.22146/ijccs.2279 ◽

2007 ◽

Vol 1 (2) ◽

Author(s):

Shofwatul Uyun ◽

Subanar Subanar

Keyword(s):

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Clustering Methods ◽

Cluster Validity ◽

Fuzzy C Means ◽

Food And Beverage ◽

Hard Clustering ◽

Clustering Data ◽

Support Decision Making

Abstract: Cluster analysis can be defined as identifying groups of similar objects to discover the distribution of patterns and interesting correlations in large data sets. Cluster analysis is important in the fields of pattern recognition and pattern classification. Over the years, many methods have been developed for clustering data. In general, clustering methods fall into two categories: fuzzy clustering and hard clustering. Fuzzy C-Means is one of many clustering methods based on the fuzzy approach, while K-Means and K-Medoid are clustering methods based on the crisp approach. This study aims to apply the Fuzzy C-Means, K-Means and K-Medoid methods to cluster stock data in a food and beverage company. The main goal is to find a clustering method that produces optimal clusters. The resulting clusters are validated using Dunn's Index (DI). It is expected that the results of this research can be used to support decision making in the food and beverage company. Keywords: Clustering, Fuzzy C-Means, K-Means, K-Medoid, Cluster Validity, Dunn's Index (DI)
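For readers unfamiliar with the validity measure named in this abstract, the following Python sketch computes one common variant of Dunn's Index: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The abstract does not say which variant the study used, so the choice of Euclidean distance and of these particular separation and diameter definitions is an assumption, as is the function name `dunn_index`.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    """Dunn's Index for a hard clustering; higher means tighter, better-separated clusters.

    Assumes Euclidean distance, nearest-pair separation between clusters,
    and farthest-pair diameter within clusters. Needs at least two clusters.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Largest distance between any two points that share a cluster.
    max_diameter = max(
        (math.dist(p, q)
         for members in clusters.values()
         for p, q in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between any two points in different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters.values(), 2)
        for p in a
        for q in b
    )
    return min_separation / max_diameter if max_diameter > 0 else math.inf
```

Comparing the index across the label vectors produced by several clustering methods on the same points is one way to rank them, under the variant assumed here.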


Details (Don't) Matter: Isolating Cluster Information in Deep Embedded Spaces

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/389 ◽

2021 ◽

Author(s):

Lukas Miklautz ◽

Lena G. M. Bauer ◽

Dominik Mautz ◽

Sebastian Tschiatschek ◽

Christian Böhm ◽

...

Keyword(s):

Dimensional Space ◽

Representation Learning ◽

Specific Information ◽

Data Sets ◽

Clustering Methods ◽

Cluster Performance ◽

Clustering Techniques ◽

General Variation ◽

Improved Performance ◽

Low Dimensional

Deep clustering techniques combine representation learning with clustering objectives to improve their performance. Among existing deep clustering techniques, autoencoder-based methods are the most prevalent ones. While they achieve promising clustering results, they suffer from an inherent conflict between preserving details, as expressed by the reconstruction loss, and finding similar groups by ignoring details, as expressed by the clustering loss. This conflict leads to brittle training procedures, dependence on trade-off hyperparameters and less interpretable results. We propose our framework, ACe/DeC, that is compatible with Autoencoder Centroid based Deep Clustering methods and automatically learns a latent representation consisting of two separate spaces. The clustering space captures all cluster-specific information and the shared space explains general variation in the data. This separation resolves the above-mentioned conflict and allows our method to learn both detailed reconstructions and cluster-specific abstractions. We evaluate our framework with extensive experiments to show several benefits: (1) cluster performance – on various data sets we outperform relevant baselines; (2) no hyperparameter tuning – this improved performance is achieved without introducing new clustering-specific hyperparameters; (3) interpretability – isolating the cluster-specific information in a separate space is advantageous for data exploration and interpreting the clustering results; and (4) dimensionality of the embedded space – we automatically learn a low-dimensional space for clustering. Our ACe/DeC framework isolates cluster information and increases stability and interpretability, while improving cluster performance.


Clustering Algorithm for Arbitrary Data Sets

Encyclopedia of Artificial Intelligence ◽

10.4018/978-1-59904-849-9.ch046 ◽

2011 ◽

pp. 297-303

Author(s):

Yu-Chen Song ◽

Hai-Dong Meng

Keyword(s):

Arbitrary Shape ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Interaction Networks ◽

Climate Indices ◽

Data Sets ◽

Clustering Methods ◽

Efficient Manner ◽

Protein Protein Interaction ◽

Clustering Data

Clustering analysis is an intrinsic component of numerous applications, including pattern recognition, life sciences, image processing, web data analysis, earth sciences, and climate research. As an example, consider the biology domain. In any living cell that undergoes a biological process, different subsets of its genes are expressed in different stages of the process. To facilitate a deeper understanding of these processes, a clustering algorithm was developed (Ben-Dor, Shamir, & Yakhini, 1999) that enabled detailed analysis of gene expression data. Recent advances in proteomics technologies, such as two-hybrid, phage display and mass spectrometry, have enabled the creation of detailed maps of biomolecular interaction networks. To further understanding in this area, a clustering mechanism that detects densely connected regions in large protein-protein interaction networks that may represent molecular complexes was constructed (Bader & Hogue, 2003). In the interpretation of remote sensing images, clustering algorithms (Sander, Ester, Kriegel, & Xu, 1998) have been employed to recognize and understand the content of such images. In the management of web directories, document annotation is an important task. Given a predefined taxonomy, the objective is to identify a category related to the content of an unclassified document. Self-Organizing Maps have been harnessed to influence the learning process with knowledge encoded within a taxonomy (Adami, Avesani, & Sona, 2005). Earth scientists are interested in discovering areas of the ocean that have a demonstrable effect on climatic events on land, and the SNN clustering technique (Ertöz, Steinbach, & Kumar, 2002) is one example of a technique that has been adopted in this domain. Also, scientists have developed climate indices, which are time series that summarize the behavior of selected regions of the Earth’s oceans and atmosphere.
Clustering techniques have proved crucial in the production of climate indices (Steinbach, Tan, Kumar, Klooster, & Potter, 2003). In many application domains, clusters of data are of arbitrary shape, size and density, and the number of clusters is unknown. In such scenarios, traditional clustering algorithms, including partitioning methods, hierarchical methods, density-based methods and grid-based methods, cannot identify clusters efficiently or accurately. This is a critical limitation. In the following sections, a number of clustering methods are presented and discussed, after which the design of an algorithm based on Density and Density-reachable (CADD) is presented. CADD seeks to remedy some of the deficiencies of classical clustering approaches by robustly clustering data that is of arbitrary shape, size, and density in an effective and efficient manner.


Do galactic bars depend on environment?: an information theoretic analysis of Galaxy Zoo 2

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa3665 ◽

2020 ◽

Vol 501 (1) ◽

pp. 994-1001

Author(s):

Suman Sarkar ◽

Biswajit Pandey ◽

Snehasish Bhattacharjee

Keyword(s):

Spatial Distribution ◽

Mutual Information ◽

Local Density ◽

Statistical Significance ◽

Distribution Functions ◽

Cumulative Distribution ◽

Host Galaxy ◽

Data Sets ◽

Data Set ◽

Information Theoretic

ABSTRACT We use an information theoretic framework to analyse data from the Galaxy Zoo 2 project and study if there are any statistically significant correlations between the presence of bars in spiral galaxies and their environment. We measure the mutual information between the barredness of galaxies and their environments in a volume limited sample (Mr ≤ −21) and compare it with the same in data sets where (i) the bar/unbar classifications are randomized and (ii) the spatial distribution of galaxies is shuffled on different length scales. We assess the statistical significance of the differences in the mutual information using a t-test and find that both randomization of morphological classifications and shuffling of spatial distribution do not alter the mutual information in a statistically significant way. The non-zero mutual information between the barredness and environment arises due to the finite and discrete nature of the data set that can be entirely explained by mock Poisson distributions. We also separately compare the cumulative distribution functions of the barred and unbarred galaxies as a function of their local density. Using a Kolmogorov–Smirnov test, we find that the null hypothesis cannot be rejected even at 75 per cent confidence level. Our analysis indicates that environments do not play a significant role in the formation of a bar, which is largely determined by the internal processes of the host galaxy.
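The comparison against randomized classifications described in this abstract can be sketched in Python as follows. This is a toy illustration rather than the authors' pipeline: the two variables are plain lists of discrete values (for example a barred/unbarred flag and an environment bin), and the function names, trial count, and seed are assumptions. The sketch computes the mutual information of the observed pairing and of repeatedly shuffled pairings, which gives a baseline for how much mutual information a finite sample produces by chance.

```python
import math
import random
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def shuffled_baseline(x, y, trials=200, seed=0):
    """Mutual information after randomly permuting y, repeated `trials` times.

    Shuffling breaks any real association while keeping both marginal
    distributions, so these values show the chance level for this sample size.
    """
    rng = random.Random(seed)
    y = list(y)  # copy so the caller's data is not reordered
    values = []
    for _ in range(trials):
        rng.shuffle(y)
        values.append(mutual_information(x, y))
    return values
```

If the observed value sits inside the spread of the shuffled values, the data give no evidence of an association; a formal comparison such as the t-test mentioned in the abstract would be applied on top of these numbers.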


Experiments of Image Classification Using Dissimilarity Spaces Built with Siamese Networks

Sensors ◽

10.3390/s21051573 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1573

Author(s):

Loris Nanni ◽

Giovanni Minchio ◽

Sheryl Brahnam ◽

Gianluca Maguolo ◽

Alessandra Lumini

Keyword(s):

Vector Space ◽

Image Classification ◽

Ad Hoc ◽

Feature Space ◽

Medical Data ◽

Training Data ◽

Data Sets ◽

Large Set ◽

Clustering Methods ◽

Siamese Networks

Traditionally, classifiers are trained to predict patterns within a feature space. The image classification system presented here trains classifiers to predict patterns within a vector space by combining the dissimilarity spaces generated by a large set of Siamese Neural Networks (SNNs). A set of centroids from the patterns in the training data sets is calculated with supervised k-means clustering. The centroids are used to generate the dissimilarity space via the Siamese networks. The vector space descriptors are extracted by projecting patterns onto the similarity spaces, and SVMs classify an image by its dissimilarity vector. The versatility of the proposed approach in image classification is demonstrated by evaluating the system on different types of images across two domains: two medical data sets and two animal audio data sets with vocalizations represented as images (spectrograms). Results show that the proposed system is competitive with the best-performing methods in the literature, obtaining state-of-the-art performance on one of the medical data sets, and does so without ad hoc optimization of the clustering methods on the tested data sets.


Role of thermal field in entanglement harvesting between two accelerated Unruh-DeWitt detectors

Journal of High Energy Physics ◽

10.1007/jhep07(2021)124 ◽

2021 ◽

Vol 2021 (7) ◽

Author(s):

Dipankar Barman ◽

Subhajit Barman ◽

Bibhas Ranjan Majhi

Keyword(s):

Mutual Information ◽

Critical Point ◽

Thermal Field ◽

Narrow Range ◽

Critical Value ◽

Critical Values ◽

Parallel Motion ◽

Field Temperature ◽

The Right

Abstract: We investigate the effects of field temperature T(f) on the entanglement harvesting between two uniformly accelerated detectors. For their parallel motion, the thermal nature of the fields does not produce any entanglement, and therefore the outcome is the same as in the non-thermal situation. On the contrary, T(f) affects entanglement harvesting when the detectors are in anti-parallel motion, i.e., when detectors A and B are in the right and left Rindler wedges, respectively. While for T(f) = 0 entanglement harvesting is possible for all values of A’s acceleration a_A, in the presence of temperature it is possible only within a narrow range of a_A. In (1 + 1) dimensions, the range starts from specific values and extends to infinity, and as we increase T(f), the minimum required value of a_A for entanglement harvesting increases. Moreover, above a critical value a_A = a_c, harvesting increases as we increase T(f), which is just the opposite of the behavior for accelerations below it. There are several critical values in (1 + 3) dimensions when the detectors have different accelerations. Contrary to the single range in (1 + 1) dimensions, here harvesting is possible within several discrete ranges of a_A. Interestingly, for equal accelerations, one has a single critical point, with nature quite similar to the (1 + 1) dimensional results. We also discuss the dependence of mutual information among these detectors on a_A and T(f).


Evaluation of Unsupervised Clustering Methods on Hyperspectral Image Data Sets

2018 IEEE International Conference on Progress in Informatics and Computing (PIC) ◽

10.1109/pic.2018.8706315 ◽

2018 ◽

Author(s):

Wei Zhang ◽

Zhichao Lian ◽

Chanying Huang

Keyword(s):

Hyperspectral Image ◽

Image Data ◽

Unsupervised Clustering ◽

Data Sets ◽

Clustering Methods ◽

Hyperspectral Image Data


A simple clustering technique to extract subsets of data for function approximation

Journal of Hydroinformatics ◽

10.2166/hydro.2015.065 ◽

2015 ◽

Vol 17 (5) ◽

pp. 719-732

Author(s):

Dulakshi Santhusitha Kumari Karunasingha ◽

Shie-Yui Liong

Keyword(s):

Function Approximation ◽

Prediction Models ◽

Data Extraction ◽

Single Parameter ◽

Subtractive Clustering ◽

Data Sets ◽

Clustering Methods ◽

Clustering Method ◽

Data Set ◽

Functional Relationships

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset of data is to build prediction models (in the form of approximated functional relationships) without using the entire large data set. Such smaller subsets of data are often required in the exploratory analysis stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method that requires only a single parameter to be specified, yet is shown to be as effective as the SCM. A method to find suitable values for the parameter is also proposed. Because it has only a single parameter, using the proposed clustering method is shown to be orders of magnitude more efficient than using SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied for data extraction are identified, and the proposed method is shown to be a solution for them.


Adaptive Structure Concept Factorization for Multiview Clustering

Neural Computation ◽

10.1162/neco_a_01055 ◽

2018 ◽

Vol 30 (4) ◽

pp. 1080-1103 ◽

Cited By ~ 10

Author(s):

Kun Zhan ◽

Jinhui Shi ◽

Jing Wang ◽

Haibo Wang ◽

Yuange Xie

Keyword(s):

Nonnegative Matrix Factorization ◽

State Of The Art ◽

Nonnegative Matrix ◽

Adaptive Method ◽

Data Sets ◽

Clustering Methods ◽

Normalized Mutual Information ◽

Adaptive Structure ◽

Concept Factorization ◽

Multiview Clustering

Most existing multiview clustering methods require that graph matrices in different views are computed beforehand and that each graph is obtained independently. However, this requirement ignores the correlation between multiple views. In this letter, we tackle the problem of multiview clustering by jointly optimizing the graph matrix to make full use of the data correlation between views. With the inter-view correlation, a concept factorization–based multiview clustering method is developed for data integration, and the adaptive method correlates the affinity weights of all views. This method differs from nonnegative matrix factorization–based clustering methods in that it is applicable to data sets containing negative values. Experiments are conducted to demonstrate the effectiveness of the proposed method in comparison with state-of-the-art approaches in terms of accuracy, normalized mutual information, and purity.
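Of the three evaluation measures listed in this abstract, purity is the simplest to state. The short Python sketch below uses the standard textbook definition, which the abstract itself does not spell out: each cluster is credited with its most frequent true class, and the credits are summed and divided by the number of points. The function name `purity` is chosen here for illustration.

```python
from collections import Counter, defaultdict

def purity(pred, truth):
    """Fraction of points that belong to the majority true class of their cluster."""
    members = defaultdict(Counter)
    for cluster, label in zip(pred, truth):
        members[cluster][label] += 1
    # Credit each cluster with the size of its largest true-class group.
    return sum(max(counts.values()) for counts in members.values()) / len(pred)
```

Purity does not penalize using many small clusters (one cluster per point always scores 1.0), which is one reason it is usually reported alongside measures such as normalized mutual information.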
