ClusterEnG: An interactive educational web resource for clustering big data

bioRxiv ◽

10.1101/120915 ◽

2017 ◽

Author(s):

Mohith Manjunath ◽

Yi Zhang ◽

Steve H. Yeo ◽

Omar Sobh ◽

Nathan Russell ◽

...

Keyword(s):

Big Data ◽

State Of The Art ◽

Clustering Algorithms ◽

Clustering Methods ◽

Web Resource ◽

Interactive Visualizations ◽

Data Points ◽

Similarities And Differences ◽

Intuitive Manner ◽

The Web

Summary: Clustering is one of the most common techniques used in data analysis to discover hidden structures by grouping together data points that are similar in some measure into clusters. Although there are many programs available for performing clustering, a single web resource that provides both state-of-the-art clustering methods and interactive visualizations is lacking. ClusterEnG (acronym for Clustering Engine for Genomics) provides an interface for clustering big data and interactive visualizations including 3D views, cluster selection and zoom features. ClusterEnG also aims at educating the user about the similarities and differences between various clustering algorithms and provides clustering tutorials that demonstrate potential pitfalls of each algorithm. The web resource will be particularly useful to scientists who are not conversant with computing but want to understand the structure of their data in an intuitive manner. Availability: ClusterEnG is part of a bigger project called KnowEnG (Knowledge Engine for Genomics) and is available at http://education.knoweng.org/clustereng.

ClusterEnG: an interactive educational web resource for clustering and visualizing high-dimensional data

PeerJ Computer Science ◽

10.7717/peerj-cs.155 ◽

2018 ◽

Vol 4 ◽

pp. e155 ◽

Cited By ~ 3

Author(s):

Mohith Manjunath ◽

Yi Zhang ◽

Yeonsung Kim ◽

Steve H. Yeo ◽

Omar Sobh ◽

...

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Clustering Methods ◽

Web Interface ◽

Web Resource ◽

Interactive Visualizations ◽

Data Points ◽

Clustering Data ◽

Clustering Validation ◽

Intuitive Manner

Background: Clustering is one of the most common techniques in data analysis and seeks to group together data points that are similar in some measure. Although there are many computer programs available for performing clustering, a single web resource that provides several state-of-the-art clustering methods, interactive visualizations and evaluation of clustering results is lacking. Methods: ClusterEnG (acronym for Clustering Engine for Genomics) provides a web interface for clustering data and interactive visualizations including 3D views, data selection and zoom features. Eighteen clustering validation measures are also presented to aid the user in selecting a suitable algorithm for their dataset. ClusterEnG also aims at educating the user about the similarities and differences between various clustering algorithms and provides tutorials that demonstrate potential pitfalls of each algorithm. Conclusions: The web resource will be particularly useful to scientists who are not conversant with computing but want to understand the structure of their data in an intuitive manner. The validation measures facilitate the process of choosing a suitable clustering algorithm among the available options. ClusterEnG is part of a bigger project called KnowEnG (Knowledge Engine for Genomics) and is available at http://education.knoweng.org/clustereng.
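
To illustrate what a clustering validation measure computes, the following minimal sketch implements the silhouette coefficient in plain Python. The abstract does not say which eighteen measures ClusterEnG presents, so the choice of silhouette is an assumption, and the function names and toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette over all points: near 1 is well separated, below 0 is poor."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across the groups
```

Comparing such a score across the labelings produced by different algorithms is one way a validation measure can help in choosing among them.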

A state-of-the-art survey on semantic similarity for document clustering using GloVe and density-based algorithms

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v22.i1.pp552-562 ◽

2021 ◽

Vol 22 (1) ◽

pp. 552

Author(s):

Shapol M. Mohammed ◽

Karwan Jacksi ◽

Subhi R. M. Zeebaree

Keyword(s):

Semantic Similarity ◽

State Of The Art ◽

Clustering Algorithms ◽

Document Clustering ◽

Accuracy Evaluation ◽

Similar Data ◽

Document Similarity ◽

Density Based Clustering ◽

Data Points ◽

The Common

Semantic similarity is the process of identifying data that are semantically relevant. The traditional way of identifying document similarity relies on synonymous keywords and syntactic matching. In contrast, semantic similarity finds similar data using the meaning of words and their semantics. Clustering groups objects that share the same features and properties into a cluster, separate from objects that have different features and properties. In semantic document clustering, documents are clustered using semantic similarity techniques with similarity measurements. One of the common techniques for clustering documents is density-based clustering, which uses the density of data points as the main strategy for measuring the similarity between them. In this paper, a state-of-the-art survey is presented to analyze the density-based algorithms for clustering documents. Furthermore, the similarity and evaluation measures used with the selected algorithms are investigated to identify the common ones. The review revealed that the most used density-based algorithms in document clustering are DBSCAN and DPC. The most effective similarity measure used with density-based algorithms, specifically DBSCAN and DPC, is cosine similarity, with F-measure for performance and accuracy evaluation.
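
The combination the survey singles out, DBSCAN with cosine similarity, can be sketched in plain Python as follows. This is an illustrative sketch only: raw word counts stand in for GloVe embeddings (an assumption made to keep the example self-contained), DPC is not shown, and the documents, `eps` and `min_pts` values are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def dbscan(vectors, eps, min_pts):
    """DBSCAN using cosine distance (1 - similarity); label -1 marks noise."""
    n = len(vectors)

    def neighbors(i):
        return [j for j in range(n)
                if 1.0 - cosine_similarity(vectors[i], vectors[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not a core point; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


# Hypothetical toy corpus: two topics plus one unrelated document.
docs = [
    "cluster data points density",
    "density cluster data",
    "data points cluster density",
    "football match goal score",
    "goal score football",
    "quantum",
]
doc_vectors = [Counter(d.split()) for d in docs]
doc_labels = dbscan(doc_vectors, eps=0.3, min_pts=2)
```

With real embeddings, each document would be a dense vector (for example an average of word vectors) and only `cosine_similarity` would need to change.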

Semantic Integration in Big Data: State-of-the-Art

Journal of Mobile Multimedia ◽

10.13052/jmm1550-4646.1533 ◽

2020 ◽

Author(s):

Zaoui Sayah ◽

Okba Kazar ◽

Ahmed Ghenabzia

Keyword(s):

Health Care ◽

Big Data ◽

State Of The Art ◽

Aviation Safety ◽

Semantic Integration ◽

Decision Makers ◽

Considerable Time ◽

Comprehensive Overview ◽

Information Providers ◽

The Web

Nowadays, web users and systems continually overload the web with an exponentially growing amount of data. This makes big data more important in several domains such as social networks, the Internet of Things, health care, e-commerce, and aviation safety. The use of big data has become increasingly crucial for companies due to the significant evolution of information providers and users on the web. However, big data remain meaningless without semantics. To better understand big data, we ask how big data and semantics are related to each other and how semantics may help. To address this problem, researchers devote considerable time to integrating ontologies into big data to ensure reliable interoperability between systems and to make big data more useful, readable and exploitable. This technology can hide the heterogeneity of different data resources. Moreover, in given domains, users can exchange knowledge without having to choose the suitable semantics that make their content more expressive. This paper aims to provide a comprehensive overview of big data and the appropriate tools to manipulate and analyse them, such as Hadoop. Afterwards, we discuss ontology and how it can be used to improve big data management and analysis for decision makers. Finally, different semantic integration approaches are compared. The survey concludes with a discussion and some perspectives.

Learning with Adaptive Neighbors for Image Clustering

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/344 ◽

2018 ◽

Cited By ~ 2

Author(s):

Yang Liu ◽

Quanxue Gao ◽

Zhaohua Yang ◽

Shujian Wang

Keyword(s):

State Of The Art ◽

Clustering Algorithms ◽

Original Data ◽

Image Clustering ◽

Complex Structures ◽

Clustering Methods ◽

Proposed Model ◽

Data Graph ◽

The Given ◽

Optimal Graph

Due to the importance and efficiency of learning complex structures hidden in data, graph-based methods have been widely studied and have proven successful in unsupervised learning. Generally, most existing graph-based clustering methods require post-processing on the original data graph to extract the clustering indicators. However, these methods have two drawbacks: (1) the cluster structures are not explicit in the clustering results; (2) the final clustering performance is sensitive to the construction of the original data graph. To solve these problems, this paper proposes a novel learning model that learns a graph based on the given data graph such that the newly obtained optimal graph is more suitable for the clustering task. We also propose an efficient algorithm to solve the model. Extensive experimental results illustrate that the proposed model outperforms other state-of-the-art clustering algorithms.

The Fundamental Clustering and Projection Suite (FCPS): A Dataset Collection to Test the Performance of Clustering and Data Projection Algorithms

Data ◽

10.3390/data5010013 ◽

2020 ◽

Vol 5 (1) ◽

pp. 13 ◽

Cited By ~ 1

Author(s):

Alfred Ultsch ◽

Jörn Lötsch

Keyword(s):

Data Science ◽

Clustering Algorithms ◽

Cluster Structure ◽

Projection Methods ◽

Analysis Method ◽

Clustering Methods ◽

Projection Algorithms ◽

Science Data ◽

Data Density ◽

Data Points

In the context of data science, data projection and clustering are common procedures. The chosen analysis method is crucial to avoid faulty pattern recognition. It is therefore necessary to know the properties and especially the limitations of projection and clustering algorithms. This report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS). The FCPS contains 10 datasets with the names “Atom”, “Chainlink”, “EngyTime”, “Golfball”, “Hepta”, “Lsun”, “Target”, “Tetra”, “TwoDiamonds”, and “WingNut”. Common clustering methods occasionally identified non-existent clusters or assigned data points to the wrong clusters in the FCPS suite. Likewise, common data projection methods could only partially reproduce the data structure correctly on a two-dimensional plane. In conclusion, the FCPS dataset collection addresses general challenges for clustering and projection algorithms such as lack of linear separability, different or small inner class spacing, classes defined by data density rather than data spacing, no cluster structure at all, outliers, or classes that are in contact. It is designed to address specific problems of structure discovery in high-dimensional spaces.

Metagenome sequence clustering with hash-based canopies

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720017400066 ◽

2017 ◽

Vol 15 (06) ◽

pp. 1740006 ◽

Cited By ~ 6

Author(s):

Mohammad Arifur Rahman ◽

Nathan LaPierre ◽

Huzefa Rangwala ◽

Daniel Barbara

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

State Of The Art ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Operational Taxonomic Units ◽

Sequence Clustering ◽

Scalable Clustering ◽

Metagenome Sequence

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume and random short sequences (reads) obtained from community sequences, analyzing the diversity, abundance and functions of different organisms within these communities is a challenging task. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTU) and observe significant speedup with regard to run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.
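
The idea of partitioning reads into canopies by hashing can be sketched as below. The abstract does not name the three hashing schemes the authors compare, so the MinHash-over-k-mers key used here is an assumption chosen for illustration, and all function names, parameters and reads are hypothetical. The refinement step, in which a more expensive clustering algorithm is run inside each canopy, is not shown.

```python
import hashlib
from collections import defaultdict


def kmers(read, k):
    """Set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def stable_hash(text, seed):
    """Deterministic hash (the built-in hash() of str varies between runs)."""
    digest = hashlib.md5(f"{seed}:{text}".encode()).hexdigest()
    return int(digest, 16)


def canopy_key(read, k=4, num_hashes=2):
    """MinHash signature: the smallest k-mer hash under each of num_hashes seeds."""
    grams = kmers(read, k) or {read}  # reads shorter than k hash as a whole
    return tuple(min(stable_hash(g, seed) for g in grams)
                 for seed in range(num_hashes))


def build_canopies(reads, k=4, num_hashes=2):
    """Group reads that share the same signature into one canopy."""
    canopies = defaultdict(list)
    for read in reads:
        canopies[canopy_key(read, k, num_hashes)].append(read)
    return list(canopies.values())


# Hypothetical reads: the first two are identical, so they must share a canopy.
reads = ["ACGTACGTAC", "ACGTACGTAC", "TTGGCCAATT", "GGGGCCCCAA"]
canopies = build_canopies(reads)
```

Each read is hashed once, so building the canopies takes a single pass over the data; the expensive pairwise comparisons are then confined to reads within the same canopy. Using more hash functions makes the key stricter and the canopies smaller.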

Hyperspectral Image Clustering with Spatially-Regularized Ultrametrics

Remote Sensing ◽

10.3390/rs13050955 ◽

2021 ◽

Vol 13 (5) ◽

pp. 955

Author(s):

Shukun Zhang ◽

James M. Murphy

Keyword(s):

Spectral Clustering ◽

Hyperspectral Image ◽

State Of The Art ◽

Image Clustering ◽

Clustering Methods ◽

Performance Guarantees ◽

Data Density ◽

Spatial Geometry ◽

Data Points ◽

Almost All

We propose a method for the unsupervised clustering of hyperspectral images based on spatially regularized spectral clustering with ultrametric path distances. The proposed method efficiently combines data density and spectral-spatial geometry to distinguish between material classes in the data, without the need for training labels. The proposed method is efficient, with quasilinear scaling in the number of data points, and enjoys robust theoretical performance guarantees. Extensive experiments on synthetic and real HSI data demonstrate its strong performance compared to benchmark and state-of-the-art methods. Indeed, the proposed method not only achieves excellent labeling accuracy, but also efficiently estimates the number of clusters. Thus, unlike almost all existing hyperspectral clustering methods, the proposed algorithm is essentially parameter-free.
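
The sketch below illustrates one common ultrametric path distance, the minimax (longest-leg) distance, in which the distance between two points is the smallest achievable value of the largest single step along any connecting path. Treating this as the distance meant in the abstract is an assumption, and the cubic-time computation here is for illustration only; it is not the quasilinear method the authors describe, and it omits the spatial regularization.

```python
import math


def minimax_path_distances(points):
    """All-pairs minimax path distance via a Floyd-Warshall style update."""
    n = len(points)
    # Start from plain Euclidean distances between every pair of points.
    d = [[math.dist(p, q) for q in points] for p in points]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # A path through k costs its longest leg; keep it if that is smaller.
                via_k = max(d[i][k], d[k][j])
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d


# Hypothetical 1-D data: a dense chain 0-1-2 and a distant point at 10.
chain = [(0.0,), (1.0,), (2.0,), (10.0,)]
dist = minimax_path_distances(chain)
```

Points linked by a chain of short steps end up close under this distance even when they are far apart in Euclidean terms (here `dist[0][2]` is 1.0 rather than 2.0), which is how such a distance can reflect data density.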

Multiple Partitions Aligned Clustering

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/375 ◽

2019 ◽

Cited By ~ 1

Author(s):

Zhao Kang ◽

Zipeng Guo ◽

Shudong Huang ◽

Siying Wang ◽

Wenyu Chen ◽

...

Keyword(s):

State Of The Art ◽

Cluster Structure ◽

Consensus Clustering ◽

Consensus Cluster ◽

Clustering Methods ◽

Unified Framework ◽

Significant Information ◽

Heterogeneous Information ◽

Data Points ◽

Indicator Matrix

Multi-view clustering is an important yet challenging task due to the difficulty of integrating the information from multiple representations. Most existing multi-view clustering methods explore the heterogeneous information in the space where the data points lie. Such common practice may cause significant information loss because of unavoidable noise or inconsistency among views. Since different views admit the same cluster structure, the natural space should be all partitions. Orthogonal to existing techniques, in this paper, we propose to leverage the multi-view information by fusing partitions. Specifically, we align each partition to form a consensus cluster indicator matrix through a distinct rotation matrix. Moreover, a weight is assigned for each view to account for the clustering capacity differences of views. Finally, the basic partitions, weights, and consensus clustering are jointly learned in a unified framework. We demonstrate the effectiveness of our approach on several real datasets, where significant improvement is found over other state-of-the-art multi-view clustering methods.

Semantic Unsupervised Automatic Keyphrases Extraction by Integrating Word Embedding with Clustering Methods

Multimodal Technologies and Interaction ◽

10.3390/mti4020030 ◽

2020 ◽

Vol 4 (2) ◽

pp. 30 ◽

Cited By ~ 1

Author(s):

Isabella Gagliardi ◽

Maria Teresa Artese

Keyword(s):

Clustering Algorithms ◽

Semantic Content ◽

Word Embedding ◽

Keyword Extraction ◽

Clustering Methods ◽

Novel Method ◽

The Web

Increasingly, the web produces massive volumes of texts, alone or associated with images, videos, photographs, together with some metadata, indispensable for their finding and retrieval. Keywords/keyphrases that characterize the semantic content of documents should be, automatically or manually, extracted, and/or associated with them. The paper presents a novel method to address the problem of the automatic unsupervised extraction of keywords/phrases from texts, expressed both in English and in Italian. The main feature of this approach is the integration of two methods that have given interesting results: word embedding models, such as Word2Vec or GloVe, which are able to capture the semantics of words and their context, and clustering algorithms, which are able to identify the essence of the terms and choose the more significant one(s) to represent the contents of a text. In the paper, the datasets used are presented, together with the method implemented and the results obtained. These results are discussed and compared with those obtained in previous experiments using TextRank, Rapid Automatic Keyword Extraction (RAKE), and TF-IDF.

The Modeling and Simulation of Data Clustering Algorithms in Data Mining with Big Data

Journal of Industrial Integration and Management ◽

10.1142/s2424862218500173 ◽

2019 ◽

Vol 04 (01) ◽

pp. 1850017 ◽

Cited By ~ 3

Author(s):

Weiru Chen ◽

Jared Oliverio ◽

Jin Ho Kim ◽

Jiayue Shen

Keyword(s):

Data Mining ◽

Big Data ◽

Data Reduction ◽

Data Clustering ◽

Clustering Algorithms ◽

High Volume ◽

Clustering Methods ◽

Data Set ◽

Processing Methods ◽

Integration Data

Big Data is a popular cutting-edge technology nowadays. Techniques and algorithms are expanding in different areas including engineering, biomedicine, and business. Due to the high volume and complexity of Big Data, it is necessary to apply data pre-processing methods when mining data. The pre-processing methods include data cleaning, data integration, data reduction, and data transformation. Data clustering is the most important step of data reduction. With data clustering, mining on the reduced data set should be more efficient yet produce quality analytical results. This paper presents the different data clustering methods and related algorithms for data mining with Big Data. Data clustering can increase the efficiency and accuracy of data mining.
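
As a rough illustration of clustering used for data reduction, the sketch below runs a basic k-means and keeps only the centroids and the number of points each one represents. The abstract does not say which algorithms the paper covers in detail, so k-means is an assumed example, and the function name, initialization rule and data are hypothetical.

```python
import math


def kmeans_reduce(points, k, iterations=20):
    """Reduce points to k centroids plus the count of points each one stands for."""
    centroids = [tuple(p) for p in points[:k]]  # deterministic init: first k points
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c, members in enumerate(groups):
            if members:
                dim = len(members[0])
                updated.append(tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(dim)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, [len(g) for g in groups]


# Hypothetical data: six points collapse to two weighted representatives.
raw = [(0, 0), (10, 10), (0, 2), (10, 12), (2, 0), (12, 10)]
centers, weights = kmeans_reduce(raw, k=2)
```

Downstream mining can then operate on the weighted centroids instead of the full data set, trading some detail for a much smaller input.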
