IDCUP Algorithm to Classifying Arbitrary Shapes and Densities for Center-based Clustering Performance Analysis

Interdisciplinary Journal of Information Knowledge and Management ◽

10.28945/4541 ◽

2020 ◽

Vol 15 ◽

pp. 091-108

Author(s):

Saud Altaf ◽

Muhammad Waseem Waseem ◽

Laila Kazmi

Keyword(s):

Knowledge Discovery ◽

Arbitrary Shape ◽

Hybrid Approach ◽

Large Data ◽

Data Sets ◽

Complex Data ◽

Discovery Process ◽

Clustering Methods ◽

Density Based Clustering ◽

Arbitrary Shapes

Aim/Purpose: Clustering techniques are normally used to determine the significant and meaningful subclasses in datasets. Clustering is an unsupervised type of Machine Learning (ML) where the objective is to form groups of objects based on their similarity, and it is used to determine the implicit relationships between the different features of the data. Cluster analysis is considered a significant problem area in data exploration when dealing with arbitrary-shape problems in different datasets. Clustering on large data sets has the following challenges: (1) clusters with arbitrary shapes; (2) a limited knowledge discovery process for deciding the possible input features; (3) scalability for large data sizes. Density-based clustering is known as a dominant method for determining arbitrary-shape clusters.

Background: Existing density-based clustering methods commonly cited in the literature have been examined in terms of their behavior with data sets that contain nested clusters of varying density. The existing methods are not ideal for such data sets, because they typically partition the data into clusters that cannot be nested.

Methodology: A density-based approach on traditional center-based clustering is introduced that assigns a weight to each cluster. The weights are then used in calculating the distances from data vectors to centroids by multiplying the distance by the centroid weight.

Contribution: In this paper, we have examined different density-based clustering methods for data sets with nested clusters of varying density. Two such data sets were used to evaluate some of the commonly cited algorithms found in the literature. Nested clusters were found to be challenging for the existing algorithms. In most cases, the targeted algorithms either did not detect the largest clusters or simply divided large clusters into non-overlapping regions. However, it may be possible to detect all clusters by doing multiple runs of the algorithm with different inputs and then combining the results. This work considered three challenges of clustering methods.

Findings: As a result, a centroid with a low weight will attract objects from further away than a centroid with a higher weight. This allows dense clusters inside larger clusters to be recognized. The methods are tested experimentally using the K-means, DBSCAN, TURN*, and IDCUP algorithms. The experimental results with different data sets showed that IDCUP is more robust and produces better clusters than DBSCAN, TURN*, and K-means. Finally, we compare K-means, DBSCAN, TURN*, and IDCUP in dealing with arbitrary-shape problems on different datasets. IDCUP shows better scalability compared to TURN*.

Future Research: As future recommendations of this research, we are concerned with exploring further challenges of the knowledge discovery process in clustering, along with complex data sets, given more time. A hybrid approach based on density-based and model-based clustering algorithms needs to be compared to achieve maximum performance accuracy and avoid the problems related to arbitrary shapes, including optimization. It is anticipated that such a future process will attain improved performance with comparable precision in identifying cluster shapes.
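The weighted distance described in the Methodology and Findings can be illustrated with a small sketch. This is not the IDCUP algorithm itself, only the assignment step the abstract describes (distance multiplied by the centroid weight); the function name `assign_weighted`, the toy points, and the weight values are assumptions made for illustration.

```python
import math

def assign_weighted(points, centroids, weights):
    """Assign each point to the centroid minimizing weight * Euclidean distance."""
    labels = []
    for p in points:
        # Scale each distance by the weight of the centroid it is measured to.
        scores = [w * math.dist(p, c) for c, w in zip(centroids, weights)]
        labels.append(scores.index(min(scores)))
    return labels

# Hypothetical toy data: a broad cluster centered at (0, 0) and a dense
# cluster nested inside it at (2, 0).
centroids = [(0.0, 0.0), (2.0, 0.0)]
weights = [1.0, 3.0]  # low weight reaches far, high weight stays local
points = [(1.9, 0.1), (1.2, 0.0), (-3.0, 0.0), (2.6, 0.0)]

labels = assign_weighted(points, centroids, weights)
unweighted = assign_weighted(points, centroids, [1.0, 1.0])
print(labels, unweighted)
```

With these assumed weights, the point (1.2, 0.0) is assigned to the low-weight centroid even though the other centroid is geometrically closer, which is the effect the Findings describe: the low-weight centroid attracts objects from further away, while the high-weight centroid keeps only its immediate dense neighborhood.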

Knowledge Discovery in Large Data Sets: A Primer for Data Mining Applications in Health Care

Health Informatics - Nursing Informatics ◽

10.1007/978-1-4757-3252-8_10 ◽

2000 ◽

pp. 139-148 ◽

Cited By ~ 2

Author(s):

Patricia A. Abbott

Keyword(s):

Data Mining ◽

Health Care ◽

Knowledge Discovery ◽

Large Data ◽

Large Data Sets ◽

Data Sets

clusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences

10.1101/2021.02.22.432291 ◽

2021 ◽

Author(s):

Sebastiaan Valkiers ◽

Max Van Houcke ◽

Kris Laukens ◽

Pieter Meysman

Keyword(s):

T Cell ◽

Large Data ◽

Cell Receptor ◽

Amino Acid Sequences ◽

Large Data Sets ◽

Data Sets ◽

Clustering Methods ◽

Link Type ◽

Large Sets ◽

Similar Accuracy

The T-cell receptor (TCR) determines the specificity of a T-cell towards an epitope. As of yet, the rules for antigen recognition remain largely undetermined. Current methods for grouping TCRs according to their epitope specificity remain limited in performance and scalability. Multiple methodologies have been developed, but all of them fail to efficiently cluster large data sets exceeding 1 million sequences. To account for this limitation, we developed clusTCR, a rapid TCR clustering alternative that efficiently scales up to millions of CDR3 amino acid sequences. Benchmarking comparisons revealed similar accuracy of clusTCR with other TCR clustering methods. clusTCR offers a drastic improvement in clustering speed, which allows clustering of millions of TCR sequences in just a few minutes through efficient similarity searching and sequence hashing. clusTCR was written in Python 3. It is available as an Anaconda package (https://anaconda.org/svalkiers/clustcr) and on GitHub (https://github.com/svalkiers/clusTCR).
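To give a sense of how hashing can speed up similarity searching over sequences, here is a generic sketch that finds all pairs of equal-length sequences differing at exactly one position without comparing every pair. It does not use the clusTCR package and is not claimed to be its implementation; the function name `hamming1_pairs` and the example sequences are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def hamming1_pairs(seqs):
    """Return pairs of equal-length sequences that differ at exactly one position."""
    buckets = defaultdict(set)
    for s in set(seqs):
        for i in range(len(s)):
            # Key: length, masked position, and the sequence with that position removed.
            # Distinct sequences sharing a key can differ only at position i.
            buckets[(len(s), i, s[:i] + s[i + 1:])].add(s)
    pairs = set()
    for group in buckets.values():
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs

# Hypothetical CDR3-like amino acid strings.
seqs = ["CASSLGQYF", "CASSLGEYF", "CASSPGQYF", "CATTTTTTF"]
pairs = hamming1_pairs(seqs)
print(sorted(pairs))
```

Each sequence of length n is hashed n times, so the work grows roughly linearly with the number of sequences rather than quadratically, which is why approaches of this kind are attractive for very large sets.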

On the Generation of Point Cloud Data Sets: Step One in the Knowledge Discovery Process

Interactive Knowledge Discovery and Data Mining in Biomedical Informatics - Lecture Notes in Computer Science ◽

10.1007/978-3-662-43968-5_4 ◽

2014 ◽

pp. 57-80 ◽

Cited By ~ 8

Author(s):

Andreas Holzinger ◽

Bernd Malle ◽

Marcus Bloice ◽

Marco Wiltgen ◽

Massimo Ferri ◽

...

Keyword(s):

Knowledge Discovery ◽

Point Cloud ◽

Data Sets ◽

Point Cloud Data ◽

Discovery Process ◽

Cloud Data

Moving Window Two-Dimensional Correlation Spectroscopy and Determination of Signal-To-Noise Threshold in Correlation Spectra

Applied Spectroscopy ◽

10.1366/000370203322258977 ◽

2003 ◽

Vol 57 (8) ◽

pp. 996-1006 ◽

Cited By ~ 15

Author(s):

Slobodan Šašić ◽

Yukihiro Ozaki

Keyword(s):

Large Data ◽

Correlation Spectroscopy ◽

Data Matrix ◽

Data Sets ◽

Complex Data ◽

Moving Window ◽

Two Dimensional ◽

Complex Data Sets ◽

Coexisting Species ◽

Definition Of

In this paper we report two new developments in two-dimensional (2D) correlation spectroscopy; one is the combination of the moving window concept with 2D spectroscopy to facilitate the analysis of complex data sets, and the other is the definition of the noise level in synchronous/asynchronous maps. A graphical criterion for the latter is also proposed. The combination of the moving window concept with correlation spectra allows one to split a large data matrix into smaller and simpler subsets and to analyze them instead of computing overall correlation. A three-component system that mimics a consecutive chemical reaction is used as a model for the illustration of the two ideas. Both types of correlation matrices, variable–variable and sample–sample, are analyzed, and a very good agreement between the two is met. The proposed innovations enable one to comprehend the complexity of the data to be analyzed by 2D spectroscopy and thus to avoid the risks of over-interpretation, liable to occur whenever improper caution about the number of coexisting species in the system is taken.

AN ASSESSMENT OF A METRIC SPACE DATABASE INDEX TO SUPPORT SEQUENCE HOMOLOGY

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213005002430 ◽

2005 ◽

Vol 14 (05) ◽

pp. 867-885 ◽

Cited By ~ 9

Author(s):

RUI MAO ◽

WEIJIA XU ◽

NEHA SINGH ◽

DANIEL P. MIRANKER

Keyword(s):

Metric Space ◽

Sequence Data ◽

Large Data ◽

Peptide Sequence ◽

Data Sets ◽

Clustering Methods ◽

Storage And Retrieval ◽

Database Index ◽

Bulk Load ◽

Scalable Database

Hierarchical metric-space clustering methods have been commonly used to organize proteomes into taxonomies. Consequently, it is often anticipated that hierarchical clustering can be leveraged as a basis for scalable database index structures capable of managing the hyper-exponential growth of sequence data. M-tree is one such data structure specialized for the management of large data sets on disk. We explore the application of M-trees to the storage and retrieval of peptide sequence data. Exploiting a technique first suggested by Myers, we organize the database as records of fixed-length substrings. Empirical results are promising. However, metric-space indexes are subject to "the curse of dimensionality", and the ultimate performance of an index is sensitive to the quality of its initial construction. We introduce a new hierarchical bulk-load algorithm that alternates between top-down and bottom-up clustering to initialize the index. Using the Yeast Proteomes, the bi-directional bulk load produces a more effective index than the existing M-tree initialization algorithms.

A FAST IMPLEMENTATION OF THE ISODATA CLUSTERING ALGORITHM

International Journal of Computational Geometry & Applications ◽

10.1142/s0218195907002252 ◽

2007 ◽

Vol 17 (01) ◽

pp. 71-103 ◽

Cited By ~ 93

Author(s):

NARGESS MEMARSADEGHI ◽

DAVID M. MOUNT ◽

NATHAN S. NETANYAHU ◽

JACQUELINE LE MOIGNE

Keyword(s):

Clustering Algorithm ◽

Empirical Studies ◽

Synthetic Data ◽

Large Data ◽

Large Data Sets ◽

Cluster Center ◽

Data Sets ◽

Clustering Methods ◽

Sensing Applications ◽

Remote Sensing Applications

Clustering is central to many image processing and remote sensing applications. ISODATA is one of the most popular and widely used clustering methods in geoscience applications, but it can run slowly, particularly with large data sets. We present a more efficient approach to ISODATA clustering, which achieves better running times by storing the points in a kd-tree and through a modification of the way in which the algorithm estimates the dispersion of each cluster. We also present an approximate version of the algorithm which allows the user to further improve the running time, at the expense of lower fidelity in computing the nearest cluster center to each point. We provide both theoretical and empirical justification that our modified approach produces clusterings that are very similar to those produced by the standard ISODATA approach. We also provide empirical studies on both synthetic data and remotely sensed Landsat and MODIS images that show that our approach has significantly lower running times.

DESCRY: A Density Based Clustering Algorithm for Very Large Data Sets

Lecture Notes in Computer Science - Intelligent Data Engineering and Automated Learning – IDEAL 2004 ◽

10.1007/978-3-540-28651-6_30 ◽

2004 ◽

pp. 203-210 ◽

Cited By ~ 8

Author(s):

Fabrizio Angiulli ◽

Clara Pizzuti ◽

Massimo Ruffolo

Keyword(s):

Clustering Algorithm ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Density Based Clustering

A hybrid approach for training recurrent neural networks: application to multi-step-ahead prediction of noisy and large data sets

Neural Computing and Applications ◽

10.1007/s00521-007-0116-8 ◽

2007 ◽

Vol 17 (3) ◽

pp. 245-254 ◽

Cited By ~ 9

Author(s):

S. Chtourou ◽

M. Chtourou ◽

O. Hammami

Keyword(s):

Neural Networks ◽

Recurrent Neural Networks ◽

Hybrid Approach ◽

Large Data ◽

Large Data Sets ◽

Data Sets

List-columns in data.table: Nesting and unnesting data tables and vectors

10.31234/osf.io/u8ekc ◽

2019 ◽

Author(s):

Tyson S. Barrett

Keyword(s):

Large Data ◽

The United States ◽

Bench Mark ◽

Data Sets ◽

Complex Data ◽

Basketball Players ◽

Text Data ◽

Data Set ◽

Data Tables ◽

Student Information

The use of list-columns in data frames and tibbles in the R statistical environment is well documented (e.g. Bryan, 2018), providing a cognitively efficient way to organize results of complex data (e.g. several statistical models, groupings of text, data summaries, or even graphics) with corresponding data. For example, one can store student information within classrooms, player information within teams, or even analyses within groups. This allows the data to be of variable sizes without overly complicating or adding redundancies to the structure of the data. In turn, this can improve the reliability to appropriately analyze the data. Because of its efficiency and speed, being able to use data.table to work with list-columns would be beneficial in many data contexts (e.g. to reduce memory usage in large data sets). Herein, I demonstrate how one can create list-columns in a data table using the by argument in data.table and purrr::map(). This is done using an example data set containing information on professional basketball players in the United States. I compare the behavior of the data.table approach to the dplyr::group_nest() function, one of the several powerful tidyverse nesting functions. Results using bench::mark() show the speed and efficiency of using data.table to work with list-columns.

A Semiautomated Framework for Integrating Expert Knowledge into Disease Marker Identification

Disease Markers ◽

10.1155/2013/613529 ◽

2013 ◽

Vol 35 ◽

pp. 513-523 ◽

Cited By ~ 3

Author(s):

Jing Wang ◽

Bobbie-Jo M. Webb-Robertson ◽

Melissa M. Matzke ◽

Susan M. Varnum ◽

Joseph N. Brown ◽

...

Keyword(s):

Biomarker Discovery ◽

Expert Knowledge ◽

Large Data ◽

Large Data Sets ◽

Biological Information ◽

Data Sets ◽

Complex Data ◽

Selection Scheme ◽

Biomarker Identification ◽

Biomarker Selection

Background. The availability of large complex data sets generated by high throughput technologies has enabled the recent proliferation of disease biomarker studies. However, a recurring problem in deriving biological information from large data sets is how to best incorporate expert knowledge into the biomarker selection process. Objective. To develop a generalizable framework that can incorporate expert knowledge into data-driven processes in a semiautomated way while providing a metric for optimization in a biomarker selection scheme. Methods. The framework was implemented as a pipeline consisting of five components for the identification of signatures from integrated clustering (ISIC). Expert knowledge was integrated into the biomarker identification process using the combination of two distinct approaches: a distance-based clustering approach and an expert knowledge-driven functional selection. Results. The utility of the developed framework ISIC was demonstrated on proteomics data from a study of chronic obstructive pulmonary disease (COPD). Biomarker candidates were identified in a mouse model using ISIC and validated in a study of a human cohort. Conclusions. Expert knowledge can be introduced into a biomarker discovery process in different ways to enhance the robustness of selected marker candidates. Developing strategies for extracting orthogonal and robust features from large data sets increases the chances of success in biomarker identification.
