Low-complexity fuzzy relational clustering algorithms for Web mining

R. Krishnapuram; A. Joshi; O. Nasraoui; L. Yi

doi:10.1109/91.940971

A Fuzzy Graph Framework for Initializing k-Means

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213016500317 ◽

2016 ◽

Vol 25 (06) ◽

pp. 1650031 ◽

Cited By ~ 4

Author(s):

Georgios Drakopoulos ◽

Panagiotis Gourgaris ◽

Andreas Kanavos ◽

Christos Makris

Keyword(s):

Web Mining ◽

Clustering Algorithms ◽

Low Complexity ◽

Design Parameters ◽

Fuzzy Graph ◽

Community Discovery ◽

Number Of Clusters ◽

Overlapping Communities ◽

Underlying Space ◽

True Number

k-Means is among the most significant clustering algorithms for vectors chosen from an underlying space S. Its applications span a broad range of fields including machine learning, image and signal processing, and Web mining. Since the introduction of k-Means, two of its major design parameters remain open to research. The first is the number of clusters to be formed and the second is the initial vectors. The latter is also inherently related to selecting a density measure for S. This article presents a two-step framework for estimating both parameters. First, the underlying vector space is represented as a fuzzy graph. Afterwards, two algorithms for partitioning a fuzzy graph to non-overlapping communities, namely Fuzzy Walktrap and Fuzzy Newman-Girvan, are executed. The former is a low complexity evolving heuristic, whereas the latter is deterministic and combines a graph communication metric with an exhaustive search principle. Once communities are discovered, their number is taken as an estimate of the true number of clusters. The initial centroids or seeds are subsequently selected based on the density of S. The proposed framework is modular, allowing thus more initialization schemes to be derived. The secondary contributions of this article are HI, a similarity metric for vectors with numerical and categorical entries and the assessment of its stochastic behavior, and TD, a metric for assessing cluster confusion. The aforementioned framework has been implemented mainly in C# and partially in C++ and its performance in terms of efficiency, accuracy, and cluster confusion was experimentally assessed. Post-processing results conducted with MATLAB indicate that the evolving community discovery algorithm approaches the performance of its deterministic counterpart with considerably less complexity.

Download Full-text

A New Similarity Metric for Sequential Data

International Journal of Data Warehousing and Mining ◽

10.4018/jdwm.2010100102 ◽

2010 ◽

Vol 6 (4) ◽

pp. 16-32 ◽

Cited By ~ 11

Author(s):

Pradeep Kumar ◽

Bapi S. Raju ◽

P. Radha Krishna

Keyword(s):

Data Mining ◽

Similarity Measure ◽

Web Mining ◽

Clustering Algorithms ◽

Sequential Data ◽

Similarity Metric ◽

Benchmark Datasets ◽

Similarity Preserving ◽

Sequential Nature ◽

Classification And Clustering

In many data mining applications, both classification and clustering algorithms require a distance/similarity measure. The central problem in similarity based clustering/classification comprising sequential data is deciding an appropriate similarity metric. The existing metrics like Euclidean, Jaccard, Cosine, and so forth do not exploit the sequential nature of data explicitly. In this paper, the authors propose a similarity preserving function called Sequence and Set Similarity Measure (S3M) that captures both the order of occurrence of items in sequences and the constituent items of sequences. The authors demonstrate the usefulness of the proposed measure for classification and clustering tasks. Experiments were conducted on benchmark datasets, that is, DARPA’98 and msnbc, for classification task in intrusion detection and clustering task in web mining domains. Results show the usefulness of the proposed measure.

Download Full-text

A Relative Performance of Dissimilarity Measures for Matching Relational Web Access Patterns Between User Sessions

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Handbook of Research on Pattern Engineering System Development for Big Data Analytics ◽

10.4018/978-1-5225-3870-7.ch010 ◽

2018 ◽

pp. 153-176

Author(s):

Dilip Singh Sisodia

Keyword(s):

Clustering Algorithms ◽

Similarity Measures ◽

Comparative Performance ◽

Dissimilarity Measures ◽

User Session ◽

Relational Clustering ◽

Cluster Distance ◽

Web Access ◽

Access Patterns ◽

Cluster Quality

Customized web services are offered to users by grouping them according to their access patterns. Clustering techniques are very useful in grouping users and analyzing web access patterns. Clustering can be an object clustering performed on feature vectors or relational clustering performed on relational data. The relational clustering is preferred over object clustering for web users' sessions because of high dimensionality and sparsity of web users' data. However, relational clustering of web users depends on underlying dissimilarity measures used. Therefore, correct dissimilarity measure for matching relational web access patterns between user sessions is very important. In this chapter, the various dissimilarity measures used in relational clustering of web users' data are discussed. The concept of an augmented user session is also discussed to derive different augmented session dissimilarity measures. The discussed session dissimilarity measures are used with relational fuzzy clustering algorithms. The comparative performance binary session similarity and augmented session similarity measures are evaluated using intra-cluster and inter-cluster distance-based cluster quality ratio. The results suggested the augmented session dissimilarity measures in general, and intuitive augmented session (dis)similarity measure, in particular, performed better than the other measures.

Download Full-text

A Hybrid Algorithm for Clustering of Time Series Data Based on Affinity Search Technique

The Scientific World JOURNAL ◽

10.1155/2014/562194 ◽

2014 ◽

Vol 2014 ◽

pp. 1-12 ◽

Cited By ~ 19

Author(s):

Saeed Aghabozorgi ◽

Teh Ying Wah ◽

Tutut Herawan ◽

Hamid A. Jalab ◽

Mohammad Amin Shaygan ◽

...

Keyword(s):

Time Series ◽

Clustering Algorithm ◽

Time Series Data ◽

Clustering Algorithms ◽

Medical Science ◽

Low Complexity ◽

Series Data ◽

Hybrid Approaches ◽

Static Data ◽

Important Solution

Time series clustering is an important solution to various problems in numerous fields of research, including business, medical science, and finance. However, conventional clustering algorithms are not practical for time series data because they are essentially designed for static data. This impracticality results in poor clustering accuracy in several systems. In this paper, a new hybrid clustering algorithm is proposed based on the similarity in shape of time series data. Time series data are first grouped as subclusters based on similarity in time. The subclusters are then merged using thek-Medoids algorithm based on similarity in shape. This model has two contributions: (1) it is more accurate than other conventional and hybrid approaches and (2) it determines the similarity in shape among time series data with a low complexity. To evaluate the accuracy of the proposed model, the model is tested extensively using syntactic and real-world time series datasets.

Download Full-text

The Use of Clustering Algorithms Ensemble with Variable Distance Metrics in Solving Problems of Web Mining

2017 5th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW) ◽

10.1109/ficloudw.2017.82 ◽

2017 ◽

Cited By ~ 2

Author(s):

Pyotr V. Bochkaryov ◽

Anna I. Guseva

Keyword(s):

Web Mining ◽

Clustering Algorithms ◽

Distance Metrics ◽

Variable Distance

Download Full-text

New automatic fuzzy relational clustering algorithms using multi-objective NSGA-II

Information Sciences ◽

10.1016/j.ins.2018.03.025 ◽

2018 ◽

Vol 448-449 ◽

pp. 112-133 ◽

Cited By ~ 18

Author(s):

Animesh Kumar Paul ◽

Pintu Chandra Shill

Keyword(s):

Clustering Algorithms ◽

Nsga Ii ◽

Multi Objective ◽

Relational Clustering

Download Full-text

Topographic Mapping of Large Dissimilarity Data Sets

Neural Computation ◽

10.1162/neco_a_00012 ◽

2010 ◽

Vol 22 (9) ◽

pp. 2229-2284 ◽

Cited By ~ 63

Author(s):

Barbara Hammer ◽

Alexander Hasenfuss

Keyword(s):

Linear Time ◽

Clustering Algorithms ◽

Topographic Maps ◽

Data Sets ◽

Self Organizing Map ◽

Clustering Methods ◽

Neighborhood Structure ◽

Proximity Data ◽

Dissimilarity Data ◽

Relational Clustering

Topographic maps such as the self-organizing map (SOM) or neural gas (NG) constitute powerful data mining techniques that allow simultaneously clustering data and inferring their topological structure, such that additional features, for example, browsing, become available. Both methods have been introduced for vectorial data sets; they require a classical feature encoding of information. Often data are available in the form of pairwise distances only, such as arise from a kernel matrix, a graph, or some general dissimilarity measure. In such cases, NG and SOM cannot be applied directly. In this article, we introduce relational topographic maps as an extension of relational clustering algorithms, which offer prototype-based representations of dissimilarity data, to incorporate neighborhood structure. These methods are equivalent to the standard (vectorial) techniques if a Euclidean embedding exists, while preventing the need to explicitly compute such an embedding. Extending these techniques for the general case of non-Euclidean dissimilarities makes possible an interpretation of relational clustering as clustering in pseudo-Euclidean space. We compare the methods to well-known clustering methods for proximity data based on deterministic annealing and discuss how far convergence can be guaranteed in the general case. Relational clustering is quadratic in the number of data points, which makes the algorithms infeasible for huge data sets. We propose an approximate patch version of relational clustering that runs in linear time. The effectiveness of the methods is demonstrated in a number of examples.

Download Full-text

Dataset complexity impacts both MOTU delimitation and biodiversity estimates in eukaryotic 18S rRNA metabarcoding studies

10.1101/2021.06.16.448699 ◽

2021 ◽

Author(s):

Alejandro De Santiago ◽

Tiago José Pereira ◽

Sarah L. Mincks ◽

Holly M. Bik

Keyword(s):

18S Rrna ◽

High Throughput Sequencing ◽

Clustering Algorithms ◽

Alpha Diversity ◽

Low Complexity ◽

Taxonomic Assignment ◽

Sequencing Errors ◽

Operational Taxonomic Units ◽

Eukaryotic Genes ◽

Statistical Trends

How does the evolution of bioinformatics tools impact the biological interpretation of high-throughput sequencing datasets? For eukaryotic metabarcoding studies, in particular, researchers often rely on tools originally developed for the analysis of 16S ribosomal RNA (rRNA) datasets. Such tools do not adequately account for the complexity of eukaryotic genomes, the ubiquity of intragenomic variation in eukaryotic metabarcoding loci, or the differential evolutionary rates observed across eukaryotic genes and taxa. Recently, metabarcoding workflows have shifted away from the use of Operational Taxonomic Units (OTUs) towards delimitation of Amplicon Sequence Variants (ASVs). We assessed how the choice of bioinformatics algorithm impacts the downstream biological conclusions that are drawn from eukaryotic 18S rRNA metabarcoding studies. We focused on four workflows including UCLUST and VSearch algorithms for OTU clustering, and DADA2 and Deblur algorithms for ASV delimitation. We used two 18S rRNA datasets to further evaluate whether dataset complexity had a major impact on the statistical trends and ecological metrics: a "high complexity" (HC) environmental dataset generated from community DNA in Arctic marine sediments, and a "low complexity" (LC) dataset representing individually-barcoded nematodes. Our results indicate that ASV algorithms produce more biologically realistic metabarcoding outputs, with DADA2 being the most consistent and accurate pipeline regardless of dataset complexity. In contrast, OTU clustering algorithms inflate the metabarcoding-derived estimates of biodiversity, consistently returning a high proportion of "rare" Molecular Operational Taxonomic Units (MOTUs) that appear to represent computational artifacts and sequencing errors. However, species-specific MOTUs with high relative abundance are often recovered regardless of the bioinformatics approach. We also found high concordance across pipelines for downstream ecological analysis based on beta-diversity and alpha-diversity comparisons that utilize taxonomic assignment information. Analyses of LC datasets and rare MOTUs are especially sensitive to the choice of algorithms and better software tools may be needed to address these scenarios.

Download Full-text

Web mining with relational clustering

International Journal of Approximate Reasoning ◽

10.1016/s0888-613x(02)00084-1 ◽

2003 ◽

Vol 32 (2-3) ◽

pp. 217-236 ◽

Cited By ~ 52

Author(s):

T.A. Runkler ◽

J.C. Bezdek

Keyword(s):

Web Mining ◽

Relational Clustering

Download Full-text

Modeling for Time Generating Network

Advances in Wireless Technologies and Telecommunication - Big Data Applications in the Telecommunications Industry ◽

10.4018/978-1-5225-1750-4.ch003 ◽

2016 ◽

pp. 31-40

Author(s):

Yirui Hu

Keyword(s):

Semantic Analysis ◽

Fundamental Problem ◽

Clustering Algorithms ◽

Low Complexity ◽

Gaussian Mixture ◽

Probabilistic Latent Semantic Analysis ◽

Traffic Data ◽

Automatic Data ◽

Occurrence Data ◽

Fully Automatic

Modeling co-occurrence data generated by more than one processes in network is a fundamental problem in anomaly detection. Co-occurrence data are joint occurrences of pairs of elementary observations from two sets: traffic data in one set are associated with the generating entities (Time) in the other set. Clustering algorithms are valuable because they can obtain the insights from the varied distribution associated with generating entities. This chapter leverages co-occurrence data that combine traffic data with time, and compares Gaussian probabilistic latent semantic analysis (GPLSA) model to a Gaussian Mixture Model (GMM) using temporal network data. Experimental results support that GPLSA holds better promise in early detection and low false alarm rate with low complexity of implementation in a fully automatic, data-driven solution.

Download Full-text