Shared Nearest Neighbor Clustering in a Locality Sensitive Hashing Framework

Abstract: We present a new algorithm for clustering high-dimensional sequence data and apply it to metagenomics, which aims to reconstruct individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or the shape of clusters. Such problems typically cannot be solved directly with classical approaches that estimate cluster density, e.g., using the shared nearest neighbors rule, because contemporary sequence datasets are prohibitively large. We explore a new method that combines the shared nearest neighbor (SNN) rule with Locality Sensitive Hashing (LSH). The proposed method, called LSH-SNN, randomly splits the input data into smaller subsets (buckets) and applies the shared nearest neighbor rule to each bucket. Links are created between neighbors that share a sufficient number of elements, allowing clusters to be grown from linked elements. LSH-SNN scales to datasets consisting of millions of sequences while achieving high accuracy across a variety of sample sizes and complexities.
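The bucket-then-link idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the data are already numeric feature vectors, uses random-hyperplane hashing as the LSH family, Euclidean distance for the neighbor lists, and union-find to grow clusters. The names `hyperplane_key`, `knn_within`, `lsh_snn` and the parameters `n_planes`, `k`, `min_shared` are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def hyperplane_key(vec, planes):
    """Bucket key: the side of each random hyperplane on which vec falls."""
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)


def knn_within(bucket, data, k):
    """k nearest neighbours of each bucket member, searched inside the bucket only."""
    neighbours = {}
    for i in bucket:
        others = sorted(
            (j for j in bucket if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(data[i], data[j])),
        )
        neighbours[i] = set(others[:k])
    return neighbours


def lsh_snn(data, n_planes=2, k=3, min_shared=2, seed=0):
    """Return one cluster label per item (assumed parameters, illustrative only)."""
    rng = random.Random(seed)
    dim = len(data[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    # Step 1: split the data into buckets by hash key.
    buckets = defaultdict(list)
    for idx, vec in enumerate(data):
        buckets[hyperplane_key(vec, planes)].append(idx)

    parent = list(range(len(data)))  # union-find forest over all items

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 2: apply the SNN rule inside each bucket and link qualifying pairs.
    for bucket in buckets.values():
        nn = knn_within(bucket, data, k)
        for i, j in combinations(bucket, 2):
            mutual = i in nn[j] and j in nn[i]
            if mutual and len(nn[i] & nn[j]) >= min_shared:
                parent[find(i)] = find(j)

    # Step 3: clusters are the connected components of the link graph.
    return [find(i) for i in range(len(data))]
```

Calling `lsh_snn(points, n_planes=0)` puts every item in a single bucket, which reduces the sketch to plain SNN linking and is a convenient way to check the linking rule on its own. In very small buckets the neighbor lists cover the whole bucket, so a practical version would need an additional guard such as a distance threshold.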

Theoretical analysis on pruning nearest neighbor candidates by locality sensitive hashing

TENCON 2010 - 2010 IEEE Region 10 Conference ◽

10.1109/tencon.2010.5686450 ◽

2010 ◽

Author(s):

T Mutoh ◽

M Iwamura ◽

K Kise

Keyword(s):

Theoretical Analysis ◽

Nearest Neighbor ◽

Locality Sensitive Hashing

Bi-level Locality Sensitive Hashing for k-Nearest Neighbor Computation

2012 IEEE 28th International Conference on Data Engineering ◽

10.1109/icde.2012.40 ◽

2012 ◽

Cited By ~ 16

Author(s):

Jia Pan ◽

Dinesh Manocha

Keyword(s):

Nearest Neighbor ◽

Locality Sensitive Hashing ◽

K Nearest Neighbor

Multi-objective optimization of shared nearest neighbor similarity for feature selection

Applied Soft Computing ◽

10.1016/j.asoc.2015.08.042 ◽

2015 ◽

Vol 37 ◽

pp. 751-762 ◽

Cited By ~ 9

Author(s):

Partha Pratim Kundu ◽

Sushmita Mitra

Keyword(s):

Feature Selection ◽

Nearest Neighbor ◽

Multi Objective Optimization ◽

Multi Objective ◽

Shared Nearest Neighbor

Region-Based Graph Learning towards Large Scale Image Annotation

Graph-Based Methods in Computer Vision ◽

10.4018/978-1-4666-1891-6.ch013 ◽

2012 ◽

pp. 244-260

Author(s):

Bao Bing-Kun ◽

Yan Shuicheng

Keyword(s):

Large Scale ◽

Nearest Neighbor ◽

Image Annotation ◽

Learning Algorithm ◽

Label Propagation ◽

Locality Sensitive Hashing ◽

K Nearest Neighbor ◽

Neighbor Graph ◽

Nearest Neighbor Graph ◽

Modeling Data

Graph-based learning provides a useful approach for modeling data in image annotation problems. In this chapter, the authors show how to construct a region-based graph to annotate large-scale multi-label images. It is well recognized that analysis at the semantic region level may greatly improve image annotation performance compared with analysis at the whole-image level. However, the region-level approach increases the data scale by several orders of magnitude and poses new challenges for most existing algorithms. To address this, each image is first encoded as a Bag-of-Regions based on multiple image segmentations. All image regions are then assembled into a large k-nearest-neighbor graph using an efficient Locality Sensitive Hashing (LSH) method. Finally, a sparse, region-aware image-based graph is fed into the multi-label extension of the entropic graph regularized semi-supervised learning algorithm (Subramanya & Bilmes, 2009). In combination, these steps naturally yield the capability to handle large-scale datasets. Extensive experiments on the NUS-WIDE (260k images) and COREL-5k datasets validate the effectiveness and efficiency of the framework for region-aware and scalable multi-label propagation.

Improving the Performance of kNN in the MapReduce Framework Using Locality Sensitive Hashing

International Journal of Distributed Systems and Technologies ◽

10.4018/ijdst.2019100101 ◽

2019 ◽

Vol 10 (4) ◽

pp. 1-16

Author(s):

Sikha Bagui ◽

Arup Kumar Mondal ◽

Subhash Bagui

Keyword(s):

Nearest Neighbor ◽

Parallel Implementation ◽

Block Size ◽

Computation Time ◽

Locality Sensitive Hashing ◽

K Nearest Neighbor ◽

Mapreduce Framework ◽

Data Set ◽

Data Object ◽

Very Large Datasets

In this work, the authors present a parallel k nearest neighbor (kNN) algorithm that uses locality sensitive hashing to preprocess the data before it is classified with kNN in Hadoop's MapReduce framework. This is compared with the sequential (conventional) implementation. Using locality sensitive hashing's similarity measure with kNN, the iterative procedure to classify a data object is performed within a hash bucket rather than over the whole data set, greatly reducing the computation time needed for classification. Several experiments showed that the parallel implementation performed better than the sequential implementation on very large datasets. The study also experimented with a few map-side and reduce-side optimization features for the parallel implementation and presented some optimum map-side and reduce-side parameters. Among the map-side parameters, the block size and input split size were varied; among the reduce-side parameters, the number of planes was varied; the effects of each were studied.
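To make the "classify within a hash bucket" step concrete, here is a small single-machine sketch. It is not the authors' MapReduce code: it assumes random-hyperplane hashing (suggested by the "number of planes" parameter), majority voting among the k nearest candidates, and a fallback to a full scan when the query's bucket is empty. All function names (`make_planes`, `bucket_of`, `build_index`, `classify`) are hypothetical.

```python
import random
from collections import Counter, defaultdict


def make_planes(dim, n_planes, seed=0):
    """Draw n_planes random hyperplane normals through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]


def bucket_of(vec, planes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(sum(v * p for v, p in zip(vec, plane)) >= 0 for plane in planes)


def build_index(points, labels, planes):
    """Group labelled training points by their hash bucket."""
    index = defaultdict(list)
    for vec, label in zip(points, labels):
        index[bucket_of(vec, planes)].append((vec, label))
    return index


def classify(query, index, planes, k=3):
    """Majority vote among the k nearest training points in the query's bucket."""
    candidates = index.get(bucket_of(query, planes))
    if not candidates:
        # Assumed fallback: scan every bucket when the query's bucket is empty.
        candidates = [item for bucket in index.values() for item in bucket]
    nearest = sorted(
        candidates,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(query, item[0])),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A typical call would build the index once with `build_index(points, labels, planes)` and then run `classify(query, index, planes)` per query. More planes give smaller buckets and fewer distance computations, at the cost of a higher chance that true neighbors land in a different bucket.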

Shared Nearest-Neighbor Quantum Game-Based Attribute Reduction With Hierarchical Coevolutionary Spark and Its Application in Consistent Segmentation of Neonatal Cerebral Cortical Surfaces

IEEE Transactions on Neural Networks and Learning Systems ◽

10.1109/tnnls.2018.2872974 ◽

2019 ◽

Vol 30 (7) ◽

pp. 2013-2027 ◽

Cited By ~ 9

Author(s):

Weiping Ding ◽

Chin-Teng Lin ◽

Zehong Cao

Keyword(s):

Nearest Neighbor ◽

Attribute Reduction ◽

Quantum Game ◽

Shared Nearest Neighbor

A Multi-Relational Hierarchical Clustering Algorithm Based on Shared Nearest Neighbor Similarity

2007 International Conference on Machine Learning and Cybernetics ◽

10.1109/icmlc.2007.4370836 ◽

2007 ◽

Cited By ~ 1

Author(s):

Jing-Feng Guo ◽

Yu-Yan Zhao ◽

Jing Li

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Hierarchical Clustering Algorithm ◽

Shared Nearest Neighbor

Fast document summarization using locality sensitive hashing and memory access efficient node ranking

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v6i3.9030 ◽

2016 ◽

Vol 6 (3) ◽

pp. 945

Author(s):

Ercan Canhasi

Keyword(s):

Time Complexity ◽

Nearest Neighbor ◽

Linear Time ◽

Nearest Neighbor Search ◽

Memory Access ◽

Locality Sensitive Hashing ◽

Document Summarization ◽

Neighbor Search ◽

Node Ranking ◽

Similarity Graph

Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm. The common text modeling method connects pairs of sentences based on their similarity. Even though it can effectively represent the sentence similarity graph of the given document(s), its major drawback is a time complexity of $O(n^2)$, where n is the number of sentences. This quadratic time complexity makes it impractical for large documents. In this paper we propose fast approximation algorithms for text modeling and sentence selection. Our text modeling algorithm reduces the time complexity to near-linear by rapidly finding the most similar sentences to form the sentence similarity graph. To do so we use Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search. For the sentence selection step we propose a simple memory-access-efficient node ranking method based on sequentially scanning only the neighborhood arrays. Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary can reduce the time complexity from quadratic to sub-linear. We see great potential for the proposed method in text summarization for mobile devices and in big text data summarization for the Internet of Things in the cloud. In our experiments, besides evaluating the presented method on the standard general and query multi-document summarization tasks, we also tested it on a few alternative summarization tasks, including general and query, timeline, and comparative summarization.

An Efficient Clustering Method for Hyperspectral Optimal Band Selection via Shared Nearest Neighbor

Remote Sensing ◽

10.3390/rs11030350 ◽

2019 ◽

Vol 11 (3) ◽

pp. 350 ◽

Cited By ~ 7

Author(s):

Qiang Li ◽

Qi Wang ◽

Xuelong Li

Keyword(s):

Nearest Neighbor ◽

Hyperspectral Image ◽

Local Density ◽

Computational Time ◽

Band Selection ◽

Data Sets ◽

Selection Methods ◽

Clustering Method ◽

Slope Change ◽

Shared Nearest Neighbor

A hyperspectral image (HSI) has many bands, which leads to high correlation between adjacent bands, so it is necessary to find representative subsets before further analysis. To address this issue, band selection is considered an effective approach that removes redundant bands from an HSI. Recently, many band selection methods have been proposed, but the majority of them have extremely poor accuracy with a small number of bands and require multiple iterations, which does not meet the purpose of band selection. Therefore, we propose an efficient clustering method based on shared nearest neighbor (SNNC) for hyperspectral optimal band selection, claiming the following contributions: (1) the local density of each band is obtained by shared nearest neighbor, which can more accurately reflect the local distribution characteristics; (2) in order to acquire a band subset containing a large amount of information, the information entropy is taken as one of the weight factors; (3) a method for automatically selecting the optimal band subset is designed based on the slope change. The experimental results reveal that, compared with other methods, the proposed method has competitive computational time and the selected bands achieve higher overall classification accuracy on different data sets, especially when the number of bands is small.
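Contribution (1), a local density derived from shared nearest neighbors, can be sketched as follows. This is an illustrative definition, not the formula from the paper: it assumes each band is a numeric vector, uses squared Euclidean distance, and defines the density of an item as the sum of shared-neighbor counts over those of its k nearest neighbors that also list it as a neighbor. The entropy weighting and slope-based selection are not covered. The names `knn_sets` and `snn_density` are hypothetical.

```python
def knn_sets(vectors, k):
    """Indices of the k nearest neighbours (squared Euclidean distance) of each vector."""
    result = []
    for i, vi in enumerate(vectors):
        order = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(vi, vectors[j])),
        )
        result.append(set(order[:k]))
    return result


def snn_density(vectors, k):
    """Assumed density: shared-neighbour counts summed over mutual k-nearest neighbours."""
    nn = knn_sets(vectors, k)
    density = []
    for i in range(len(vectors)):
        total = 0
        for j in nn[i]:
            if i in nn[j]:  # count j only if i and j list each other
                total += len(nn[i] & nn[j])
        density.append(total)
    return density


# Four tightly grouped items and one isolated item.
print(snn_density([[0], [1], [2], [3], [20]], k=3))  # prints [6, 6, 6, 6, 0]
```

The isolated item receives a density of zero because none of its nearest neighbors list it in return, which is the behavior that makes a shared-neighbor density less sensitive to absolute distances than a plain distance-based density.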
