Shared Nearest Neighbor clustering in a Locality Sensitive Hashing framework

2016 ◽  
Author(s):  
Sawsan Kanj ◽  
Thomas Brüls ◽  
Stéphane Gazut

We present a new algorithm to cluster high-dimensional sequence data, and its application to the field of metagenomics, which aims to reconstruct individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or the shape of clusters. Such problems typically cannot be solved directly with classical approaches seeking to estimate the density of clusters, e.g., using the shared nearest neighbors rule, due to the prohibitive size of contemporary sequence datasets. We explore here a new method based on combining the shared nearest neighbor (SNN) rule with the concept of Locality Sensitive Hashing (LSH). The proposed method, called LSH-SNN, works by randomly splitting the input data into smaller-sized subsets (buckets) and employing the shared nearest neighbor rule on each of these buckets. Links can be created among neighbors sharing a sufficient number of elements, hence allowing clusters to be grown from linked elements. LSH-SNN can scale up to larger datasets consisting of millions of sequences, while achieving high accuracy across a variety of sample sizes and complexities.
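The bucket-then-link idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes numeric feature vectors rather than sequences, uses random-hyperplane hashing as a stand-in for whatever hash family LSH-SNN uses, and links two points only when they are mutual k-nearest neighbors sharing at least `min_shared` neighbors. All function names and parameters (`lsh_snn`, `n_planes`, `min_shared`, `planes`) are hypothetical.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hyperplane_key(vec, planes):
    """Bucket key: the sign pattern of vec against each hyperplane (assumed hash)."""
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes)


def knn_in_bucket(ids, points, k):
    """Brute-force k nearest neighbors, restricted to the points of one bucket."""
    nn = {}
    for i in ids:
        others = sorted((j for j in ids if j != i),
                        key=lambda j: sq_dist(points[i], points[j]))
        nn[i] = set(others[:k])
    return nn


def lsh_snn(points, k=3, min_shared=2, n_planes=4, seed=0, planes=None):
    """Return one cluster label per point (labels are union-find roots)."""
    if planes is None:
        rng = random.Random(seed)
        dim = len(points[0])
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    # Step 1: split the data into buckets by hash key.
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[hyperplane_key(p, planes)].append(idx)

    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 2: apply the shared-nearest-neighbor rule inside each bucket.
    for ids in buckets.values():
        nn = knn_in_bucket(ids, points, k)
        for i in ids:
            for j in nn[i]:
                # Assumed linking rule: mutual neighbors with enough shared neighbors.
                if i in nn[j] and len(nn[i] & nn[j]) >= min_shared:
                    parent[find(i)] = find(j)

    # Step 3: clusters are the connected components of the link graph.
    return [find(i) for i in range(len(points))]
```

Because neighbors are only searched within a bucket, the quadratic neighbor search runs on small subsets instead of the full dataset. The trade-off in this sketch is that a cluster cut by a hyperplane is never re-joined, which is why practical LSH schemes usually repeat the hashing several times.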

Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 779
Author(s):  
Ruriko Yoshida

A tropical ball is a ball defined by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and also over the space of phylogenetic trees with a given set of leaf labels. Then we discuss their application to the K nearest neighbors (KNN) algorithm, a supervised learning method used to classify a high-dimensional vector into given categories by looking at a ball centered at the vector, which contains K vectors in the space.
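As an illustration of KNN under the tropical metric, here is a minimal sketch. It assumes the usual definition of the tropical metric on the tropical projective torus, d(u, v) = max_i(u_i - v_i) - min_i(u_i - v_i), and a plain majority vote; the function names are hypothetical and the abstract does not specify the paper's actual procedure.

```python
from collections import Counter


def tropical_distance(u, v):
    """Tropical metric: spread of the coordinate-wise differences.

    Adding the same constant to every coordinate of u (or v) leaves the
    distance unchanged, matching the quotient by the all-ones vector.
    """
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


def tropical_knn_predict(query, train_points, train_labels, k=3):
    """Classify query by majority vote among its k tropically nearest points."""
    ranked = sorted(range(len(train_points)),
                    key=lambda i: tropical_distance(query, train_points[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

The k nearest points are exactly those in the smallest tropical ball around the query that contains K training vectors, which is the picture the abstract describes.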


1985 ◽  
Vol 17 (4) ◽  
pp. 794-809 ◽  
Author(s):  
Charles M. Newman ◽  
Yosef Rinott

Consider a Poisson point process of density 1 in Rd, centered so that the origin is one of the points. Using lp distances, 1 ≤ p ≤ ∞, define Nd as the number of other points which have the origin as their nearest neighbor and Vol Vd as the volume of the Voronoi region of the origin. We prove that Nd → Poisson (λ = 1) and Vol Vd → 1 in distribution as d → ∞, thus extending previous results from the case p = 2. More generally, for a variety of exchangeable distributions for n + 1 points, e0, ..., en, in Rd and a variety of distances, we obtain the asymptotic behavior of Ndn, the number of points which have e0 as their nearest neighbor, as n, d → ∞ in one or both of the possible iterated orders. The distributions treated include points distributed on the unit l2 sphere and the distances treated include non-lp distances related to correlation coefficients.


2018 ◽  
Vol 38 (1) ◽  
pp. 123-137 ◽  
Author(s):  
Selim Bahadir ◽  
Elvan Ceyhan

For a random sample of points in R, we consider the number of pairs whose members are nearest neighbors (NNs) to each other and the number of pairs sharing a common NN. The pairs of the first type are called reflexive NNs, whereas the pairs of the latter type are called shared NNs. In this article, we consider the case where the random sample of size n is from the uniform distribution on an interval. We denote the number of reflexive NN pairs and the number of shared NN pairs in the sample by Rn and Qn, respectively. We derive the exact forms of the expected value and the variance for both Rn and Qn, and derive a recurrence relation for Rn which may also be used to compute the exact probability mass function (pmf) of Rn. Our approach is a novel method for finding the pmf of Rn and agrees with the results in the literature. We also present SLLN and CLT results for both Rn and Qn as n goes to infinity.
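The two counts defined in this abstract can be computed directly for a one-dimensional sample. The sketch below is a brute-force illustration of the definitions, not the paper's derivation. The function names are hypothetical, and ties are broken by taking the first index, which is harmless for a continuous distribution where ties occur with probability zero.

```python
from itertools import combinations


def nearest_neighbor_indices(xs):
    """For each point, the index of its nearest other point on the line."""
    nn = []
    for i, x in enumerate(xs):
        j = min((j for j in range(len(xs)) if j != i),
                key=lambda j: abs(xs[j] - x))
        nn.append(j)
    return nn


def count_reflexive_and_shared(xs):
    """Return (number of reflexive NN pairs, number of shared NN pairs)."""
    nn = nearest_neighbor_indices(xs)
    pairs = list(combinations(range(len(xs)), 2))
    # Reflexive pair: each member is the other's nearest neighbor.
    reflexive = sum(1 for i, j in pairs if nn[i] == j and nn[j] == i)
    # Shared pair: both members have the same nearest neighbor.
    shared = sum(1 for i, j in pairs if nn[i] == nn[j])
    return reflexive, shared
```

Calling `count_reflexive_and_shared` on repeated samples drawn with `random.random()` gives empirical averages that could be compared against the exact expectations the article derives.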


2011 ◽  
Vol 145 ◽  
pp. 189-193 ◽  
Author(s):  
Horng Lin Shieh

In this paper, a hybrid method combining rough set and shared nearest neighbor algorithms is proposed for clustering data with non-globular shapes. The rough k-means algorithm is based on the distances between data and cluster centers. It partitions a data set with globular shapes well, but when the data have non-globular shapes, the results obtained by a rough k-means algorithm are not very satisfactory. In order to resolve this problem, a combined rough set and shared nearest neighbor algorithm is proposed. The proposed algorithm first adopts a shared nearest neighbor algorithm to evaluate the similarity among data, then the lower and upper approximations of a rough set algorithm are used to partition the data set into clusters.


Author(s):  
RICHARD NOCK ◽  
MARC SEBBAN ◽  
DIDIER BERNARD

In this paper, we propose a thorough investigation of a nearest neighbor rule which we call the "Symmetric Nearest Neighbor (sNN) rule". Basically, it symmetrises the classical nearest neighbor relationship from which the points voting for a given instance are computed. Experiments on 29 datasets, most of which are readily available, show that the method significantly outperforms the traditional Nearest Neighbors methods. Experiments on a domain of interest related to tropical pollution normalization also show the greater potential of this method. We finally discuss the reasons for the rule's efficiency, provide methods for speeding up the classification time, and derive from the sNN rule a reliable and fast algorithm to fix the parameter k in the k-NN rule, a longstanding problem in this field.
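One possible reading of the symmetrised relationship is sketched below. The interpretation is an assumption, since the abstract does not give the exact rule: a training point votes for a query if it is among the query's k nearest training points, or if the query would be among that training point's own k nearest points. The function names are hypothetical.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def symmetric_voters(query, train_points, k=1):
    """Indices of training points that vote under the assumed symmetric rule."""
    n = len(train_points)
    # Forward relation: the query's k nearest training points.
    order = sorted(range(n), key=lambda i: sq_dist(query, train_points[i]))
    voters = set(order[:k])
    # Reverse relation: training points that count the query among their k nearest.
    for i in range(n):
        d_query = sq_dist(train_points[i], query)
        closer = sum(1 for j in range(n)
                     if j != i and sq_dist(train_points[i], train_points[j]) < d_query)
        if closer < k:
            voters.add(i)
    return voters


def symmetric_nn_predict(query, train_points, train_labels, k=1):
    """Majority vote over the symmetric voter set."""
    votes = Counter(train_labels[i] for i in symmetric_voters(query, train_points, k))
    return votes.most_common(1)[0][0]
```

Under this reading the voter set always contains the classical k nearest neighbors, so the rule can only add voters relative to plain k-NN.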


2018 ◽  
Vol 25 (2) ◽  
pp. 236-250 ◽  
Author(s):  
Sawsan Kanj ◽  
Thomas Brüls ◽  
Stéphane Gazut
