Bi-level Locality Sensitive Hashing for k-Nearest Neighbor Computation

Graph-based learning provides a useful approach for modeling data in image annotation problems. In this chapter, the authors introduce how to construct a region-based graph to annotate large scale multi-label images. It has been well recognized that analysis in semantic region level may greatly improve image annotation performance compared to that in whole image level. However, the region level approach increases the data scale to several orders of magnitude and lays down new challenges to most existing algorithms. To this end, each image is firstly encoded as a Bag-of-Regions based on multiple image segmentations. And then, all image regions are constructed into a large k-nearest-neighbor graph with efficient Locality Sensitive Hashing (LSH) method. At last, a sparse and region-aware image-based graph is fed into the multi-label extension of the Entropic graph regularized semi-supervised learning algorithm (Subramanya & Bilmes, 2009). In combination they naturally yield the capability in handling large-scale dataset. Extensive experiments on NUS-WIDE (260k images) and COREL-5k datasets well validate the effectiveness and efficiency of the framework for region-aware and scalable multi-label propagation.

Download Full-text

Improving the Performance of kNN in the MapReduce Framework Using Locality Sensitive Hashing

International Journal of Distributed Systems and Technologies ◽

10.4018/ijdst.2019100101 ◽

2019 ◽

Vol 10 (4) ◽

pp. 1-16

Author(s):

Sikha Bagui ◽

Arup Kumar Mondal ◽

Subhash Bagui

Keyword(s):

Nearest Neighbor ◽

Parallel Implementation ◽

Block Size ◽

Computation Time ◽

Locality Sensitive Hashing ◽

K Nearest Neighbor ◽

Mapreduce Framework ◽

Data Set ◽

Data Object ◽

Very Large Datasets

In this work the authors present a parallel k nearest neighbor (kNN) algorithm using locality sensitive hashing to preprocess the data before it is classified using kNN in Hadoop's MapReduce framework. This is compared with the sequential (conventional) implementation. Using locality sensitive hashing's similarity measure with kNN, the iterative procedure to classify a data object is performed within a hash bucket rather than the whole data set, greatly reducing the computation time needed for classification. Several experiments were run that showed that the parallel implementation performed better than the sequential implementation on very large datasets. The study also experimented with a few map and reduce side optimization features for the parallel implementation and presented some optimum map and reduce side parameters. Among the map side parameters, the block size and input split size were varied, and among the reduce side parameters, the number of planes were varied, and their effects were studied.

Download Full-text

Scaling Unsupervised Risk Stratification to Massive Clinical Datasets

International Journal of Knowledge Discovery in Bioinformatics ◽

10.4018/jkdb.2011010103 ◽

2011 ◽

Vol 2 (1) ◽

pp. 45-59

Author(s):

Zeeshan Syed ◽

Ilan Rubinfeld

Keyword(s):

Risk Stratification ◽

Nearest Neighbor ◽

A Priori ◽

Substantial Improvement ◽

Adverse Outcomes ◽

Locality Sensitive Hashing ◽

Training Data ◽

K Nearest Neighbor ◽

Improvement Program ◽

K Nearest Neighbors

While rare clinical events, by definition, occur infrequently in a population, the consequences of these events can be drastic. Unfortunately, developing risk stratification algorithms for these conditions requires large volumes of data to capture enough positive and negative cases. This process is slow, expensive, and burdensome to both patients and caregivers. This paper proposes an unsupervised machine learning approach to address this challenge and risk stratify patients for adverse outcomes without use of a priori knowledge or labeled training data. The key idea of the approach is to identify high-risk patients as anomalies in a population. Cases are identified through a novel algorithm that finds an approximate solution to the k-nearest neighbor problem using locality sensitive hashing (LSH) based on p-stable distributions. The algorithm is optimized to use multiple LSH searches, each with a geometrically increasing radius, to find the k-nearest neighbors of patients in a dynamically changing dataset where patients are being added or removed over time. When evaluated on data from the National Surgical Quality Improvement Program (NSQIP), this approach successfully identifies patients at an elevated risk of mortality and rare morbidities. The LSH-based algorithm provided a substantial improvement over an exact k-nearest neighbor algorithm in runtime, while achieving a similar accuracy.

Download Full-text

LSR‐forest: An locality sensitive hashing‐based approximate k ‐nearest neighbor query algorithm on high‐dimensional uncertain data

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.5795 ◽

2020 ◽

Author(s):

Jiagang Wang ◽

Tu Qian ◽

Anbang Yang ◽

Hui Wang ◽

Jiangbo Qian

Keyword(s):

Nearest Neighbor ◽

Uncertain Data ◽

Locality Sensitive Hashing ◽

High Dimensional ◽

K Nearest Neighbor ◽

Nearest Neighbor Query ◽

Query Algorithm

Download Full-text

Fast GPU-based locality sensitive hashing for k-nearest neighbor computation

Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems - GIS '11 ◽

10.1145/2093973.2094002 ◽

2011 ◽

Cited By ~ 52

Author(s):

Jia Pan ◽

Dinesh Manocha

Keyword(s):

Nearest Neighbor ◽

Locality Sensitive Hashing ◽

K Nearest Neighbor

Download Full-text

Machine Learning Verdict of EEG Signals in Brain Computer Interface

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit1838114 ◽

2018 ◽

pp. 429-441

Author(s):

M. Jeyanthi ◽

C. Velayutham

Keyword(s):

Nearest Neighbor ◽

Technology Development ◽

Vital Role ◽

Svm Classifier ◽

K Nearest Neighbor ◽

Data Mining Technique ◽

Data Set ◽

Eeg Data ◽

Irrelevant Attributes

In Science and Technology Development BCI plays a vital role in the field of Research. Classification is a data mining technique used to predict group membership for data instances. Analyses of BCI data are challenging because feature extraction and classification of these data are more difficult as compared with those applied to raw data. In this paper, We extracted features using statistical Haralick features from the raw EEG data . Then the features are Normalized, Binning is used to improve the accuracy of the predictive models by reducing noise and eliminate some irrelevant attributes and then the classification is performed using different classification techniques such as Naïve Bayes, k-nearest neighbor classifier, SVM classifier using BCI dataset. Finally we propose the SVM classification algorithm for the BCI data set.

Download Full-text

PENENTUAN DAERAH PRIORITAS PELAYANAN AKTA KELAHIRAN DENGAN METODE K-NN DAN K-MEANS

Komputasi: Jurnal Ilmiah Ilmu Komputer dan Matematika ◽

10.33751/komputasi.v17i1.1735 ◽

2020 ◽

Vol 17 (1) ◽

pp. 319-328

Author(s):

Ade Muchlis Maulana Anwar ◽

Prihastuti Harsani ◽

Aries Maesya

Keyword(s):

Nearest Neighbor ◽

Information Gain ◽

Birth Certificate ◽

Population Data ◽

Community Services ◽

Birth Certificates ◽

Similar Data ◽

K Nearest Neighbor ◽

Civil Registration ◽

The Family

Population Data is individual data or aggregate data that is structured as a result of Population Registration and Civil Registration activities. Birth Certificate is a Civil Registration Deed as a result of recording the birth event of a baby whose birth is reported to be registered on the Family Card and given a Population Identification Number (NIK) as a basis for obtaining other community services. From the total number of integrated birth certificate reporting for the 2018 Population Administration Information System (SIAK) totaling 570,637 there were 503,946 reported late and only 66,691 were reported publicly. Clustering is a method used to classify data that is similar to others in one group or similar data to other groups. K-Nearest Neighbor is a method for classifying objects based on learning data that is the closest distance to the test data. k-means is a method used to divide a number of objects into groups based on existing categories by looking at the midpoint. In data mining preprocesses, data is cleaned by filling in the blank data with the most dominating data, and selecting attributes using the information gain method. Based on the k-nearest neighbor method to predict delays in reporting and the k-means method to classify priority areas of service with 10,000 birth certificate data on birth certificates in 2019 that have good enough performance to produce predictions with an accuracy of 74.00% and with K = 2 on k-means produces a index davies bouldin of 1,179.

Download Full-text

A Scalable K-Nearest Neighbor Algorithm for Recommendation System Problems

2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO) ◽

10.23919/mipro48935.2020.9245195 ◽

2020 ◽

Author(s):

A. Sagdic ◽

C. Tekinbas ◽

E. Arslan ◽

T. Kucukyilmaz

Keyword(s):

Recommendation System ◽

Nearest Neighbor ◽

K Nearest Neighbor ◽

Nearest Neighbor Algorithm ◽

K Nearest Neighbor Algorithm

Download Full-text

Optimizing Error Rate in Intrusion Detection System Using Artificial Neural Network Algorithm

International Journal of Emerging Research in Management and Technology ◽

10.23956/ijermt.v6i9.102 ◽

2018 ◽

Vol 6 (9) ◽

pp. 152

Author(s):

S. Vijaya Rani ◽

G. N. K. Suresh Babu

Keyword(s):

Neural Network ◽

Artificial Neural Network ◽

Intrusion Detection ◽

Error Rate ◽

Learning Process ◽

Nearest Neighbor ◽

Detection System ◽

Support Vector ◽

K Nearest Neighbor ◽

Artificial Neural

The illegal hackers penetrate the servers and networks of corporate and financial institutions to gain money and extract vital information. The hacking varies from one computing system to many system. They gain access by sending malicious packets in the network through virus, worms, Trojan horses etc. The hackers scan a network through various tools and collect information of network and host. Hence it is very much essential to detect the attacks as they enter into a network. The methods available for intrusion detection are Naive Bayes, Decision tree, Support Vector Machine, K-Nearest Neighbor, Artificial Neural Networks. A neural network consists of processing units in complex manner and able to store information and make it functional for use. It acts like human brain and takes knowledge from the environment through training and learning process. Many algorithms are available for learning process This work carry out research on analysis of malicious packets and predicting the error rate in detection of injured packets through artificial neural network algorithms.

Download Full-text

Perancangan Aplikasi Prediksi Kelulusan Tepat Waktu Bagi Mahasiswa Baru Dengan Teknik Data Mining (Studi Kasus: Data Akademik Mahasiswa STMIK Dipanegara Makassar)

Creative Information Technology Journal ◽

10.24076/citec.2014v1i4.27 ◽

2015 ◽

Vol 1 (4) ◽

pp. 270

Author(s):

Muhammad Syukri Mustafa ◽

I. Wayan Simpen

Keyword(s):

Data Mining ◽

Nearest Neighbor ◽

Test Results ◽

K Nearest Neighbor ◽

Accuracy Rate ◽

Sample Data ◽

New Students ◽

K Nearest Neighbor Algorithm ◽

Using Data ◽

Existing Data

Penelitian ini dimaksudkan untuk melakukan prediksi terhadap kemungkian mahasiswa baru dapat menyelesaikan studi tepat waktu dengan menggunakan analisis data mining untuk menggali tumpukan histori data dengan menggunakan algoritma K-Nearest Neighbor (KNN). Aplikasi yang dihasilkan pada penelitian ini akan menggunakan berbagai atribut yang klasifikasikan dalam suatu data mining antara lain nilai ujian nasional (UN), asal sekolah/ daerah, jenis kelamin, pekerjaan dan penghasilan orang tua, jumlah bersaudara, dan lain-lain sehingga dengan menerapkan analysis KNN dapat dilakukan suatu prediksi berdasarkan kedekatan histori data yang ada dengan data yang baru, apakah mahasiswa tersebut berpeluang untuk menyelesaikan studi tepat waktu atau tidak. Dari hasil pengujian dengan menerapkan algoritma KNN dan menggunakan data sampel alumni tahun wisuda 2004 s.d. 2010 untuk kasus lama dan data alumni tahun wisuda 2011 untuk kasus baru diperoleh tingkat akurasi sebesar 83,36%.This research is intended to predict the possibility of new students time to complete studies using data mining analysis to explore the history stack data using K-Nearest Neighbor algorithm (KNN). Applications generated in this study will use a variety of attributes in a data mining classified among other Ujian Nasional scores (UN), the origin of the school / area, gender, occupation and income of parents, number of siblings, and others that by applying the analysis KNN can do a prediction based on historical proximity of existing data with new data, whether the student is likely to complete the study on time or not. From the test results by applying the KNN algorithm and uses sample data alumnus graduation year 2004 s.d 2010 for the case of a long and alumni data graduation year 2011 for new cases obtained accuracy rate of 83.36%.

Download Full-text