Genome Scans for Selection and Introgression based on k-nearest Neighbor Techniques

2019 ◽  
Author(s):  
Bastian Pfeifer ◽  
Nikolaos Alachiotis ◽  
Pavlos Pavlidis ◽  
Michael G. Schimek

Abstract
In recent years, genome-scan methods have been extensively used to detect local signatures of selection and introgression. Here, we introduce a series of versatile genome-scan methods based on non-parametric k-nearest neighbor (kNN) techniques that incorporate pairwise Fixation Index (FST) estimates and pairwise nucleotide differences (dxy) as features. Simulations were performed for both positive directional selection and introgression, with varying parameters such as recombination rates, population background histories, the proportion of introgression, and the time of gene flow. We find that kNN-based methods perform remarkably well, yielding stable results over almost the entire range of k. We provide a GitHub repository (pievos101/kNN-Genome-Scans) containing R source code that demonstrates how to apply the proposed methods to real-world genomic data using the population genomics R package PopGenome.
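The window-scan idea above can be sketched with a plain kNN classifier: each genomic window is summarized by feature values (here, hypothetical pairwise FST and dxy numbers invented for illustration, not the paper's data), and a window is labeled by majority vote among its k nearest training windows. This is a minimal stdlib-Python sketch of the general technique, not the authors' R implementation from the pievos101/kNN-Genome-Scans repository.

```python
import math

def knn_classify(train, labels, query, k=3):
    """Label a query point by majority vote among its k nearest
    neighbours (Euclidean distance in feature space)."""
    ranked = sorted((math.dist(x, query), lab) for x, lab in zip(train, labels))
    top = [lab for _, lab in ranked[:k]]
    return max(set(top), key=top.count)

# Hypothetical per-window summaries: (pairwise FST, dxy).
windows = [(0.05, 0.010), (0.07, 0.012), (0.60, 0.002), (0.55, 0.003)]
classes = ["neutral", "neutral", "selected", "selected"]

print(knn_classify(windows, classes, (0.58, 0.0025), k=3))  # prints "selected"
```

In practice the feature vectors would come from per-window FST/dxy estimates such as those computed by PopGenome.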

Mathematics ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 779
Author(s):  
Ruriko Yoshida

A tropical ball is a ball defined by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and over the space of phylogenetic trees with a given set of leaf labels. We then discuss their application to the K-nearest neighbors (KNN) algorithm, a supervised learning method that classifies a high-dimensional vector into given categories by examining a ball centered at the vector that contains K vectors of the space.
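For concreteness, the tropical metric on the tropical projective torus (the generalized Hilbert projective metric, d_tr(x, y) = max_i(x_i - y_i) - min_i(x_i - y_i)) fits in a few lines; a KNN classifier over this space then simply ranks neighbors by this distance instead of the Euclidean one. A minimal sketch:

```python
def tropical_dist(x, y):
    # Generalized Hilbert projective (tropical) metric:
    # max_i(x_i - y_i) - min_i(x_i - y_i).
    diffs = [a - b for a, b in zip(x, y)]
    return max(diffs) - min(diffs)

# Adding a constant to every coordinate does not change the distance,
# which is why the metric is well defined on the projective torus.
print(tropical_dist((1, 2, 3), (2, 3, 4)))  # prints 0
print(tropical_dist((0, 1, 2), (0, 0, 0)))  # prints 2
```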


2021 ◽  
Author(s):  
Ayesha Sania ◽  
Nicolo Pini ◽  
Morgan Nelson ◽  
Michael Myers ◽  
Lauren Shuffrey ◽  
...  

Abstract Background — Missing data are a source of bias in epidemiologic studies. This is problematic in alcohol research, where data missingness is linked to drinking behavior. Methods — The Safe Passage Study was a prospective investigation of prenatal drinking and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing data using a machine learning algorithm, K-Nearest Neighbor (K-NN), which imputes missing values for a participant using data from the participants closest to them. Imputed values were weighted by the distances from the nearest neighbors and matched for day of week. Validation was done on randomly deleted data spanning 5-15 consecutive days. Results — Data from 5 nearest neighbors and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first-trimester data set with no missing days, there was no difference between actual and predicted values for 64% of the deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions — K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.


2018 ◽  
Vol 2018 ◽  
pp. 1-17 ◽  
Author(s):  
Hyung-Ju Cho

We investigate the k-nearest neighbor (kNN) join in road networks, which determines the k nearest neighbors (NNs) in a dataset S for every object in another dataset R. The kNN join is a primitive operation widely used in many data mining applications. However, it is an expensive operation because it combines the kNN query and the join operation, and most existing methods assume the Euclidean distance metric. We instead consider the problem of processing kNN joins in road networks, where the distance between two points is the length of the shortest path connecting them. We propose a shared-execution-based approach called the group-nested loop (GNL) method that efficiently evaluates kNN joins in road networks by exploiting grouping and shared execution. The GNL method can be easily implemented using existing kNN query algorithms. Extensive experiments using several real-life roadmaps confirm the superior performance and effectiveness of the proposed method across a wide range of problem settings.
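The shared-execution idea can be illustrated with a toy sketch (the graph, node names, and objects below are invented; this shows the grouping intuition, not the paper's full GNL algorithm): query objects in R that lie on the same network node share one shortest-path traversal, after which every object in the group reads off its k nearest S objects.

```python
import heapq
from collections import defaultdict

def dijkstra(graph, src):
    """Network (shortest-path) distances from src to all nodes."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def knn_join(graph, R, S, k=1):
    """Group query objects in R by their network node so each group
    shares a single shortest-path traversal, then report the k
    nearest S objects per query."""
    groups = defaultdict(list)
    for name, node in R:
        groups[node].append(name)
    result = {}
    for node, names in groups.items():
        dist = dijkstra(graph, node)  # one traversal per group
        ranked = sorted(S, key=lambda sn: dist.get(sn[1], float("inf")))[:k]
        for name in names:
            result[name] = [sn[0] for sn in ranked]
    return result

# Toy road network (adjacency lists with edge lengths) and objects.
graph = {"a": [("b", 1.0), ("c", 5.0)],
         "b": [("a", 1.0), ("c", 2.0)],
         "c": [("a", 5.0), ("b", 2.0)]}
R = [("r1", "a"), ("r2", "a"), ("r3", "b")]
S = [("s1", "b"), ("s2", "c")]
res = knn_join(graph, R, S, k=1)
print(res)
```

Note that r1 and r2 sit on the same node, so the naive per-object approach would run the same traversal twice; grouping runs it once.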


2020 ◽  
Author(s):  
Ayesha Sania ◽  
Nicolò Pini ◽  
Morgan E. Nelson ◽  
Michael M. Myers ◽  
Lauren C. Shuffrey ◽  
...  

Abstract Background — Missing data are a source of bias in many epidemiologic studies. This is problematic in alcohol research, where data missingness may not be random, as it depends on patterns of drinking behavior. Methods — The Safe Passage Study was a prospective investigation of prenatal alcohol consumption and fetal/infant outcomes (n=11,083). Daily alcohol consumption for the last reported drinking day and the 30 days prior was recorded using the Timeline Followback method. Of 3.2 million person-days, data were missing for 0.36 million. We imputed missing exposure data using a machine learning algorithm, K-Nearest Neighbor (K-NN), which imputes missing values for a participant using data from the other participants closest to them. Since participants with no missing days may not be comparable to those with missing data, segments from participants with complete and incomplete data were included as a reference. Imputed values were weighted by the distances from the nearest neighbors and matched for day of week. We validated our approach by randomly deleting non-missing data for 5-15 consecutive days. Results — We found that data from 5 nearest neighbors (i.e., K=5) and segments of 55 days provided imputed values with the least imputation error. After deleting data segments from a first-trimester data set with no missing days, there was no difference between actual and predicted values for 64% of the deleted segments. For 31% of the segments, imputed data were within +/-1 drink/day of the actual values. Conclusions — K-NN can be used to impute missing data in longitudinal studies of alcohol use during pregnancy with high accuracy.
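The imputation step can be sketched as follows (the donor series and k below are illustrative; the study's actual procedure also matches imputations to day of week and uses 55-day segments, which this toy omits): missing days are filled with a distance-weighted mean over the k nearest complete series.

```python
def knn_impute(target, donors, k=5):
    """Fill None entries in a daily series with the distance-weighted
    mean of the k nearest donor series; distance uses only the days
    observed in the target."""
    obs = [i for i, v in enumerate(target) if v is not None]
    def dist(d):
        return sum((target[i] - d[i]) ** 2 for i in obs) ** 0.5
    nearest = sorted(donors, key=dist)[:k]
    weights = [1.0 / (dist(d) + 1e-9) for d in nearest]
    total = sum(weights)
    filled = list(target)
    for i, v in enumerate(target):
        if v is None:
            filled[i] = sum(w * d[i] for w, d in zip(weights, nearest)) / total
    return filled

# Toy daily drink counts: the target matches the first donor exactly
# on observed days, so the imputed day lands very close to 1.0.
target = [1.0, None, 1.0, 1.0]
donors = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
print(knn_impute(target, donors, k=2))
```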


Teknika ◽  
2021 ◽  
Vol 10 (2) ◽  
pp. 96-103
Author(s):  
Mohammad Farid Naufal ◽  
Selvia Ferdiana Kusuma ◽  
Kevin Christian Tanus ◽  
Raynaldy Valentino Sukiwun ◽  
Joseph Kristiano ◽  
...  

The global Covid-19 pandemic that emerged at the end of 2019 has become a major problem for every country in the world. Covid-19 is a virus that attacks the lungs and can cause death. Many Covid-19 patients have been treated in hospitals, so chest X-ray images of the lungs of patients infected with Covid-19 are available. Many studies have classified chest X-ray images using Convolutional Neural Networks (CNN) to distinguish healthy lungs, Covid-19 infection, and other lung diseases, but no study has yet compared the performance of CNN against classical machine learning algorithms such as Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) to measure the gap in performance and required execution time. This study aims to compare the performance and execution time of the K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and CNN classification algorithms for detecting Covid-19 from chest X-ray images. Based on tests using 5-fold cross validation, CNN had the best average performance: accuracy 0.9591, precision 0.9592, recall 0.9591, and F1 score 0.959, with an average execution time of 3,102.562 seconds.


1998 ◽  
Vol 28 (8) ◽  
pp. 1107-1115 ◽  
Author(s):  
Matti Maltamo ◽  
Annika Kangas

In the Finnish compartmentwise inventory systems, growing stock is described with means and sums of tree characteristics, such as mean height and basal area, by tree species. In the calculations, growing stock is described in a treewise manner using a diameter distribution predicted from stand variables. The treewise description is needed for several reasons, e.g., for predicting log volumes or stand growth and for analyzing the forest structure. In this study, methods for predicting the basal area diameter distribution based on the k-nearest neighbor (k-nn) regression are compared with methods based on parametric distributions. In the k-nn method, the predicted values for interesting variables are obtained as weighted averages of the values of neighboring observations. Using k-nn based methods, the basal area diameter distribution of a stand is predicted with a weighted average of the distributions of k-nearest neighbors. The methods tested in this study include weighted averages of (i) Weibull distributions of k-nearest neighbors, (ii) distributions of k-nearest neighbors smoothed with the kernel method, and (iii) empirical distributions of the k-nearest neighbors. These methods are compared for the accuracy of stand volume estimation, stand structure description, and stand growth prediction. Methods based on the k-nn regression proved to give a more accurate description of the stand than the parametric methods.
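Method (iii), the weighted average of the neighbors' empirical distributions, can be sketched in a few lines (the stand variables, bins, and weighting scheme below are invented for illustration, not the study's exact formulation):

```python
def knn_distribution(query, stands, k=2):
    """Predict a stand's diameter distribution as the inverse-distance-
    weighted mean of the distributions of its k nearest neighbours in
    stand-variable space (e.g. mean height, basal area)."""
    def dist(s):
        return sum((a - b) ** 2 for a, b in zip(query, s["vars"])) ** 0.5
    nearest = sorted(stands, key=dist)[:k]
    weights = [1.0 / (dist(s) + 1e-9) for s in nearest]
    total = sum(weights)
    n_bins = len(nearest[0]["dist"])
    return [sum(w * s["dist"][i] for w, s in zip(weights, nearest)) / total
            for i in range(n_bins)]

# Two toy stands with 2-bin diameter distributions; the query equals
# the first stand's variables, so the prediction stays near (0.5, 0.5).
stands = [{"vars": (10.0, 20.0), "dist": [0.5, 0.5]},
          {"vars": (30.0, 40.0), "dist": [0.2, 0.8]}]
print(knn_distribution((10.0, 20.0), stands, k=2))
```

Because each neighbor's distribution sums to one and the weights are normalized, the predicted distribution also sums to one.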


2020 ◽  
Vol 5 (1) ◽  
pp. 33
Author(s):  
Rozzi Kesuma Dinata ◽  
Fajriana Fajriana ◽  
Zulfa Zulfa ◽  
Novia Hasdyna

In this study, the K-Nearest Neighbor algorithm is implemented to classify junior high schools (SMP/equivalent) according to prospective students' interests. The aim is to make it easy for users to find an SMP/equivalent school based on eight criteria: accreditation, classroom facilities, sports facilities, laboratories, extracurricular activities, cost, grade levels, and study hours. The data used in this study were obtained from the Education, Youth, and Sports Office (Dinas Pendidikan Pemuda dan Olahraga) of Bireuen Regency. Using K-NN with the Euclidean distance and k=3, the study obtained a precision of 63.67%, recall of 68.95%, and accuracy of 79.33%.
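The reported precision, recall, and accuracy follow from a confusion matrix; as a reminder of the definitions for the binary case (the counts below are invented, not the study's):

```python
def prf(tp, fp, fn, tn):
    """Precision, recall, and accuracy from binary confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

print(prf(8, 2, 2, 8))  # prints (0.8, 0.8, 0.8)
```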


Respati ◽  
2018 ◽  
Vol 13 (2) ◽  
Author(s):  
Eri Sasmita Susanto ◽  
Kusrini Kusrini ◽  
Hanif Al Fatta

Abstract
This research focuses on testing the feasibility of predicting student graduation at Universitas AMIKOM Yogyakarta. The authors chose the K-Nearest Neighbors (K-NN) algorithm because K-NN can process numerical data and does not require a complicated iterative parameter-estimation scheme, which means it can be applied to large datasets. The input to the system is sample data on students from 2014-2015; testing in this research used both testing data and training data. The criteria used in this study are the semester 1-4 GPA, credit (SKS) attainment, and graduation status. The output of the system is a prediction of student graduation, divided into on-time and not-on-time graduation. The test results show that k=14 with k-fold=5 gives the best performance in predicting student graduation with the K-Nearest Neighbor method using the four-semester grade point average, with accuracy = 98.46%, precision = 99.53%, and recall = 97.64%.
Keywords: K-Nearest Neighbors Algorithm, Graduation Prediction, Testing Data, Training Data


Author(s):  
Novan Wijaya

Abstract
The apple is a superior fruit that is very popular and widely consumed. Apples come in many varieties that can be distinguished by the color and shape of the fruit. In this study, Hue Saturation Value (HSV) and Local Binary Pattern (LBP) features are used to extract the color and shape of the fruit, which then serve as the descriptors of the color and shape of the apples under study. The K-Nearest Neighbor (K-NN) method, an artificial intelligence technique, is used here to classify the values obtained from the HSV and LBP feature extraction. The data used in this study consist of 800 images: 600 training images and 200 test images. Overall, the K-Nearest Neighbor method achieved an average precision of 94%, recall of 100%, and accuracy of 94%.
Keywords: Hue Saturation Value, Local Binary Pattern, K-Nearest Neighbor


Author(s):  
Mahinda Mailagaha Kumbure ◽  
Pasi Luukka

Abstract
The fuzzy k-nearest neighbor (FKNN) algorithm, one of the most well-known and effective supervised learning techniques, has often been used in data classification problems but rarely in regression settings. This paper introduces a new, more general fuzzy k-nearest neighbor regression model. The generalization is based on using the Minkowski distance instead of the usual Euclidean distance. The Euclidean distance is often not the optimal choice for practical problems, and better results can be obtained by generalizing it. Using the Minkowski distance allows the proposed method to obtain more reasonable nearest neighbors of the target sample. Another key advantage of this method is that the nearest neighbors are weighted by fuzzy weights based on their similarity to the target sample, leading to a more accurate prediction through a weighted average. The performance of the proposed method is tested on eight real-world datasets from different fields and benchmarked against the k-nearest neighbor method and three other state-of-the-art regression methods. The Manhattan distance- and Euclidean distance-based FKNNreg methods are also implemented, and the results are compared. The empirical results show that the proposed Minkowski distance-based fuzzy regression (Md-FKNNreg) method outperforms the benchmarks and can be a good algorithm for regression problems. In particular, the Md-FKNNreg model gave the significantly lowest overall average root mean square error (0.0769) of all the regression methods used. As a special case of the Minkowski distance, the Manhattan distance yielded the optimal conditions for Md-FKNNreg and achieved the best performance for most of the datasets.
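A rough sketch of the recipe (Minkowski distance with order p, fuzzy inverse-distance weights driven by a fuzzifier m, and a weighted mean of the k nearest targets; the details of the authors' Md-FKNNreg may differ, and the data below are invented):

```python
def minkowski(x, y, p):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def fknn_regress(X, y, query, k=3, p=1, m=2):
    """Fuzzy kNN regression sketch: weight each of the k nearest
    targets by the inverse distance raised to 2/(m-1), then take
    the weighted mean."""
    ranked = sorted(zip(X, y), key=lambda xy: minkowski(xy[0], query, p))[:k]
    weights = [1.0 / (minkowski(x, query, p) ** (2.0 / (m - 1)) + 1e-9)
               for x, _ in ranked]
    return sum(w * t for w, (_, t) in zip(weights, ranked)) / sum(weights)

# Toy 1-D data: a query near x=0 should predict a value near y=0.
X = [(0.0,), (1.0,), (10.0,)]
y = [0.0, 1.0, 10.0]
print(fknn_regress(X, y, (0.1,), k=2, p=1, m=2))
```

Setting p=1 gives the Manhattan special case that the abstract reports as the best-performing configuration.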

