Weighted k-Nearest Neighbors Feature Selection (WkNN-FS)

Feature Selection for High Dimensional Data Using Weighted K-Nearest Neighbors and Genetic Algorithm

IEEE Access ◽ 10.1109/access.2020.3012768 ◽ 2020 ◽ Vol 8 ◽ pp. 139512-139528

Author(s): Shuangjie Li ◽ Kaixiang Zhang ◽ Qianru Chen ◽ Shuqin Wang ◽ Shaoqiang Zhang

Keyword(s): Genetic Algorithm ◽ Feature Selection ◽ High Dimensional Data ◽ Nearest Neighbors ◽ High Dimensional ◽ K Nearest Neighbors ◽ Selection For


Towards Optimization of Malware Detection using Chi-square Feature Selection on Ensemble Classifiers

International Journal of Engineering and Advanced Technology - Regular Issue ◽ 10.35940/ijeat.d2359.0410421 ◽ 2021 ◽ Vol 10 (4) ◽ pp. 254-262

Author(s): *Fadare Oluwaseun Gbenga ◽ Adetunmbi Adebayo Olusola ◽ (Mrs) Oyinloye Oghenerukevwe Eloho ◽ Mogaji Stephen Alaba

Keyword(s): Feature Selection ◽ Malware Detection ◽ Feature Selection Method ◽ Ensemble Methods ◽ Nearest Neighbors ◽ Selection Method ◽ Gradient Boosting ◽ K Nearest Neighbors ◽ Chi Square ◽ Extreme Gradient Boosting

The proliferation of malware variants is arguably the greatest problem in PC security, and protecting information in the form of source code against unauthorized access is a central issue in computer security. In recent times, machine learning has been extensively researched for malware detection, and ensemble techniques have been established as highly effective in terms of detection accuracy. This paper proposes a framework that combines Chi-square as the feature selection method with eight ensemble learning classifiers built on five base learners: K-Nearest Neighbors, Naïve Bayes, Support Vector Machine, Decision Trees, and Logistic Regression. Among the base learners, K-Nearest Neighbors returns the highest accuracy: 95.37% with Chi-square feature selection and 87.89% without feature selection. Among the ensembles, the Extreme Gradient Boosting Classifier achieves the highest accuracy: 97.407% with Chi-square feature selection and 91.72% without feature selection. Extreme Gradient Boosting Classifier and Random Forest lead in the seven evaluation measures with Chi-square feature selection and without feature selection, respectively. The study results show that tree-based ensemble models are compelling for malware classification.
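To make the Chi-square selection step concrete, here is a minimal sketch in plain Python. It scores each categorical feature with the Pearson chi-square statistic of its contingency table against the class label and keeps the highest-scoring features. The function names, the toy columns, and the choice of the contingency-table form of the statistic are illustrative assumptions; the abstract does not say which implementation or feature encoding the authors used.

```python
from collections import Counter


def chi_square(feature, labels):
    """Pearson chi-square statistic between a categorical feature and the labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # joint counts per (value, class) cell
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(labels)            # column totals
    stat = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in label_totals.items():
            # Expected count if feature and label were independent.
            expected = value_count * label_count / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def select_top_k(columns, labels, k):
    """Return the names of the k features with the largest chi-square score."""
    scores = {name: chi_square(col, labels) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: 1 = malware, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "uses_packer": [1, 1, 1, 0, 0, 0],   # perfectly associated with the label
    "has_icon":    [1, 0, 1, 0, 1, 0],   # weakly associated with the label
}
selected, scores = select_top_k(columns, labels, k=1)
```

The selected columns would then be passed to the base learners or ensembles in place of the full feature set.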


Reparameterizable Subset Sampling via Continuous Relaxations

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2019/544 ◽ 2019

Author(s): Sang Michael Xie ◽ Stefano Ermon

Keyword(s): Machine Learning ◽ Feature Selection ◽ Stochastic Optimization ◽ Nearest Neighbors ◽ Improve Performance ◽ K Nearest Neighbors ◽ Single Item ◽ Underlying Distribution ◽ Learning Tasks ◽ Continuous Relaxation

Many machine learning tasks require sampling a subset of items from a collection based on a parameterized distribution. The Gumbel-softmax trick can be used to sample a single item, and allows for low-variance reparameterized gradients with respect to the parameters of the underlying distribution. However, stochastic optimization involving subset sampling is typically not reparameterizable. To overcome this limitation, we define a continuous relaxation of subset sampling that provides reparameterization gradients by generalizing the Gumbel-max trick. We use this approach to sample subsets of features in an instance-wise feature selection task for model interpretability, subsets of neighbors to implement a deep stochastic k-nearest neighbors model, and sub-sequences of neighbors to implement parametric t-SNE by directly comparing the identities of local neighbors. We improve performance in all these tasks by incorporating subset sampling in end-to-end training.
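The idea of generalizing the Gumbel-max trick from one item to a subset can be sketched as follows. Perturbing each log-weight with Gumbel noise and taking the k largest keys yields a size-k sample without replacement. The relaxed variant shown here, which accumulates k successive temperature-scaled softmaxes into a soft k-hot vector, is one possible continuous relaxation and is an assumption for illustration; the abstract does not give the exact form used in the paper. All function names are hypothetical.

```python
import math
import random


def gumbel_keys(log_weights, rng):
    """Perturb each log-weight with independent Gumbel(0, 1) noise."""
    keys = []
    for lw in log_weights:
        u = max(rng.random(), 1e-12)  # avoid log(0)
        keys.append(lw - math.log(-math.log(u)))
    return keys


def hard_top_k(keys, k):
    """Indices of the k largest keys: a discrete subset sample."""
    return sorted(range(len(keys)), key=lambda i: keys[i], reverse=True)[:k]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_top_k(keys, k, temperature):
    """Soft k-hot vector built from k successive softmaxes (assumed relaxation)."""
    alpha = list(keys)
    khot = [0.0] * len(keys)
    for _ in range(k):
        probs = softmax([a / temperature for a in alpha])
        khot = [h + p for h, p in zip(khot, probs)]
        # Down-weight items that were (softly) selected in this round.
        alpha = [a + math.log(max(1.0 - p, 1e-12)) for a, p in zip(alpha, probs)]
    return khot


rng = random.Random(0)
keys = gumbel_keys([0.0, 0.5, 1.0, 1.5, 2.0], rng)
subset = hard_top_k(keys, 3)          # discrete sample, not differentiable
soft_subset = relaxed_top_k(keys, 3, temperature=0.5)  # entries sum to about 3
```

In an autodiff framework the soft k-hot vector would be differentiable with respect to the log-weights; as the temperature decreases, its largest entries tend to line up with the hard top-k selection.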


Weighted k-nearest neighbors feature selection for high-dimensional multi-class data

2019 IEEE International Conference on Systems, Man and Cybernetics (SMC) ◽ 10.1109/smc.2019.8914434 ◽ 2019

Author(s): Peter Bugata ◽ Peter Drotar

Keyword(s): Feature Selection ◽ Nearest Neighbors ◽ High Dimensional ◽ K Nearest Neighbors ◽ Selection For


Enhancement of Performance of K-Nearest Neighbors Classifiers for the Prediction of Diabetes Using Feature Selection Method

2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA) ◽ 10.1109/iccca49541.2020.9250887 ◽ 2020

Author(s): Subhash Chandra Gupta ◽ Noopur Goel

Keyword(s): Feature Selection ◽ Feature Selection Method ◽ Nearest Neighbors ◽ Selection Method ◽ K Nearest Neighbors ◽ Prediction Of Diabetes


Factorial Analysis for Gas Leakage Risk Predictions from a Vehicle-Based Methane Survey

Applied Sciences ◽ 10.3390/app12010115 ◽ 2021 ◽ Vol 12 (1) ◽ pp. 115

Author(s): Khongorzul Dashdondov ◽ Mi-Hwa Song

Keyword(s): Air Pollution ◽ Feature Selection ◽ Natural Gas ◽ Open Data ◽ Nearest Neighbors ◽ Factorial Analysis ◽ Test Results ◽ Accuracy Rate ◽ K Nearest Neighbors ◽ Gas Leakage

Natural gas (NG), typically methane, is released into the air, causing significant air pollution and environmental and health problems. There is now a widespread need for machine-based methods to predict gas losses. In this article, we propose predicting NG leakage levels through feature selection based on a factorial analysis (FA) of the USA’s urban natural gas open data. The method has three modules. First, we select essential features using FA. Then, the dataset is labeled by k-means clustering with OrdinalEncoder (OE)-based normalization. The final module uses six algorithms (extreme gradient boost (XGBoost), K-nearest neighbors (KNN), decision tree (DT), random forest (RF), Naive Bayes (NB), and multilayer perceptron (MLP)) to predict gas leakage levels. The proposed method is evaluated by accuracy, F1-score, mean standard error (MSE), and area under the ROC curve (AUC). The test results indicate that the F-OE-based classification method improves prediction performance. Moreover, F-OE-based XGBoost (F-OE-XGBoost) showed the best performance, with 95.14% accuracy, an F1-score of 95.75%, an MSE of 0.028, and an AUC of 96.29%. The second-best results, an accuracy of 95.09%, an F1-score of 95.60%, an MSE of 0.029, and an AUC of 96.11%, were achieved by the F-OE-RF model.
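The labeling step, where unlabeled records receive leakage levels from k-means clusters after ordinal encoding, might look roughly like the sketch below. It is a one-dimensional, pure-Python simplification under stated assumptions: the feature names, the deterministic center initialization, and the use of a single numeric feature are all hypothetical and are not taken from the paper.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    categories = sorted(set(values))
    mapping = {c: i for i, c in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def kmeans_1d(xs, k, iters=50):
    """Tiny 1-D k-means; centers start evenly spread so labels are ordered levels."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assignment step: nearest center for each point.
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        # Update step: mean of each cluster (keep the old center if a cluster is empty).
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical categorical feature, encoded to integers before modeling.
pipe_material, material_codes = ordinal_encode(["steel", "iron", "steel", "plastic"])

# Hypothetical methane readings; cluster index doubles as a leakage level (0 = low).
readings = [0.1, 0.2, 0.15, 5.0, 5.2, 9.8, 10.0]
levels, centers = kmeans_1d(readings, k=3)
```

The resulting cluster indices would serve as class labels for the downstream classifiers.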


Mapping the Growing Stem Volume of the Coniferous Plantations in North China Using Multispectral Data from Integrated GF-2 and Sentinel-2 Images and an Optimized Feature Variable Selection Method

Remote Sensing ◽ 10.3390/rs13142740 ◽ 2021 ◽ Vol 13 (14) ◽ pp. 2740

Author(s): Xinyu Li ◽ Hui Lin ◽ Jiangping Long ◽ Xiaodong Xu

Keyword(s): Feature Selection ◽ Random Forest ◽ Nearest Neighbors ◽ Stem Volume ◽ K Nearest Neighbors ◽ Combination Optimization ◽ Variable Combination ◽ Coniferous Plantations ◽ Selection Of ◽ Sentinel 2

Accurate measurement of forest growing stem volume (GSV) is important for forest resource management and ecosystem dynamics monitoring. Optical remote sensing imagery has great application prospects in forest GSV estimation on regional and global scales, as it is easily accessible, has wide coverage, and relies on mature technology. However, its application is limited by cloud coverage, data stripes, atmospheric effects, and satellite sensor errors. Combining multi-sensor data can reduce such limitations by increasing data availability, but it also raises the dimensionality of the data, which makes feature selection more difficult. In this study, GaoFen-2 (GF-2) and Sentinel-2 images were integrated, and feature variables and data scenarios were derived by a proposed adaptive feature variable combination optimization (AFCO) program for estimating the GSV of coniferous plantations. The AFCO algorithm was compared to four traditional feature variable selection methods, namely, random forest (RF), stepwise random forest (SRF), the fast iterative feature selection method for k-nearest neighbors (KNN-FIFS), and the feature variable screening and combination optimization procedure based on the distance correlation coefficient and k-nearest neighbors (DC-FSCK). The comparison indicated that the AFCO program not only considered the combination effect of feature variables, but also optimized the selection of the first feature variable, the error threshold, and the selection of the estimation model. Furthermore, we selected feature variables from three datasets (GF-2, Sentinel-2, and the integrated data) following the AFCO and the four other feature selection methods, and used k-nearest neighbors (KNN) and random forest regression (RFR) to estimate the GSV of coniferous plantations in northern China.
The results indicated that the integrated data improved the GSV estimation accuracy of coniferous plantations, with relative root mean square errors (RMSErs) of 15.0% and 19.6%, which were lower than those of GF-2 and Sentinel-2 data, respectively. In particular, the texture feature variables derived from the GF-2 red band image had a significant impact on the GSV estimation performance of the integrated dataset. For most data scenarios, the AFCO algorithm produced more accurate GSV estimates, as the RMSErs were 30.0%, 23.7%, 17.7%, and 17.5% lower than those of RF, SRF, KNN-FIFS, and DC-FSCK, respectively. The GSV distribution map obtained by the AFCO method and RFR model matched the field observations well. This study provides some insight into the application of optical images, optimization of the feature variable combination, and modeling algorithm selection for estimating the GSV of coniferous plantations.
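For readers unfamiliar with the RMSEr metric, the sketch below computes it under a common definition: the root mean square error divided by the mean of the observed values, expressed as a percentage. The abstract does not spell out the formula the authors used, so this definition, the function name, and the sample values are assumptions.

```python
import math


def relative_rmse(observed, predicted):
    """RMSE as a percentage of the mean observed value (assumed definition)."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    mean_observed = sum(observed) / n
    return 100.0 * rmse / mean_observed


# Hypothetical plot-level GSV values (m^3/ha) and model estimates.
observed_gsv = [100.0, 200.0]
estimated_gsv = [110.0, 190.0]
rmser = relative_rmse(observed_gsv, estimated_gsv)
```

Because the error is normalized by the mean observation, RMSEr values can be compared across datasets whose GSV ranges differ.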
