Efficient Feature Selection for Static Analysis Vulnerability Prediction

Katarzyna Filus; Paweł Boryszko; Joanna Domańska; Miltiadis Siavvas; Erol Gelenbe

doi:10.3390/s21041133

Sensors ◽

10.3390/s21041133 ◽

2021 ◽

Vol 21 (4) ◽

pp. 1133

Author(s):

Katarzyna Filus ◽

Paweł Boryszko ◽

Joanna Domańska ◽

Miltiadis Siavvas ◽

Erol Gelenbe

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Selection Process ◽

Research Effort ◽

Machine Learning Algorithms ◽

Production Cycle ◽

Software Vulnerability ◽

Security Breaches ◽

Vulnerability Prediction ◽

Software Production

Common software vulnerabilities can result in severe security breaches, financial losses, and reputational damage, so research effort is needed to improve software security. The acceleration of the software production cycle, limited testing resources, and the lack of security expertise among programmers call for efficient software vulnerability predictors that highlight the system components on which testing should be focused. Although static code analyzers are often used to improve software quality together with machine learning and data mining for software vulnerability prediction, work on the selection and evaluation of different types of relevant vulnerability features is still limited. Thus, in this paper, we examine features generated by the SonarQube and CCCC tools to identify those that can be used for software vulnerability prediction. We investigate the suitability of thirty-three different features for training thirteen distinct machine learning algorithms to design vulnerability predictors, and we identify the most relevant features that should be used for training. Our evaluation relies on a comprehensive feature selection process that combines correlation analysis of the features with four well-known feature selection techniques. Our experiments, using a large publicly available dataset, facilitate the evaluation and result in the identification of small but efficient sets of features for software vulnerability prediction.
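The abstract does not spell out how the correlation analysis is carried out, so the following is only a minimal sketch of one common variant: a greedy filter that drops any feature whose absolute Pearson correlation with an already kept feature reaches a threshold. The threshold of 0.9, the function names, and the toy metric names and values are all assumptions made for illustration; they are not taken from the paper, SonarQube, or CCCC.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already kept feature is below threshold.

    `features` maps a feature name to its list of values (one per component).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy metrics for five software components (not real tool output).
metrics = {
    "lines_of_code": [100, 200, 300, 400, 500],
    "statements": [90, 210, 290, 410, 495],
    "comment_density": [0.30, 0.10, 0.25, 0.05, 0.20],
}
selected = drop_correlated(metrics)
print(selected)
```

In this toy data, `statements` tracks `lines_of_code` almost perfectly and is dropped, while `comment_density` is kept. A filter like this only removes redundancy; it would normally be followed by a selection technique that scores features against the vulnerability label.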

Sentiment Analysis of Movie Reviews: A Study of Machine Learning Algorithms with Various Feature Selection Methods

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v5i9.113121 ◽

2017 ◽

Vol 5 (9) ◽

Cited By ~ 1

Author(s):

Rajwinder Kaur

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Selection Methods

Feature-Selection and Mutual-Clustering Approaches to Improve DoS Detection and Maintain WSNs’ Lifetime

Sensors ◽

10.3390/s21144821 ◽

2021 ◽

Vol 21 (14) ◽

pp. 4821

Author(s):

Rami Ahmad ◽

Raniyah Wazirali ◽

Qusay Bsoul ◽

Tarik Abu-Ain ◽

Waleed Abu-Ain

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Open Field ◽

Network Lifetime ◽

Detection Efficiency ◽

Denial Of Service ◽

Harmony Search ◽

Machine Learning Algorithms ◽

Transport Layer ◽

Feature Selection Techniques

Wireless Sensor Networks (WSNs) continue to face two major challenges: energy and security. As a consequence, one of the WSN-related security tasks is to protect them from Denial of Service (DoS) and Distributed DoS (DDoS) attacks. Machine learning-based systems are the only viable option for these types of attacks, as traditional deep packet scanning systems depend on open field inspection in transport layer security packets, which the trend toward open field encryption undermines. Moreover, network data traffic will become more complex as increasing usage raises the amount of data transmitted between WSN nodes. Therefore, feature selection techniques need to be used with machine learning to determine which data are most important in the DoS detection process. This paper examined techniques for improving DoS anomaly detection along with power reservation in WSNs to balance them. A new clustering technique, called the CH_Rotations algorithm, was introduced to improve anomaly detection efficiency over a WSN's lifetime. Furthermore, the use of feature selection techniques with machine learning algorithms in examining WSN node traffic, and the effect of these techniques on the lifetime of WSNs, was evaluated. The evaluation results showed that the Water Cycle (WC) feature selection displayed the best average accuracy, 2%, 5%, 3%, and 3% higher than Particle Swarm Optimization (PSO), Simulated Annealing (SA), Harmony Search (HS), and Genetic Algorithm (GA), respectively. Moreover, the WC with a Decision Tree (DT) classifier showed 100% accuracy with only one feature. In addition, the CH_Rotations algorithm improved network lifetime by 30% compared to the standard LEACH protocol. Network lifetime using the WC + DT technique was reduced by 5% compared to scenarios without WC + DT.

Feature Selection with Fast Correlation-Based Filter for Breast Cancer Prediction and Classification Using Machine Learning Algorithms

2018 International Symposium on Advanced Electrical and Communication Technologies (ISAECT) ◽

10.1109/isaect.2018.8618688 ◽

2018 ◽

Author(s):

Youness Khourdifi ◽

Mohamed Bahaj

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Cancer Prediction

Comparative study on total nitrogen prediction in wastewater treatment plant and effect of various feature selection methods on machine learning algorithms performance

Journal of Water Process Engineering ◽

10.1016/j.jwpe.2021.102033 ◽

2021 ◽

Vol 41 ◽

pp. 102033

Author(s):

Faramarz Bagherzadeh ◽

Mohamad-Javad Mehrani ◽

Milad Basirifard ◽

Javad Roostaei

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Wastewater Treatment ◽

Comparative Study ◽

Total Nitrogen ◽

Wastewater Treatment Plant ◽

Learning Algorithms ◽

Treatment Plant ◽

Machine Learning Algorithms ◽

Selection Methods

Using a Machine Learning Approach to Identify Key Prognostic Molecules for Esophageal Squamous Cell Carcinoma

10.21203/rs.3.rs-310517/v1 ◽

2021 ◽

Author(s):

Meng-Xiang Li ◽

Xiao-Meng Sun ◽

Wei-Gang Cheng ◽

Hao-Jie Ruan ◽

Ke Liu ◽

...

Keyword(s):

Machine Learning ◽

Squamous Cell Carcinoma ◽

Feature Selection ◽

Esophageal Squamous Cell Carcinoma ◽

Molecular Interaction ◽

Prognostic Biomarker ◽

Interaction Network ◽

Machine Learning Algorithms ◽

Functional Modules ◽

Molecular Interaction Network

Objective: Many of the prognostic biomarkers reported so far for esophageal squamous cell carcinoma (ESCC) suffer from low reproducibility because of the high molecular heterogeneity of ESCC. The purpose of this study is to identify the optimal biomarkers for ESCC using machine learning algorithms. Methods: Biomarkers related to clinical survival, recurrence, or therapeutic response of patients with ESCC were determined through literature database searching. Forty-eight biomarkers linked to prognosis of ESCC were used to construct a molecular interaction network based on NetBox and then to identify the functional modules. Publicly available mRNA transcriptome data of ESCC, downloaded from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA), included the GSE53625 and TCGA-ESCC datasets. Five machine learning algorithms, including logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), random forest (RF), and XGBoost, were used to develop prognostic classifiers for feature selection. The area under the ROC curve (AUC) was used to evaluate the performance of the prognostic classifiers. The importance of the 17 molecules in the functional modules was ranked by their occurrence frequencies in the prognostic classifiers. Kaplan-Meier survival analysis and the log-rank test were performed to determine the statistical significance of overall survival. Results: A total of 48 clinically proven molecules associated with ESCC progression were used to construct a molecular interaction network with 3 functional modules comprising 17 component molecules. For each machine learning algorithm, 131,071 prognostic classifiers were built using these 17 molecules. Using the occurrence frequencies in the prognostic classifiers with AUCs greater than the mean value of all 131,071 AUCs to rank the importance of these 17 molecules, stratifin, encoded by SFN, was identified as the optimal prognostic biomarker for ESCC, and its performance was further validated in another 2 independent cohorts. Conclusion: The occurrence frequencies across various feature selection approaches reflect the degree of clinical importance, and stratifin is an optimal prognostic biomarker for ESCC.
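The figure of 131,071 classifiers equals 2^17 − 1, the number of non-empty subsets of 17 items, which suggests (the abstract does not say so explicitly) that one classifier was built per non-empty combination of molecules. The sketch below, under that assumption, checks the count and illustrates the occurrence-frequency ranking on a three-feature toy example. The placeholder names `M1`..`M17`, the features `A`, `B`, `C`, and the AUC values are invented for illustration only.

```python
from collections import Counter
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty combination of `items` as a tuple."""
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)


def rank_by_occurrence(subset_scores):
    """Count how often each feature appears in subsets scoring above the mean AUC.

    `subset_scores` maps a tuple of feature names to that subset's AUC.
    Returns (feature, count) pairs, most frequent first.
    """
    mean_auc = sum(subset_scores.values()) / len(subset_scores)
    counts = Counter()
    for subset, auc in subset_scores.items():
        if auc > mean_auc:
            counts.update(subset)
    return counts.most_common()


# 17 placeholder molecule names; the count matches the figure in the abstract.
molecules = [f"M{i}" for i in range(1, 18)]
n_subsets = sum(1 for _ in nonempty_subsets(molecules))
print(n_subsets)

# Toy example with made-up AUCs for three hypothetical features.
toy_scores = {
    ("A",): 0.70, ("B",): 0.55, ("C",): 0.50,
    ("A", "B"): 0.72, ("A", "C"): 0.71, ("B", "C"): 0.56,
    ("A", "B", "C"): 0.73,
}
ranking = rank_by_occurrence(toy_scores)
print(ranking)
```

In the toy data every above-average subset contains `A`, so `A` ranks first; this mirrors how a single molecule can stand out when it appears in most of the well-performing classifiers.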

Recognition Technology of Athlete’s Limb Movement Combined Based on the Integrated Learning Algorithm

Journal of Sensors ◽

10.1155/2021/3057557 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Fei Tan ◽

Xiaoqing Xie

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithm ◽

Human Motion ◽

Machine Learning Algorithms ◽

Support Vector ◽

Recording Device ◽

Table Tennis ◽

Movement Recognition ◽

Random Forest Tree

Human motion recognition based on inertial sensors is a new research direction in the field of pattern recognition. Inertial sensors are placed on the surface of the human body, and the recorded data go through preprocessing, feature extraction, and feature selection. Finally, the extracted features of human action are classified and recognized. There are many kinds of swing movements in table tennis, and accurately identifying these movement modes is of great significance for swing movement analysis. With the development of artificial intelligence technology, human movement recognition has made many breakthroughs in recent years, from machine learning to deep learning and from wearable sensors to visual sensors. However, there is not much work on movement recognition for table tennis, and the methods still mainly belong to the traditional field of machine learning. Therefore, this paper uses an acceleration sensor as a motion recording device for a table tennis disc and explores the three-axis acceleration data of four common swing motions. Traditional machine learning algorithms (decision tree, random forest, and support vector machine) are used to classify the swing motion, and a classification algorithm based on the idea of integration is designed. Experimental results show that the ensemble learning algorithm developed in this paper outperforms the traditional machine learning algorithms, with an average recognition accuracy of 91%.
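The abstract does not describe how the integrated (ensemble) classifier combines its base models. As a hedged illustration only, the sketch below uses plain majority voting, one of the simplest ensemble schemes, and should not be read as the authors' algorithm. The swing labels and the three prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list by majority vote.

    `predictions` is a list of equal-length label lists, one per classifier.
    Ties are resolved in favour of the label seen first among the voters.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three base classifiers on four swing samples.
tree_pred = ["forehand", "backhand", "push", "chop"]
forest_pred = ["forehand", "backhand", "chop", "chop"]
svm_pred = ["forehand", "push", "push", "backhand"]

ensemble_pred = majority_vote([tree_pred, forest_pred, svm_pred])
print(ensemble_pred)
```

Voting helps when the base classifiers make different mistakes: a sample misread by one model can still be labelled correctly if the other two agree.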

Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detecting malware, such as packet content analysis, are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features, such as packet size, arrival time, source and destination addresses, and other metadata, to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms to achieve better results. Random forest and extreme gradient boosting performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
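The area under the curve values quoted above can be read as the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one. The sketch below computes AUC directly from that pairwise definition; it is a small illustration with made-up labels and scores, not the evaluation code used in the paper, and the function name `roc_auc` is an assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for positive (e.g. malicious) and 0 for negative samples;
    `scores` holds the classifier's score for each sample. Ties count as 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up test-set labels and scores for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

The pairwise form is quadratic in the number of samples, so it suits small examples; for large test sets a rank-based computation gives the same value more cheaply.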

Evaluating Variable Selection and Machine Learning Algorithms for Estimating Forest Heights by Combining Lidar and Hyperspectral Data

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9090507 ◽

2020 ◽

Vol 9 (9) ◽

pp. 507

Author(s):

Sanjiwana Arjasakusuma ◽

Sandiaga Swahyu Kusuma ◽

Stuart Phinn

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Principal Component ◽

Hyperspectral Data ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Forest Height ◽

Extreme Gradient Boosting

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data of high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) were evaluated in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with trees (XGBtree and XGBdart) and linear (XGBlin) classifiers. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R² = 0.53 and RMSE = 1.7 m (18.4% of nRMSE and 0.046 m of bias) for BO-XGBdart and R² = 0.51 and RMSE = 1.8 m (15.8% of nRMSE and −0.244 m of bias) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection; it could reduce the data by 95%, selecting the 29 most important variables from the initial 516 variables from lidar metrics and hyperspectral data.
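For readers unfamiliar with the reported metrics, the sketch below computes R², RMSE, nRMSE, and bias from observed and predicted heights. The abstract does not state how nRMSE is normalised, so dividing RMSE by the mean observed value is an assumption here (normalising by the observed range is another common choice), as is defining bias as the mean of predicted minus observed. The height values are invented.

```python
from math import sqrt


def regression_metrics(observed, predicted):
    """Return R^2, RMSE, nRMSE (percent of mean observed, an assumed
    normalisation) and bias (mean of predicted minus observed)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = sqrt(ss_res / n)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": rmse,
        "nrmse_pct": 100 * rmse / mean_obs,
        "bias": sum(residuals) / n,
    }


# Invented forest heights in metres, for illustration only.
observed_m = [10.0, 12.0, 14.0, 16.0]
predicted_m = [11.0, 12.0, 13.0, 17.0]
metrics = regression_metrics(observed_m, predicted_m)
print(metrics)
```

A positive bias under this definition means the model overestimates height on average, while RMSE and nRMSE summarise the typical size of the error regardless of sign.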

Feature Selection and Polarity Classification Using Machine Learning Algorithms NB & SVM

SSRN Electronic Journal ◽

10.2139/ssrn.3419763 ◽

2019 ◽

Author(s):

Smita Bhanap ◽

Seema Babrekar

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Polarity Classification

Integrated Long-Term Stock Selection Models Based on Feature Selection and Machine Learning Algorithms for China Stock Market

IEEE Access ◽

10.1109/access.2020.2969293 ◽

2020 ◽

Vol 8 ◽

pp. 22672-22685 ◽

Cited By ~ 2

Author(s):

Xianghui Yuan ◽

Jin Yuan ◽

Tianzhao Jiang ◽

Qurat Ul Ain

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Stock Market ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Stock Selection ◽

Selection Models ◽

China Stock Market
