Using Embedded Feature Selection and CNN for Classification on CCD-INID-V1—A New IoT Dataset

Zhipeng Liu; Niraj Thapa; Addison Shaver; Kaushik Roy; Madhuri Siddula; Xiaohong Yuan; Anna Yu

doi:10.3390/s21144834

Sensors ◽

10.3390/s21144834 ◽

2021 ◽

Vol 21 (14) ◽

pp. 4834

Author(s):

Zhipeng Liu ◽

Niraj Thapa ◽

Addison Shaver ◽

Kaushik Roy ◽

Madhuri Siddula ◽

...

Keyword(s):

Feature Selection ◽

Detection System ◽

Attack Detection ◽

Computational Time ◽

Gradient Boosting ◽

Characteristic Operator ◽

Active Devices ◽

Network Intrusion ◽

Security Models ◽

Extreme Gradient Boosting

As Internet of Things (IoT) networks expand globally with an annual increase of active devices, providing better safeguards against threats is becoming more important. An intrusion detection system (IDS) is the most viable solution for mitigating the threat of cyberattacks. Given the many constraints of the ever-changing network environment of IoT devices, an effective yet lightweight IDS is required to detect cyber anomalies and categorize various cyberattacks. Additionally, most publicly available datasets used for research neither reflect recent network behaviors nor come from IoT networks. To address these issues, this paper makes the following contributions: (1) we create a dataset from IoT networks, namely the Center for Cyber Defense (CCD) IoT Network Intrusion Dataset V1 (CCD-INID-V1); (2) we propose a hybrid lightweight IDS that combines an embedded model (EM) for feature selection with a convolutional neural network (CNN) for attack detection and classification. The proposed method has two variants: (a) RCNN, in which Random Forest (RF) is combined with CNN, and (b) XCNN, in which eXtreme Gradient Boosting (XGBoost) is combined with CNN. RF and XGBoost serve as the embedded models that remove less impactful features; (3) we perform anomaly (binary) classification and attack-based (multiclass) classification on CCD-INID-V1 and two other IoT datasets, the detection_of_IoT_botnet_attacks_N_BaIoT dataset (BaIoT) and the CIRA-CIC-DoHBrw-2020 dataset (DoH20), to explore the effectiveness of these learning-based security models. Using RCNN, we achieved an area under the receiver operating characteristic (ROC) curve (AUC) score of 0.956 with a runtime of 32.28 s on CCD-INID-V1, 0.999 with a runtime of 71.46 s on BaIoT, and 0.986 with a runtime of 35.45 s on DoH20. Using XCNN, we achieved an AUC score of 0.998 with a runtime of 51.38 s on CCD-INID-V1, 0.999 with a runtime of 72.12 s on BaIoT, and 0.999 with a runtime of 72.91 s on DoH20. Compared to KNN, XCNN required 86.98% less computational time and RCNN required 91.74% less computational time to achieve equally or more accurate anomaly detection. We find that XCNN and RCNN are consistently efficient and scale well; in particular, they are 1000 times faster than KNN on the relatively larger BaIoT dataset. Finally, we highlight the ability of RCNN and XCNN to detect anomalies accurately with a significant reduction in computational time. This advantage grants flexibility in the IDS placement strategy: our IDS can be placed at a central server as well as on resource-constrained edge devices. Our lightweight IDS requires little training time and hence decreases reaction time to zero-day attacks.
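
The AUC scores quoted in this abstract can be read as the probability that a randomly chosen attack sample receives a higher detector score than a randomly chosen benign sample. The sketch below computes that quantity directly from pairwise comparisons. It is not the authors' evaluation code; the function name `auc_score` and the labels and scores are hypothetical values chosen for illustration.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (attack, benign) pairs ranked correctly.

    Ties between an attack score and a benign score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical detector output: 1 = attack, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = auc_score(labels, scores)  # 8 of the 9 pairs are ordered correctly
```

The pairwise form is quadratic in the number of samples, so it is only meant to make the definition concrete; a rank-based implementation would be preferable on datasets of realistic size.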

IoT Botnet Attack Detection Based on Optimized Extreme Gradient Boosting and Feature Selection

Sensors ◽

10.3390/s20216336 ◽

2020 ◽

Vol 20 (21) ◽

pp. 6336 ◽

Cited By ~ 1

Author(s):

Mnahi Alqahtani ◽

Hassan Mathkour ◽

Mohamed Maher Ben Ismail

Keyword(s):

Feature Selection ◽

Large Scale ◽

Feature Selection Method ◽

Selection Method ◽

Attack Detection ◽

Gradient Boosting ◽

Fisher Score ◽

Detection Approach ◽

Extreme Gradient Boosting ◽

Iot Devices

Nowadays, Internet of Things (IoT) technology has various network applications and has attracted the interest of many research and industrial communities. Particularly, the number of vulnerable or unprotected IoT devices has drastically increased, along with the amount of suspicious activity, such as IoT botnet and large-scale cyber-attacks. In order to address this security issue, researchers have deployed machine and deep learning methods to detect attacks targeting compromised IoT devices. Despite these efforts, developing an efficient and effective attack detection approach for resource-constrained IoT devices remains a challenging task for the security research community. In this paper, we propose an efficient and effective IoT botnet attack detection approach. The proposed approach relies on a Fisher-score-based feature selection method along with a genetic-based extreme gradient boosting (GXGBoost) model in order to determine the most relevant features and to detect IoT botnet attacks. The Fisher score is a representative filter-based feature selection method used to determine significant features and discard irrelevant features through the minimization of intra-class distance and the maximization of inter-class distance. On the other hand, GXGBoost is an optimal and effective model, used to classify the IoT botnet attacks. Several experiments were conducted on a public botnet dataset of IoT devices. The evaluation results obtained using holdout and 10-fold cross-validation techniques showed that the proposed approach had a high detection rate using only three out of the 115 data traffic features and improved the overall performance of the IoT botnet attack detection process.
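
The Fisher score described above rewards features whose class means lie far apart (inter-class distance) relative to the spread inside each class (intra-class distance). A minimal sketch of one common formulation follows, assuming the score for a feature is the size-weighted squared distance of class means from the overall mean, divided by the size-weighted within-class variance. The function name `fisher_scores` and the toy data are hypothetical, and the paper's exact variant may differ.

```python
from statistics import fmean, pvariance


def fisher_scores(rows, labels):
    """Return one Fisher score per feature column (higher = more discriminative)."""
    n_features = len(rows[0])
    classes = sorted(set(labels))
    scores = []
    for j in range(n_features):
        overall_mean = fmean(r[j] for r in rows)
        between = 0.0  # inter-class scatter
        within = 0.0   # intra-class scatter
        for c in classes:
            values = [r[j] for r, y in zip(rows, labels) if y == c]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # Zero spread inside every class: perfectly separable or constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


# Toy data: feature 0 separates the classes, feature 1 does not.
rows = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0],
        [3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
labels = [0, 0, 0, 1, 1, 1]
scores = fisher_scores(rows, labels)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Keeping only the first few indices of `ranking` corresponds to the filter-style selection step; the classifier is then trained on those columns alone.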

Genetic Algorithm Based Feature Selection Technique for Optimal Intrusion Detection

10.20944/preprints202106.0710.v1 ◽

2021 ◽

Author(s):

Sydney Mambwe Kasongo

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Intrusion Detection ◽

Computer Network ◽

Positive Impact ◽

Detection System ◽

Gradient Boosting ◽

Support Vector ◽

Feature Selection Technique ◽

Extreme Gradient Boosting

In recent years, several industries have registered an impressive improvement in technological advances such as Internet of Things (IoT), e-commerce, vehicular networks, etc. These advances have sparked an increase in the volume of information that gets transmitted from different nodes of a computer network (CN). As a result, it is crucial to safeguard CNs against security threats and intrusions that can compromise the integrity of those systems. In this paper, we propose a machine learning (ML) intrusion detection system (IDS) in conjunction with the Genetic Algorithm (GA) for feature selection. To assess the effectiveness of the proposed framework, we use the NSL-KDD dataset. Furthermore, we consider the following ML methods in the modelling process: decision tree (DT), support vector machine (SVM), random forest (RF), extra-trees (ET), extreme gradient boosting (XGB), and naïve Bayes (NB). The results demonstrated that using the GA has a positive impact on the performance of the selected classifiers. Moreover, the results obtained by the proposed ML methods were superior to existing methodologies.
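
GA-based feature selection of the kind described above typically encodes each candidate feature subset as a bit mask and evolves a population of masks toward higher fitness. The sketch below illustrates only that search loop, with tournament selection, one-point crossover, bit-flip mutation and elitism. The fitness function is a stand-in: `RELEVANCE` and `PENALTY` are hypothetical numbers, whereas a real fitness would train and validate a classifier on the selected columns. All names and parameter values are assumptions, not taken from the paper.

```python
import random

# Hypothetical per-feature usefulness, used instead of a real classifier score.
RELEVANCE = [0.9, 0.1, 0.8, 0.05, 0.7, 0.0, 0.02, 0.6]
PENALTY = 0.3  # cost charged for every selected feature


def toy_fitness(mask):
    """Reward useful features and penalise every feature that is kept."""
    return sum(w - PENALTY for w, bit in zip(RELEVANCE, mask) if bit)


def tournament(population, fitness, rng):
    """Pick two random individuals and return the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def evolve_mask(fitness, n_features, pop_size=20, generations=30,
                mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        best = max(population, key=fitness)
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            mother = tournament(population, fitness, rng)
            father = tournament(population, fitness, rng)
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


best_mask = evolve_mask(toy_fitness, n_features=len(RELEVANCE))
selected = [j for j, bit in enumerate(best_mask) if bit]
```

Because the fitness call is the expensive part in practice (one model fit per mask), population size and generation count are usually the first parameters to tune.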

HCRNNIDS: Hybrid Convolutional Recurrent Neural Network-Based Network Intrusion Detection System

Processes ◽

10.3390/pr9050834 ◽

2021 ◽

Vol 9 (5) ◽

pp. 834

Author(s):

Muhammad Ashfaq Khan

Keyword(s):

Neural Network ◽

Intrusion Detection ◽

Recurrent Neural Network ◽

Intrusion Detection System ◽

Detection System ◽

Attack Detection ◽

Network Intrusion Detection ◽

Complex Nature ◽

Network Intrusion ◽

Network Intrusion Detection System

Nowadays, network attacks are the most crucial problem of modern society. All networks, from small to large, are vulnerable to network threats. An intrusion detection (ID) system is critical for mitigating and identifying malicious threats in networks. Currently, deep learning (DL) and machine learning (ML) are being applied in different domains, especially information security, for developing effective ID systems. These ID systems are capable of detecting malicious threats automatically and on time. However, malicious threats are occurring and changing continuously, so the network requires a very advanced security solution. Thus, creating an effective and smart ID system is a massive research problem. Various ID datasets are publicly available for ID research. Due to the complex nature of malicious attacks with a constantly changing attack detection mechanism, publicly existing ID datasets must be modified systematically on a regular basis. So, in this paper, a convolutional recurrent neural network (CRNN) is used to create a DL-based hybrid ID framework that predicts and classifies malicious cyberattacks in the network. In the hybrid convolutional recurrent neural network intrusion detection system (HCRNNIDS), the convolutional neural network (CNN) performs convolution to capture local features, and the recurrent neural network (RNN) captures temporal features to improve the ID system’s performance and prediction. To assess the efficacy of the HCRNNIDS, experiments were done on publicly available ID data, specifically the modern and realistic CSE-CIC-IDS2018 data. The simulation outcomes show that the proposed HCRNNIDS substantially outperforms current ID methodologies, attaining a malicious attack detection accuracy of up to 97.75% on the CSE-CIC-IDS2018 data with 10-fold cross-validation.
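
The 10-fold cross-validation mentioned at the end of this abstract partitions the data into ten folds and uses each fold once as the test set while training on the other nine. A minimal sketch of the index bookkeeping follows; the function name `k_fold_indices` is hypothetical, and the sketch does not shuffle or stratify, which a real evaluation would normally do first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(25, k=10))
```

The reported metric is then usually the average of the per-fold scores, which reduces the dependence of the result on any single train/test split.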

Improved Sampling and Feature Selection to Support Extreme Gradient Boosting For PCOS Diagnosis

2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC) ◽

10.1109/ccwc51732.2021.9375994 ◽

2021 ◽

Author(s):

Muhammad Sakib Khan Inan ◽

Rubaiath E Ulfath ◽

Fahim Irfan Alam ◽

Fateha Khanam Bappee ◽

Rizwan Hasan

Keyword(s):

Feature Selection ◽

Gradient Boosting ◽

Extreme Gradient Boosting

Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detecting malware, such as packet content analysis, are ineffective when dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses and other such metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems.

Evaluating Variable Selection and Machine Learning Algorithms for Estimating Forest Heights by Combining Lidar and Hyperspectral Data

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9090507 ◽

2020 ◽

Vol 9 (9) ◽

pp. 507

Author(s):

Sanjiwana Arjasakusuma ◽

Sandiaga Swahyu Kusuma ◽

Stuart Phinn

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Principal Component ◽

Hyperspectral Data ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Forest Height ◽

Extreme Gradient Boosting

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data of high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with trees (XGBtree and XGBdart) and linear (XGBlin) classifiers were evaluated. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R² = 0.53 and RMSE = 1.7 m (18.4% of nRMSE and 0.046 m of bias) for BO-XGBdart and R² = 0.51 and RMSE = 1.8 m (15.8% of nRMSE and −0.244 m of bias) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection; it could reduce 95% of the data to select the 29 most important variables from the initial 516 variables from lidar metrics and hyperspectral data.
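
The regression metrics reported in this abstract (R², RMSE, nRMSE and bias) can be computed from paired observed and predicted heights. The sketch below assumes R² is the coefficient of determination and that nRMSE is the RMSE divided by the mean observed value; the abstract does not state which normalisation the authors used, so that choice is an assumption, as are the function name `regression_metrics` and the toy heights.

```python
from math import sqrt


def regression_metrics(observed, predicted):
    """Return R^2, RMSE, nRMSE (percent of mean observed) and mean bias."""
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)                # unexplained variation
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total variation
    rmse = sqrt(ss_res / n)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "nrmse_pct": 100.0 * rmse / mean_obs,  # assumed normalisation by the mean
        "bias": sum(residuals) / n,            # positive = over-prediction
    }


# Hypothetical canopy heights in metres.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]
metrics = regression_metrics(observed, predicted)
```

Reporting bias alongside RMSE is useful because a model can have a small average error in one direction that RMSE alone does not reveal.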

Influence Analysis of Feature Selection to Network Intrusion Detection System Performance Using NSL-KDD Dataset

2019 International Conference on Computer Science, Information Technology, and Electrical Engineering (ICOMITEE) ◽

10.1109/icomitee.2019.8920961 ◽

2019 ◽

Cited By ~ 2

Author(s):

Lukman Hakim ◽

Rahilla Fatma ◽

Novriandi

Keyword(s):

Feature Selection ◽

Intrusion Detection ◽

System Performance ◽

Intrusion Detection System ◽

Detection System ◽

Network Intrusion Detection ◽

Influence Analysis ◽

Network Intrusion ◽

Network Intrusion Detection System

A Feature Selection Model for Network Intrusion Detection System Based on PSO, GWO, FFA and GA Algorithms

Symmetry ◽

10.3390/sym12061046 ◽

2020 ◽

Vol 12 (6) ◽

pp. 1046 ◽

Cited By ~ 3

Author(s):

Omar Almomani

Keyword(s):

Genetic Algorithm ◽

Feature Selection ◽

Intrusion Detection ◽

Intrusion Detection System ◽

Detection System ◽

Selection Model ◽

Network Intrusion Detection ◽

Network Intrusion ◽

Proposed Model ◽

Positive Rate

A network intrusion detection system (NIDS) aims to identify malicious activity in a network by investigating network traffic behavior. Data mining and machine learning (ML) approaches are used extensively in NIDSs to discover anomalies. Feature selection plays a significant role in improving NIDS performance, because anomaly detection employs a large number of features that take considerable time to process. The feature selection approach therefore affects both the time needed to investigate traffic behavior and the accuracy level. This study proposes a feature selection model for NIDSs based on particle swarm optimization (PSO), grey wolf optimizer (GWO), firefly optimization (FFA) and genetic algorithm (GA). The proposed model aims at improving the performance of NIDSs. It deploys wrapper-based methods with the GA, PSO, GWO and FFA algorithms for selecting features using Anaconda Python Open Source, and deploys filtering-based methods for the mutual information (MI) of the GA, PSO, GWO and FFA algorithms, producing 13 sets of rules. The features derived from the proposed model are evaluated with the support vector machine (SVM) and J48 ML classifiers on the UNSW-NB15 dataset. In the experiments, Rule 13 (R13) reduces the feature set to 30 features and Rule 12 (R12) reduces it to 13 features. Rule 13 and Rule 12 offer the best results in terms of F-measure, accuracy and sensitivity. GA shows good results in terms of True Positive Rate (TPR) and False Negative Rate (FNR). Rules 11, 9 and 8 show good results in terms of False Positive Rate (FPR), while PSO shows good results in terms of precision and True Negative Rate (TNR). It was found that an intrusion detection system with fewer features achieves higher accuracy. The proposed feature selection model for NIDSs is a rule-based pattern recognition approach for discovering computer network attacks, which falls within the scope of the Symmetry journal.
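
The evaluation measures named in this abstract (TPR or sensitivity, FNR, FPR, TNR, precision, accuracy and F-measure) all derive from the four cells of a binary confusion matrix. The sketch below shows those standard definitions; the function name `confusion_rates` and the example labels are hypothetical and are not results from the paper.

```python
def confusion_rates(y_true, y_pred):
    """Compute standard binary rates, with 1 = attack and 0 = normal traffic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of failing when a class is absent.
        return num / den if den else 0.0

    return {
        "tpr": ratio(tp, tp + fn),        # sensitivity / recall
        "fnr": ratio(fn, tp + fn),
        "fpr": ratio(fp, fp + tn),
        "tnr": ratio(tn, fp + tn),        # specificity
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "f_measure": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical predictions: 4 attacks and 6 normal flows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
rates = confusion_rates(y_true, y_pred)
```

TPR and FNR always sum to one, as do FPR and TNR, which is why a method that does well on one member of each pair necessarily does well on the other.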

Prediction of hot spots in protein–DNA binding interfaces based on supervised isometric feature mapping and extreme gradient boosting

BMC Bioinformatics ◽

10.1186/s12859-020-03683-3 ◽

2020 ◽

Vol 21 (S13) ◽

Cited By ~ 2

Author(s):

Ke Li ◽

Sijia Zhang ◽

Di Yan ◽

Yannan Bin ◽

Junfeng Xia

Keyword(s):

Feature Selection ◽

Manifold Learning ◽

Hot Spots ◽

Large Scale ◽

Computational Method ◽

Gradient Boosting ◽

Feature Mapping ◽

Accessible Information ◽

Extreme Gradient Boosting ◽

Isometric Feature Mapping

Background: Identification of hot spots in protein-DNA interfaces provides crucial information for research on protein-DNA interaction and drug design. As experimental methods for determining hot spots are time-consuming, labor-intensive and expensive, there is a need for a reliable computational method to predict hot spots on a large scale. Results: Here, we proposed a new method named sxPDH based on supervised isometric feature mapping (S-ISOMAP) and extreme gradient boosting (XGBoost) to predict hot spots in protein-DNA complexes. We obtained 114 features from a combination of the protein sequence, structure, network and solvent accessible information, and systematically assessed various feature selection methods and feature dimensionality reduction methods based on manifold learning. The results show that the S-ISOMAP method is superior to other feature selection or manifold learning methods. XGBoost was then used to develop the hot spot prediction model sxPDH based on the three dimensionality-reduced features obtained from S-ISOMAP. Conclusion: Our method sxPDH boosts prediction performance using S-ISOMAP and XGBoost. The AUC of the model is 0.773, and the F1 score is 0.713. Experimental results on a benchmark dataset indicate that sxPDH can achieve generally better performance in predicting hot spots compared to the state-of-the-art methods.

Bootstrap-based homogeneous ensemble feature selection for network intrusion detection system

Developments of Artificial Intelligence Technologies in Computation and Robotics ◽

10.1142/9789811223334_0004 ◽

2020 ◽

Author(s):

Yeshalem Gezahegn Damtew ◽

Hongmei Chen ◽

Burhan Mohi Yu Din

Keyword(s):

Feature Selection ◽

Intrusion Detection ◽

Intrusion Detection System ◽

Detection System ◽

Network Intrusion Detection ◽

Network Intrusion ◽

Network Intrusion Detection System ◽

Selection For ◽

Homogeneous Ensemble
