scholarly journals Oversampling Imbalanced Data Based on Convergent WGAN for Network Threat Detection

2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Yanping Xu ◽  
Xiaoyu Zhang ◽  
Zhenliang Qiu ◽  
Xia Zhang ◽  
Jian Qiu ◽  
...  

Class imbalance is a common problem in network threat detection. Oversampling the minority class is regarded as a popular countermeasure by generating enough new minority samples. Generative adversarial network (GAN) is a typical generative model that can generate any number of artificial minority samples, which are close to the real data. However, it is difficult to train GAN, and the Nash equilibrium is almost impossible to achieve. Therefore, in order to improve the training stability of GAN for oversampling to detect the network threat, a convergent WGAN-based oversampling model called convergent WGAN (CWGAN) is proposed in this paper. The training process of CWGAN contains multiple iterations. In each iteration, the training epochs of the discriminator are dynamic, which is determined by the convergence of discriminator loss function in the last two iterations. When the discriminator is trained to convergence, the generator will then be trained to generate new minority samples. The experiment results show that CWGAN not only improve the training stability of WGAN on the loss smoother and closer to 0 but also improve the performance of the minority class through oversampling, which means that CWGAN can improve the performance of network threat detection.

Author(s):  
Alessio Bernardo ◽  
Emanuele Della Valle

AbstractThe world is constantly changing, and so are the massive amount of data produced. However, only a few studies deal with online class imbalance learning that combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy to be prepended to any streaming machine learning classification algorithm aiming at oversampling the minority class using a new version of Smote and Borderline-Smote inspired by Data Sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We bring statistical evidence that VFC-SMOTE pipelines learn models whose minority class performances are better than state-of-the-art. Moreover, we analyze the time/memory consumption and the concept drift recovery speed.


Author(s):  
Dariusz Brzezinski ◽  
Leandro L. Minku ◽  
Tomasz Pewinski ◽  
Jerzy Stefanowski ◽  
Artur Szumaczuk

AbstractClass imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and presence of unsafe types of examples. Despite often being present in real-world data, the interactions between concept drifts and local data difficulty factors have not been investigated in concept drifting data streams yet. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations, on predictions of representative online classifiers. Experimental results reveal the high influence of new considered factors and their local drifts, as well as differences in existing classifiers’ reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address challenges posed by imbalanced data streams.


2021 ◽  
Vol 11 (5) ◽  
pp. 2166
Author(s):  
Van Bui ◽  
Tung Lam Pham ◽  
Huy Nguyen ◽  
Yeong Min Jang

In the last decade, predictive maintenance has attracted a lot of attention in industrial factories because of its wide use of the Internet of Things and artificial intelligence algorithms for data management. However, in the early phases where the abnormal and faulty machines rarely appeared in factories, there were limited sets of machine fault samples. With limited fault samples, it is difficult to perform a training process for fault classification due to the imbalance of input data. Therefore, data augmentation was required to increase the accuracy of the learning model. However, there were limited methods to generate and evaluate the data applied for data analysis. In this paper, we introduce a method of using the generative adversarial network as the fault signal augmentation method to enrich the dataset. The enhanced data set could increase the accuracy of the machine fault detection model in the training process. We also performed fault detection using a variety of preprocessing approaches and classified the models to evaluate the similarities between the generated data and authentic data. The generated fault data has high similarity with the original data and it significantly improves the accuracy of the model. The accuracy of fault machine detection reaches 99.41% with 20% original fault machine data set and 93.1% with 0% original fault machine data set (only use generate data only). Based on this, we concluded that the generated data could be used to mix with original data and improve the model performance.


Mathematics ◽  
2021 ◽  
Vol 9 (9) ◽  
pp. 936
Author(s):  
Jianli Shao ◽  
Xin Liu ◽  
Wenqing He

Imbalanced data exist in many classification problems. The classification of imbalanced data has remarkable challenges in machine learning. The support vector machine (SVM) and its variants are popularly used in machine learning among different classifiers thanks to their flexibility and interpretability. However, the performance of SVMs is impacted when the data are imbalanced, which is a typical data structure in the multi-category classification problem. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances for a multi-class population. We propose a multi-class data-dependent kernel function for the SVM by considering class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superb performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. Not only does the proposed method outperform the competitor methods in terms of the commonly used accuracy measures such as the F-score and G-means, but also successfully detects more than 60% of instances from the rare class in the real data, while the competitors can only detect less than 20% of the rare class instances. The proposed method will benefit other scientific research fields, such as multiple region boundary detection.


2021 ◽  
Vol 25 (5) ◽  
pp. 1169-1185
Author(s):  
Deniu He ◽  
Hong Yu ◽  
Guoyin Wang ◽  
Jie Li

The problem of initialization of active learning is considered in this paper. Especially, this paper studies the problem in an imbalanced data scenario, which is called as class-imbalance active learning cold-start. The novel method is two-stage clustering-based active learning cold-start (ALCS). In the first stage, to separate the instances of minority class from that of majority class, a multi-center clustering is constructed based on a new inter-cluster tightness measure, thus the data is grouped into multiple clusters. Then, in the second stage, the initial training instances are selected from each cluster based on an adaptive candidate representative instances determination mechanism and a clusters-cyclic instance query mechanism. The comprehensive experiments demonstrate the effectiveness of the proposed method from the aspects of class coverage, classification performance, and impact on active learning.


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Zhangguo Tang ◽  
Junfeng Wang ◽  
Huanzhou Li ◽  
Jian Zhang ◽  
Junhao Wang

In the intelligent era of human-computer symbiosis, the use of machine learning method for covert communication confrontation has become a hot topic of network security. The existing covert communication technology focuses on the statistical abnormality of traffic behavior and does not consider the sensory abnormality of security censors, so it faces the core problem of lack of cognitive ability. In order to further improve the concealment of communication, a game method of “cognitive deception” is proposed, which is aimed at eliminating the anomaly of traffic in both behavioral and cognitive dimensions. Accordingly, a Wasserstein Generative Adversarial Network of Covert Channel (WCCGAN) model is established. The model uses the constraint sampling of cognitive priors to construct the constraint mechanism of “functional equivalence” and “cognitive equivalence” and is trained by a dynamic strategy updating learning algorithm. Among them, the generative module adopts joint expression learning which integrates network protocol knowledge to improve the expressiveness and discriminability of traffic cognitive features. The equivalent module guides the discriminant module to learn the pragmatic relevance features through the activity loss function of traffic and the application loss function of protocol for end-to-end training. The experimental results show that WCCGAN can directly synthesize traffic with comprehensive concealment ability, and its behavior concealment and cognitive deception are as high as 86.2% and 96.7%, respectively. Moreover, the model has good convergence and generalization ability and does not depend on specific assumptions and specific covert algorithms, which realizes a new paradigm of cognitive game in covert communication.


2020 ◽  
Author(s):  
Fajr Alarsan ◽  
Mamoon Younes

Abstract Generative Adversarial Networks (GANs) are most popular generative frameworks that have achieved compelling performance. They follow an adversarial approach where two deep models generator and discriminator compete with each other In this paper, we propose a Generative Adversarial Network with best hyper-parameters selection to generate fake images for digits number 1 to 9 with generator and train discriminator to decide whereas the generated images are fake or true. Using Genetic Algorithm technique to adapt GAN hyper-parameters, the final method is named GANGA:Generative Adversarial Network with Genetic Algorithm. Anaconda environment with tensorflow library facilitates was used, python as programming language also used with needed libraries. The implementation was done using MNIST dataset to validate our work. The proposed method is to let Genetic algorithm to choose best values of hyper-parameters depending on minimizing a cost function such as a loss function or maximizing accuracy function. GA was used to select values of Learning rate, Batch normalization, Number of neurons and a parameter of Dropout layer.


Author(s):  
Cara Murphy ◽  
John Kerekes

The classification of trace chemical residues through active spectroscopic sensing is challenging due to the lack of physics-based models that can accurately predict spectra. To overcome this challenge, we leveraged the field of domain adaptation to translate data from the simulated to the measured domain for training a classifier. We developed the first 1D conditional generative adversarial network (GAN) to perform spectrum-to-spectrum translation of reflectance signatures. We applied the 1D conditional GAN to a library of simulated spectra and quantified the improvement in classification accuracy on real data using the translated spectra for training the classifier. Using the GAN-translated library, the average classification accuracy increased from 0.622 to 0.723 on real chemical reflectance data, including data from chemicals not included in the GAN training set.


Author(s):  
Huifang Li ◽  
◽  
Rui Fan ◽  
Qisong Shi ◽  
Zijian Du

Recent advancements in machine learning and communication technologies have enabled new approaches to automated fault diagnosis and detection in industrial systems. Given wide variation in occurrence frequencies of different classes of faults, the class distribution of real-world industrial fault data is usually imbalanced. However, most prior machine learning-based classification methods do not take this imbalance into consideration, and thus tend to be biased toward recognizing the majority classes and result in poor accuracy for minority ones. To solve such problems, we propose a k-means clustering generative adversarial network (KM-GAN)-based fault diagnosis approach able to reduce imbalance in fault data and improve diagnostic accuracy for minority classes. First, we design a new k-means clustering algorithm and GAN-based oversampling method to generate diverse minority-class samples obeying the similar distribution to the original minority data. The k-means clustering algorithm is adopted to divide minority-class samples into k clusters, while a GAN is applied to learn the data distribution of the resulting clusters and generate a given number of minority-class samples as a supplement to the original dataset. Then, we construct a deep neural network (DNN) and deep belief network (DBN)-based heterogeneous ensemble model as a fault classifier to improve generalization, in which DNN and DBN models are trained separately on the resulting dataset, and then the outputs from both are averaged as the final diagnostic result. A series of comparative experiments are conducted to verify the effectiveness of our proposed method, and the experimental results show that our method can improve diagnostic accuracy for minority-class samples.


Sign in / Sign up

Export Citation Format

Share Document