Imbalanced Data Sets Classification Based on SVM for Sand-Dust Storm Warning

Discrete Dynamics in Nature and Society ◽

10.1155/2015/562724 ◽

2015 ◽

Vol 2015 ◽

pp. 1-8

Author(s):

Yonghua Xie ◽

Yurong Liu ◽

Qingqiu Fu

Keyword(s):

Dust Storm ◽

Adaptive Sampling ◽

Imbalanced Data ◽

Real Data ◽

Classification Performance ◽

Selection Strategy ◽

Data Sets ◽

Minority Class ◽

Redundant Data ◽

Sand Dust

To improve SVM classification of imbalanced sand-dust storm data sets, this paper proposes a hybrid self-adaptive sampling method named the SRU-AIBSMOTE algorithm. The method adaptively adjusts its neighbor selection strategy based on the internal distribution of the sample sets. It produces virtual minority class instances through randomized interpolation in the spherical space formed by minority class instances and their neighbors. Random undersampling is also applied to the majority class instances to remove redundant data from the sample sets. Comparative experiments on real data sets from the Yanchi and Tongxin districts in Ningxia, China, show that the SRU-AIBSMOTE method achieves better classification performance than some traditional classification methods.
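The two sampling steps named in this abstract can be illustrated with a minimal sketch. It is not the SRU-AIBSMOTE algorithm itself: the adaptive neighbor selection and the spherical interpolation are not reproduced, and plain SMOTE-style interpolation along the segment between an instance and a neighbor is swapped in instead. The function names, the parameter `k`, and the toy data are assumptions made for illustration, and the minority set is assumed to hold at least two instances.

```python
import math
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean)."""
    return sorted(others, key=lambda q: math.dist(point, q))[:k]


def interpolate_minority(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each on the segment between a random
    minority instance and one of its k nearest minority neighbors."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        candidates = [p for p in minority if p is not base]
        neighbor = rng.choice(nearest_neighbors(base, candidates, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


def random_undersample(majority, n_keep, rng=None):
    """Keep a random subset of n_keep majority instances."""
    rng = rng or random.Random()
    return rng.sample(majority, n_keep)


# Hypothetical toy data: 3 minority points, 10 majority points.
rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(float(i), 5.0) for i in range(10)]
new_points = interpolate_minority(minority, 4, k=2, rng=rng)
kept = random_undersample(majority, 7, rng=rng)
print(len(minority) + len(new_points), len(kept))
```

With these toy sizes both classes end up with seven instances, which is the kind of rebalancing a hybrid oversampling and undersampling scheme aims for.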

An Improved Oversampling Algorithm Based on the Samples’ Selection Strategy for Classifying Imbalanced Data

Mathematical Problems in Engineering ◽

10.1155/2019/3526539 ◽

2019 ◽

Vol 2019 ◽

pp. 1-13 ◽

Cited By ~ 3

Author(s):

Wenhao Xie ◽

Gongqian Liang ◽

Zhonghui Dong ◽

Baoyu Tan ◽

Baosheng Zhang

Keyword(s):

Imbalanced Data ◽

Classification Performance ◽

Mean Value ◽

Selection Strategy ◽

Data Sets ◽

Minority Class ◽

Imbalanced Data Sets ◽

Imbalanced Data Classification ◽

Samples Selection ◽

Data Level

Imbalanced data refers to a data set in which at least one class is outnumbered by the other classes. Imbalanced data sets exist widely in the real world, and their classification has become one of the hottest issues in the field of data mining. At present, classification solutions for imbalanced data sets work mainly at the algorithm level and the data level. At the data level, both oversampling and undersampling strategies are used to balance the data via data reconstruction. SMOTE and Random-SMOTE are two classic oversampling algorithms, but they still have drawbacks such as blind interpolation and fuzzy class boundaries. In this paper, an improved oversampling algorithm based on a sample selection strategy is proposed for imbalanced data classification. Building on the Random-SMOTE algorithm, the support vectors (SV) are extracted and treated as the parent samples from which new minority class examples are synthesized in order to balance the data. Lastly, the imbalanced data sets are classified with the SVM classification algorithm. F-measure, G-mean, ROC curve, and AUC are selected as the performance evaluation indexes. Experimental results show that the improved algorithm achieves good classification performance on imbalanced data sets.
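Two of the evaluation indexes named above can be computed directly from binary confusion counts. The sketch below uses the standard definitions (F-measure with equal weight on precision and recall, G-mean as the geometric mean of sensitivity and specificity) with the minority class as the positive class. The abstract does not state which variants the paper uses, so these formulas and the example counts are assumptions.

```python
import math


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical confusion counts: 10 minority and 90 majority instances.
tp, fn, fp, tn = 8, 2, 4, 86
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(accuracy, f_measure(tp, fp, fn), g_mean(tp, fn, tn, fp))
```

Unlike plain accuracy, both measures drop to zero when no minority instance is recognized (tp = 0), which is why they are commonly preferred for imbalanced data.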

Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Foundations of Computing and Decision Sciences ◽

10.1515/fcds-2017-0007 ◽

2017 ◽

Vol 42 (2) ◽

pp. 149-176 ◽

Cited By ~ 7

Author(s):

Szymon Wojciechowski ◽

Szymon Wilk

Keyword(s):

Experimental Study ◽

Class Imbalance ◽

Imbalanced Data ◽

Classification Performance ◽

Data Sets ◽

Artificial Data ◽

Minority Class ◽

Imbalanced Data Sets ◽

The Impact

In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare, and outliers) in the minority class. The results revealed that the latter factor was the most critical one and that it exacerbated other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors: 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.

A two-stage clustering-based cold-start method for active learning

Intelligent Data Analysis ◽

10.3233/ida-205393 ◽

2021 ◽

Vol 25 (5) ◽

pp. 1169-1185

Author(s):

Deniu He ◽

Hong Yu ◽

Guoyin Wang ◽

Jie Li

Keyword(s):

Active Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Cold Start ◽

Classification Performance ◽

The Novel ◽

Two Stage ◽

Minority Class ◽

Novel Method ◽

Multiple Clusters

This paper considers the problem of initializing active learning. In particular, it studies the problem in an imbalanced data scenario, which is called class-imbalance active learning cold-start. The novel method is two-stage clustering-based active learning cold-start (ALCS). In the first stage, to separate the instances of the minority class from those of the majority class, a multi-center clustering is constructed based on a new inter-cluster tightness measure, so that the data is grouped into multiple clusters. Then, in the second stage, the initial training instances are selected from each cluster based on an adaptive mechanism for determining candidate representative instances and a clusters-cyclic instance query mechanism. Comprehensive experiments demonstrate the effectiveness of the proposed method in terms of class coverage, classification performance, and impact on active learning.

Improving the classification performance on imbalanced data sets via new hybrid parameterisation model

Journal of King Saud University - Computer and Information Sciences ◽

10.1016/j.jksuci.2019.04.009 ◽

2019 ◽

Author(s):

Masurah Mohamad ◽

Ali Selamat ◽

Imam Much Subroto ◽

Ondrej Krejcar

Keyword(s):

Imbalanced Data ◽

Classification Performance ◽

Data Sets ◽

Imbalanced Data Sets

Granular Classification for Imbalanced Datasets: A Minkowski Distance-Based Method

Algorithms ◽

10.3390/a14020054 ◽

2021 ◽

Vol 14 (2) ◽

pp. 54

Author(s):

Chen Fu ◽

Jianhua Yang

Keyword(s):

Imbalanced Data ◽

Main Idea ◽

Fuzzy Rule ◽

Classification Performance ◽

Distance Measures ◽

Minkowski Distance ◽

Imbalanced Datasets ◽

Minority Class ◽

Information Granules ◽

Practical Applications

The problem of classification for imbalanced datasets is frequently encountered in practical applications. The data to be classified in this problem are skewed, i.e., the samples of one class (the minority class) are far fewer than those of the other classes (the majority classes). When dealing with imbalanced datasets, most classifiers share a common limitation: they often achieve better classification performance on the majority classes than on the minority class. To alleviate this limitation, this study proposes a fuzzy rule-based modeling approach using information granules. Information granules, as entities derived and abstracted from data, can be used to describe and capture the characteristics (distribution and structure) of data from both majority and minority classes. Since the geometric characteristics of information granules depend on the distance measures used in the granulation process, the main idea of this study is to construct information granules on each class of imbalanced data using Minkowski distance measures and then to establish the classification models by using “If-Then” rules. Experimental results on synthetic and publicly available datasets show that the proposed Minkowski distance-based method can produce information granules with a range of geometric shapes and construct granular models with satisfactory classification performance for imbalanced datasets.
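The dependence of granule shape on the distance measure can be seen from the Minkowski distance itself, d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p). The sketch below only computes that distance for a few orders p; it does not reproduce the paper's granulation process or its rule construction, and the function name and example points are assumptions.

```python
import math


def minkowski_distance(x, y, p):
    """Minkowski distance of order p >= 1 between two equal-length vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance,
    and p = math.inf the Chebyshev (maximum-coordinate) distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("order p must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


origin, point = (0.0, 0.0), (3.0, 4.0)
for order in (1, 2, math.inf):
    print(order, minkowski_distance(origin, point, order))
```

The set of points within a fixed distance of a center is a diamond for p = 1, a disk for p = 2, and an axis-aligned square for p = infinity in two dimensions, which is one way to read the abstract's remark about granules with different geometric shapes.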

SYNTHETIC OVERSAMPLING OF INSTANCES USING CLUSTERING

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213013500085 ◽

2013 ◽

Vol 22 (02) ◽

pp. 1350008 ◽

Cited By ~ 2

Author(s):

ATLÁNTIDA I. SÁNCHEZ ◽

EDUARDO F. MORALES ◽

JESUS A. GONZALEZ

Keyword(s):

Imbalanced Data ◽

Data Sets ◽

Minority Class ◽

Imbalanced Data Sets ◽

Tuning Parameters ◽

New Methods ◽

Real World Applications ◽

Noisy Examples ◽

F Measure ◽

Better Than

Data sets with an imbalanced class distribution are common in many real-world applications. As the performance of many classifiers tends to degrade on the minority class, several approaches have been proposed to deal with this problem. In this paper, we propose two new cluster-based oversampling methods, SOI-C and SOI-CJ. The proposed methods create clusters from the minority class instances and generate synthetic instances inside those clusters. In contrast with other oversampling methods, the proposed approaches avoid creating new instances in majority class regions. They are more robust to noisy examples (the number of new instances generated per cluster is proportional to the cluster's size). The clusters are automatically generated. Our new methods do not need tuning parameters, and they can deal with both numerical and nominal attributes. The two methods were tested with twenty artificial datasets and twenty-three datasets from the UCI Machine Learning repository. For our experiments, we used six classifiers, and results were evaluated with recall, precision, F-measure, and AUC measures, which are more suitable for class-imbalanced datasets. We performed ANOVA and paired t-tests to show that the proposed methods are competitive and in many cases significantly better than the rest of the oversampling methods used during the comparison.
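The statement that the number of new instances per cluster is proportional to the cluster's size can be sketched as a simple allocation step. The abstract does not say how SOI-C or SOI-CJ round fractional shares, so the largest-remainder rounding used here, the function name, and the example sizes are all assumptions.

```python
def allocate_per_cluster(cluster_sizes, n_new):
    """Split n_new synthetic instances across clusters in proportion to
    cluster size, using largest-remainder rounding so counts sum to n_new."""
    total = sum(cluster_sizes)
    if total == 0:
        raise ValueError("at least one cluster must be non-empty")
    exact = [n_new * size / total for size in cluster_sizes]
    counts = [int(share) for share in exact]  # round every share down
    leftover = n_new - sum(counts)
    # Hand the remaining instances to the largest fractional parts.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical minority clusters of 6, 3, and 1 instances.
print(allocate_per_cluster([6, 3, 1], 20))
```

Under this scheme a cluster holding a single, possibly noisy, instance receives only a small share of the synthetic instances, which matches the robustness argument given in the abstract.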

Oversampling for Imbalanced Data via Optimal Transport

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015605 ◽

2019 ◽

Vol 33 ◽

pp. 5605-5612 ◽

Cited By ~ 1

Author(s):

Yuguang Yan ◽

Mingkui Tan ◽

Yanwu Xu ◽

Jiezhang Cao ◽

Michael Ng ◽

...

Keyword(s):

Real World ◽

Optimal Transport ◽

Imbalanced Data ◽

Data Sets ◽

Similar Distribution ◽

Real World Data ◽

Geometric Information ◽

Minority Class ◽

Real World Applications ◽

Multiple Metrics

The issue of data imbalance occurs in many real-world applications, especially in medical diagnosis, where normal cases usually far outnumber abnormal cases. To alleviate this issue, one of the most important approaches is oversampling, which seeks to synthesize minority class samples to balance the numbers of different classes. However, existing methods barely consider the global geometric information in the distribution of minority class samples, and thus may incur a distribution mismatch between real and synthetic samples. In this paper, relying on optimal transport (Villani 2008), we propose an oversampling method that exploits global geometric information of the data to make synthetic samples follow a distribution similar to that of minority class samples. Moreover, we introduce a novel regularization based on synthetic samples and shift the distribution of minority class samples according to loss information. Experiments on toy and real-world data sets demonstrate the efficacy of our proposed method in terms of multiple metrics.

Boosting Method for Local Learning in Statistical Pattern Recognition

Neural Computation ◽

10.1162/neco.2008.06-07-549 ◽

2008 ◽

Vol 20 (11) ◽

pp. 2792-2838 ◽

Cited By ~ 6

Author(s):

Masanori Kawakita ◽

Shinto Eguchi

Keyword(s):

Estimation Error ◽

Approximation Error ◽

Real Data ◽

Classification Performance ◽

Classification Rule ◽

Likelihood Method ◽

Data Sets ◽

Bayes Risk ◽

Classification Problems ◽

Boosting Method

We propose a local boosting method for classification problems, borrowing an idea from the local likelihood method. Our proposal, local boosting, includes a simple device for localization for computational feasibility. We proved the Bayes risk consistency of local boosting in the framework of probably approximately correct (PAC) learning. Inspection of the proof provides a useful viewpoint for comparing ordinary boosting and local boosting with respect to the estimation error and the approximation error. Both boosting methods have Bayes risk consistency if their approximation errors decrease to zero. Compared to ordinary boosting, local boosting may perform better by controlling the trade-off between the estimation error and the approximation error. Ordinary boosting with complicated base classifiers or other strong classification methods, including kernel machines, may have classification performance comparable to local boosting with simple base classifiers, for example, decision stumps. Local boosting, however, has an advantage with respect to interpretability. Local boosting with simple base classifiers offers a simple way to specify which features are informative and how their values contribute to a classification rule, even though only locally. Several numerical studies on real data sets confirm these advantages of local boosting.

VFC-SMOTE: very fast continuous synthetic minority oversampling for evolving data streams

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00786-0 ◽

2021 ◽

Author(s):

Alessio Bernardo ◽

Emanuele Della Valle

Keyword(s):

Data Streams ◽

Concept Drift ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Minority Class ◽

Machine Learning Classification ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

Better Than

The world is constantly changing, and so is the massive amount of data produced. However, only a few studies deal with online class imbalance learning, which combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy to be prepended to any streaming machine learning classification algorithm, aiming at oversampling the minority class using a new version of SMOTE and Borderline-SMOTE inspired by data sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We bring statistical evidence that VFC-SMOTE pipelines learn models whose minority class performance is better than that of the state of the art. Moreover, we analyze the time and memory consumption and the concept drift recovery speed.

Oversampling Imbalanced Data Based on Convergent WGAN for Network Threat Detection

Security and Communication Networks ◽

10.1155/2021/9206440 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Yanping Xu ◽

Xiaoyu Zhang ◽

Zhenliang Qiu ◽

Xia Zhang ◽

Jian Qiu ◽

...

Keyword(s):

Nash Equilibrium ◽

Loss Function ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Threat Detection ◽

Training Process ◽

Generative Adversarial Network ◽

Minority Class ◽

Adversarial Network

Class imbalance is a common problem in network threat detection. Oversampling the minority class by generating enough new minority samples is a popular countermeasure. The generative adversarial network (GAN) is a typical generative model that can generate any number of artificial minority samples that are close to the real data. However, a GAN is difficult to train, and the Nash equilibrium is almost impossible to achieve. Therefore, in order to improve the training stability of GAN-based oversampling for network threat detection, this paper proposes a convergent WGAN-based oversampling model called convergent WGAN (CWGAN). The training process of CWGAN contains multiple iterations. In each iteration, the number of training epochs of the discriminator is dynamic and is determined by the convergence of the discriminator loss function in the last two iterations. When the discriminator has been trained to convergence, the generator is then trained to generate new minority samples. The experimental results show that CWGAN not only improves the training stability of WGAN, making the loss smoother and closer to 0, but also improves the performance on the minority class through oversampling, which means that CWGAN can improve the performance of network threat detection.
