A Novel Selective Ensemble Algorithm for Imbalanced Data Classification Based on Exploratory Undersampling

2014 ◽  
Vol 2014 ◽  
pp. 1-14 ◽  
Author(s):  
Qing-Yan Yin ◽  
Jiang-She Zhang ◽  
Chun-Xia Zhang ◽  
Nan-Nan Ji

Learning from imbalanced data is one of the emerging challenges in machine learning. Recently, ensemble learning has arisen as an effective solution to class imbalance problems. The combination of bagging or boosting with data-preprocessing resampling, most notably the simple and accurate exploratory undersampling, has become the most popular approach to imbalanced data classification. In this paper, we propose a novel selective ensemble construction method based on exploratory undersampling, RotEasy, which improves storage requirements and computational efficiency through ensemble pruning. Our methodology aims to enhance the diversity between individual classifiers through feature extraction and diversity-regularized ensemble pruning. We made a comprehensive comparison between our method and some state-of-the-art imbalanced learning methods. Experimental results on 20 real-world imbalanced data sets show that RotEasy achieves a significant increase in performance, as confirmed by a nonparametric statistical test under various evaluation criteria.
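The exploratory-undersampling idea underlying this line of work (as in EasyEnsemble) can be sketched in a few lines. The function below is an illustrative simplification, not the RotEasy algorithm itself: it draws several balanced subsets by undersampling the majority class, one subset per base learner.

```python
import random

def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Exploratory undersampling: draw several balanced training sets,
    each pairing the full minority class with a random majority sample."""
    rng = random.Random(seed)
    return [rng.sample(majority, len(minority)) + minority
            for _ in range(n_subsets)]

# Toy usage: 10 majority vs 3 minority examples, 4 base learners.
maj = [(float(i), 0) for i in range(10)]
mino = [(float(i), 1) for i in range(3)]
subsets = balanced_subsets(maj, mino, n_subsets=4)
```

Each subset would then train one base classifier whose outputs the ensemble combines; RotEasy additionally applies feature extraction and diversity-regularized pruning on top of such a scheme.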

Some real applications, for example, text categorization and subcellular localization of protein sequences, involve multi-label classification with imbalanced data. Various traditional approaches have been introduced to describe the relation between heuristic and task decompositions and to classify attributes of imbalanced, uncertain data sets. Here these issues are addressed by utilizing the min-max modular network. The min-max modular network can decompose a multi-label problem into a series of small two-class sub-problems, which can then be combined by two simple rules. Several decomposition strategies are also presented to improve the performance of min-max modular networks. Experimental results on subcellular localization demonstrate that this method has better generalization performance than traditional SVMs in solving multi-label and imbalanced data problems. In addition, it is also much faster than traditional SVMs.
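The two combination rules can be sketched generically; the module layout below is an assumed illustration of the MIN and MAX rules, not the paper's exact architecture:

```python
def min_max_combine(scores):
    """Combine sub-classifier outputs: scores[i][j] is the output of the
    module trained on positive subset i vs. negative subset j.  The MIN
    rule is applied across negative subsets, then the MAX rule across
    positive subsets."""
    return max(min(row) for row in scores)

# Two positive subsets x two negative subsets of a decomposed problem.
module_scores = [[0.9, 0.2],
                 [0.8, 0.7]]
combined = min_max_combine(module_scores)  # min per row -> [0.2, 0.7]; max -> 0.7
```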


2018 ◽  
Vol 2018 ◽  
pp. 1-13 ◽  
Author(s):  
Jianhong Yan ◽  
Suqing Han

Learning with imbalanced data sets is considered one of the key topics in the machine learning community. Stacking ensembles are efficient for normally balanced data sets; however, stacking has seldom been applied to imbalanced data. In this paper, we propose a novel RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on a 2-layer learning model. The first step is Level 0 model generalization, including data preprocessing and base model training. The second step is Level 1 model generalization, involving a cost-sensitive classifier and a logistic regression algorithm. In the learning phase, preprocessing techniques can be embedded in imbalanced-data learning methods. In the cost-sensitive algorithm, the cost matrix is combined with both data characteristics and algorithms. In the RECSG method, the ensemble algorithm is combined with imbalanced-data techniques. According to experimental results obtained on 17 public imbalanced data sets, as indicated by various evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method showed better classification performance than other ensemble and single algorithms. The proposed method is especially more efficient when the performance of the base classifier is low. All these results demonstrate that the proposed method can be applied to the class imbalance problem.
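As a rough illustration of how a cost matrix can enter the Level 1 combination step, here is a simplified stand-in for a cost-sensitive second layer; the function names and the threshold rule are assumptions, not the paper's exact formulation:

```python
def level1_decision(meta_features, weights, cost_fn, cost_fp):
    """Level 1 of a stacked model: combine Level-0 probability outputs with a
    weighted sum, then cut at a cost-sensitive threshold.  A higher
    false-negative cost lowers the threshold, favouring the minority class."""
    score = sum(w * f for w, f in zip(weights, meta_features))
    threshold = cost_fp / (cost_fp + cost_fn)
    return 1 if score >= threshold else 0

# Three base models vote with equal weight; missing a positive costs 4x.
probs = [0.3, 0.1, 0.4]
label = level1_decision(probs, [1 / 3] * 3, cost_fn=4, cost_fp=1)
```

With equal costs the threshold would be 0.5 and the same example would be rejected; skewing the cost matrix toward false negatives flips the decision.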


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Qing-Yan Yin ◽  
Jiang-She Zhang ◽  
Chun-Xia Zhang ◽  
Sheng-Cai Liu

Cost-sensitive boosting algorithms have proven successful for solving difficult class imbalance problems. However, the influence of misclassification costs and imbalance level on algorithm performance is still not clear. The present paper aims to conduct an empirical comparison of six representative cost-sensitive boosting algorithms: AdaCost, CSB1, CSB2, AdaC1, AdaC2, and AdaC3. These algorithms are thoroughly evaluated in a comprehensive suite of experiments, in which nearly fifty thousand classification models are trained on 17 real-world imbalanced data sets. Experimental results show that the AdaC serial algorithms generally outperform AdaCost and CSB when dealing with data sets of different imbalance levels. Furthermore, the AdaC2 algorithm stands out around the misclassification cost setting C_N = 0.7, C_P = 1, especially for strongly imbalanced data sets. For data sets with a low level of imbalance, there is no significant difference between the AdaC serial algorithms. In addition, the results indicate that AdaC1 is comparatively insensitive to the misclassification costs, which is consistent with the findings of preceding research.
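These algorithms differ mainly in where the cost term enters the boosting weight update. A minimal sketch of the AdaC2-style step, in which the cost multiplies the weight outside the exponent (with alpha taken as given), looks like:

```python
import math

def adac2_update(weights, costs, labels, preds, alpha):
    """One AdaC2-style reweighting step: cost C_i scales the weight outside
    the exponential term (AdaC1 instead places the cost inside the exponent)."""
    new = [c * w * math.exp(-alpha * y * h)
           for w, c, y, h in zip(weights, costs, labels, preds)]
    z = sum(new)  # normalization so the weights remain a distribution
    return [w / z for w in new]

# Four examples with equal initial weights; example 1 is a misclassified
# positive (high cost), example 2 a misclassified negative (lower cost).
w = adac2_update([0.25] * 4, [1.0, 1.0, 0.7, 0.7],
                 labels=[1, 1, -1, -1], preds=[1, -1, 1, -1], alpha=0.5)
```

The misclassified high-cost positive ends up with the largest weight, which is exactly the mechanism that pushes the boosted ensemble toward the minority class.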


Author(s):  
Himani Tiwari

Abstract: The class imbalance problem is one of the most challenging problems faced by the machine learning community. Imbalance refers to the situation in which the instances of one class are relatively scarce compared with those of the other classes. A number of over-sampling and under-sampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the class imbalance issue and examines various balancing methods for dealing with it. To illustrate the differences, an experiment is conducted using multiple simulated data sets, comparing the performance of these oversampling methods on different classifiers under various evaluation criteria. In addition, the effect of different parameters, such as the number of features and the imbalance ratio, on classifier performance is also evaluated. Keywords: Imbalanced learning, Over-sampling methods, Under-sampling methods, Classifier performance, Evaluation metrics
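The simplest of the balancing methods surveyed here, random oversampling, can be sketched as follows; this is an illustrative baseline, not any specific method from the study:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority examples until class counts match."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

# An 8:2 imbalance becomes 8:8 after oversampling.
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
Xb, yb = random_oversample(X, y)
```

Random undersampling is the mirror image: instead of duplicating minority examples, it discards majority examples until the counts match.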


2017 ◽  
Vol 42 (2) ◽  
pp. 149-176 ◽  
Author(s):  
Szymon Wojciechowski ◽  
Szymon Wilk

Abstract In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare, and outliers) in the minority class. The results revealed that the latter factor was the most critical one and that it exacerbated other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors - 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods - SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.
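The safe/borderline/rare/outlier distinction is commonly operationalized by counting same-class examples among a minority point's k nearest neighbours. A sketch using the conventional thresholds for k = 5 (the thresholds are the usual ones from this taxonomy, assumed rather than quoted from the paper):

```python
def minority_example_type(same_class_neighbors, k=5):
    """Label a minority example by how many of its k nearest neighbours
    share its class (conventional thresholds for k = 5)."""
    assert 0 <= same_class_neighbors <= k
    if same_class_neighbors >= 4:
        return "safe"          # deep inside the minority region
    if same_class_neighbors >= 2:
        return "borderline"    # near the class boundary
    if same_class_neighbors == 1:
        return "rare"          # small isolated group
    return "outlier"           # surrounded entirely by the majority class
```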


Author(s):  
Bo Huang ◽  
Yimin Zhu ◽  
Zhongzhen Wang ◽  
Zhijun Fang

Class-imbalance learning is one of the most significant research topics in data mining and machine learning. The imbalance problem means that one class has many more samples than the others. To deal with the issues of low classification accuracy and high time complexity, this paper proposes a novel imbalanced-data classification algorithm based on clustering and SVM. The algorithm performs under-sampling of the majority samples based on the distribution characteristics of the minority samples. First, specific clusters are detected by cluster analysis of the minority class. Second, a cluster boundary strategy is proposed to eliminate the bad influence of noise samples. To construct a balanced data set, this paper proposes three principles for under-sampling the majority samples according to the characteristics of the samples in each cluster. Finally, the optimal classification model is obtained from a linear combination of hybrid-kernel SVMs. Experiments on data sets from the UCI and KEEL databases show that our algorithm effectively decreases the interference of noise samples. Compared with SMOTE and Fast-CBUS, the proposed algorithm not only reduces the feature dimension but also generally improves the precision of the minority classes under different labeled-sample rates.
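A one-dimensional toy sketch of the general idea of cluster-guided undersampling follows; the paper's exact boundary strategy and its three principles are not reproduced here, and the radius rule below is an assumption for illustration only:

```python
def undersample_overlap(majority, minority_centers, radius):
    """Drop majority points falling within a radius of any minority-cluster
    center; such overlapping points behave like noise for the classifier."""
    def overlaps(x):
        return any(abs(x - c) <= radius for c in minority_centers)
    return [x for x in majority if not overlaps(x)]

# Majority points at 0.1 and 0.5 overlap the minority cluster around 0.0
# and are removed; the distant points are kept for training.
kept = undersample_overlap([0.1, 0.5, 2.0, 3.0], minority_centers=[0.0], radius=1.0)
```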


2019 ◽  
Vol 2019 ◽  
pp. 1-13 ◽  
Author(s):  
Wenhao Xie ◽  
Gongqian Liang ◽  
Zhonghui Dong ◽  
Baoyu Tan ◽  
Baosheng Zhang

In imbalanced data, at least one class is heavily outnumbered by the other classes. Imbalanced data sets exist widely in the real world, and their classification has become one of the hottest issues in the field of data mining. At present, classification solutions for imbalanced data sets operate mainly at the algorithm level and the data level. At the data level, both oversampling and undersampling strategies are used to restore data balance via data reconstruction. SMOTE and Random-SMOTE are two classic oversampling algorithms, but they still possess drawbacks such as blind interpolation and fuzzy class boundaries. In this paper, an improved oversampling algorithm based on a sample-selection strategy for imbalanced data classification is proposed. Building on the Random-SMOTE algorithm, the support vectors (SV) are extracted and treated as the parent samples from which new examples are synthesized for the minority class, in order to balance the data. Lastly, the imbalanced data sets are classified with the SVM classification algorithm. The F-measure value, G-mean value, ROC curve, and AUC value are selected as performance evaluation indexes. Experimental results show that this improved algorithm demonstrates good classification performance on imbalanced data sets.
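The core SMOTE interpolation step that these variants build on can be sketched in a few lines; this is a generic illustration, and the support-vector parent selection of the proposed algorithm is not shown:

```python
import random

def smote_point(parent, neighbor, rng=None):
    """Synthesize one minority example on the line segment between a parent
    sample and one of its minority-class neighbours (the core SMOTE step)."""
    rng = rng or random.Random(0)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(parent, neighbor)]

# A synthetic point always lies between the two parents, coordinate-wise.
new = smote_point([1.0, 2.0], [3.0, 6.0])
```

Choosing support vectors as parents, as proposed here, concentrates the synthetic points near the decision boundary instead of interpolating blindly across the minority region.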


2014 ◽  
Vol 989-994 ◽  
pp. 1756-1761 ◽  
Author(s):  
Wei Duan ◽  
Liang Jing ◽  
Xiang Yang Lu

As a supervised classification algorithm, the Support Vector Machine (SVM) excels at small-sample, nonlinear, and high-dimensional classification problems. However, SVM is inefficient for imbalanced data set classification. Therefore, a cost-sensitive SVM (CSSVM) should be designed for imbalanced data sets. This paper proposes a method that constructs a CSSVM based on information entropy: the information entropies of the different classes of the data set are used to determine the values of the penalty factors of the CSSVM.
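The entropy computation can be sketched as follows; the mapping from entropy terms to penalty factors is a plausible illustration of the idea, not necessarily the paper's exact formula:

```python
import math

def class_entropy_terms(counts):
    """Per-class contribution -p*log2(p) to the data-set entropy.  In a
    binary imbalanced set the minority class contributes the larger term,
    so scaling the per-class penalty factors C+ and C- by these terms gives
    the minority class the stronger penalty in a cost-sensitive SVM."""
    total = sum(counts)
    return [-(c / total) * math.log2(c / total) for c in counts]

# 90:10 imbalance: the minority class gets the larger entropy term.
terms = class_entropy_terms([90, 10])
```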

