A Novel Selective Ensemble Algorithm for Imbalanced Data Classification Based on Exploratory Undersampling

2014 ◽  
Vol 2014 ◽  
pp. 1-14 ◽  
Author(s):  
Qing-Yan Yin ◽  
Jiang-She Zhang ◽  
Chun-Xia Zhang ◽  
Nan-Nan Ji

Learning from imbalanced data is one of the emerging challenges in machine learning. Recently, ensemble learning has arisen as an effective solution to class imbalance problems. The combination of bagging or boosting with data-preprocessing resampling, most notably the simple and accurate exploratory undersampling, has become the most popular approach to imbalanced data classification. In this paper, we propose a novel selective ensemble construction method based on exploratory undersampling, RotEasy, which improves storage requirements and computational efficiency through ensemble pruning. Our methodology aims to enhance the diversity between individual classifiers through feature extraction and diversity-regularized ensemble pruning. We made a comprehensive comparison between our method and some state-of-the-art imbalanced learning methods. Experimental results on 20 real-world imbalanced data sets show that RotEasy achieves a significant increase in performance, as confirmed by a nonparametric statistical test under various evaluation criteria.
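The exploratory-undersampling idea underlying this line of work (as in EasyEnsemble) can be sketched in a few lines. The function below is an illustrative simplification, not the RotEasy algorithm itself: it draws several balanced subsets by undersampling the majority class, one subset per base learner.

```python
import random

def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Exploratory undersampling: draw several balanced training sets,
    each pairing the full minority class with a random majority sample."""
    rng = random.Random(seed)
    return [rng.sample(majority, len(minority)) + minority
            for _ in range(n_subsets)]

# Toy usage: 10 majority vs 3 minority examples, 4 base learners.
maj = [(float(i), 0) for i in range(10)]
mino = [(float(i), 1) for i in range(3)]
subsets = balanced_subsets(maj, mino, n_subsets=4)
```

Each subset would then train one base classifier whose outputs the ensemble combines; RotEasy additionally applies feature extraction and diversity-regularized pruning on top of such a scheme.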

Some real applications, for example, text categorization and subcellular localization of protein sequences, involve multi-label classification with imbalanced data. Various traditional approaches have been introduced to describe the relation between heuristic and task decompositions and to classify attributes of imbalanced, uncertain data sets. Here these issues are addressed by utilizing the min-max modular network. The min-max modular network can decompose a multi-label problem into a series of small two-class sub-problems, which can then be combined by two simple rules. Several decomposition strategies are also presented to improve the performance of min-max modular networks. Experimental results on subcellular localization demonstrate that this method has better generalization performance than traditional SVMs in solving multi-label and imbalanced data problems. In addition, it is also much faster than traditional SVMs.
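The two combination rules can be sketched generically; the module layout below is an assumed illustration of the MIN and MAX rules, not the paper's exact architecture:

```python
def min_max_combine(scores):
    """Combine sub-classifier outputs: scores[i][j] is the output of the
    module trained on positive subset i vs. negative subset j.  The MIN
    rule is applied across negative subsets, then the MAX rule across
    positive subsets."""
    return max(min(row) for row in scores)

# Two positive subsets x two negative subsets of a decomposed problem.
module_scores = [[0.9, 0.2],
                 [0.8, 0.7]]
combined = min_max_combine(module_scores)  # min per row -> [0.2, 0.7]; max -> 0.7
```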


2018 ◽  
Vol 2018 ◽  
pp. 1-13 ◽  
Author(s):  
Jianhong Yan ◽  
Suqing Han

Learning with imbalanced data sets is considered one of the key topics in the machine learning community. Stacking ensembles are efficient for normally balanced data sets; however, stacking has seldom been applied to imbalanced data. In this paper, we propose a novel RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on a 2-layer learning model. The first step is Level 0 model generalization, including data preprocessing and base model training. The second step is Level 1 model generalization, involving a cost-sensitive classifier and a logistic regression algorithm. In the learning phase, preprocessing techniques can be embedded in imbalanced-data learning methods. In the cost-sensitive algorithm, the cost matrix is combined with both data characteristics and algorithms. In the RECSG method, the ensemble algorithm is combined with imbalanced-data techniques. According to experimental results obtained on 17 public imbalanced data sets, as indicated by various evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method showed better classification performance than other ensemble and single algorithms. The proposed method is especially more efficient when the performance of the base classifier is low. All these results demonstrate that the proposed method can be applied to the class imbalance problem.
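As a rough illustration of how a cost matrix can enter the Level 1 combination step, here is a simplified stand-in for a cost-sensitive second layer; the function names and the threshold rule are assumptions, not the paper's exact formulation:

```python
def level1_decision(meta_features, weights, cost_fn, cost_fp):
    """Level 1 of a stacked model: combine Level-0 probability outputs with a
    weighted sum, then cut at a cost-sensitive threshold.  A higher
    false-negative cost lowers the threshold, favouring the minority class."""
    score = sum(w * f for w, f in zip(weights, meta_features))
    threshold = cost_fp / (cost_fp + cost_fn)
    return 1 if score >= threshold else 0

# Three base models vote with equal weight; missing a positive costs 4x.
probs = [0.3, 0.1, 0.4]
label = level1_decision(probs, [1 / 3] * 3, cost_fn=4, cost_fp=1)
```

With equal costs the threshold would be 0.5 and the same example would be rejected; skewing the cost matrix toward false negatives flips the decision.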


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Qing-Yan Yin ◽  
Jiang-She Zhang ◽  
Chun-Xia Zhang ◽  
Sheng-Cai Liu

Cost-sensitive boosting algorithms have proven successful for solving difficult class imbalance problems. However, the influence of misclassification costs and imbalance level on algorithm performance is still not clear. The present paper aims to conduct an empirical comparison of six representative cost-sensitive boosting algorithms: AdaCost, CSB1, CSB2, AdaC1, AdaC2, and AdaC3. These algorithms are thoroughly evaluated in a comprehensive suite of experiments, in which nearly fifty thousand classification models are trained on 17 real-world imbalanced data sets. Experimental results show that the AdaC serial algorithms generally outperform AdaCost and CSB when dealing with data sets of different imbalance levels. Furthermore, the AdaC2 algorithm stands out around the misclassification cost setting C_N = 0.7, C_P = 1, especially for strongly imbalanced data sets. For data sets with a low level of imbalance, there is no significant difference between the AdaC serial algorithms. In addition, the results indicate that AdaC1 is comparatively insensitive to the misclassification costs, which is consistent with the findings of preceding research.
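These algorithms differ mainly in where the cost term enters the boosting weight update. A minimal sketch of the AdaC2-style step, in which the cost multiplies the weight outside the exponent (with alpha taken as given), looks like:

```python
import math

def adac2_update(weights, costs, labels, preds, alpha):
    """One AdaC2-style reweighting step: cost C_i scales the weight outside
    the exponential term (AdaC1 instead places the cost inside the exponent)."""
    new = [c * w * math.exp(-alpha * y * h)
           for w, c, y, h in zip(weights, costs, labels, preds)]
    z = sum(new)  # normalization so the weights remain a distribution
    return [w / z for w in new]

# Four examples with equal initial weights; example 1 is a misclassified
# positive (high cost), example 2 a misclassified negative (lower cost).
w = adac2_update([0.25] * 4, [1.0, 1.0, 0.7, 0.7],
                 labels=[1, 1, -1, -1], preds=[1, -1, 1, -1], alpha=0.5)
```

The misclassified high-cost positive ends up with the largest weight, which is exactly the mechanism that pushes the boosted ensemble toward the minority class.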


Author(s):  
Himani Tiwari

Abstract: The class imbalance problem is one of the most challenging problems faced by the machine learning community. Imbalance refers to the situation in which the instances of one class are relatively scarce compared with those of the other classes. A number of over-sampling and under-sampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the class imbalance issue and examines various balancing methods for dealing with it. To illustrate the differences, an experiment is conducted using multiple simulated data sets, comparing the performance of these oversampling methods on different classifiers under various evaluation criteria. In addition, the effect of different parameters, such as the number of features and the imbalance ratio, on classifier performance is also evaluated. Keywords: Imbalanced learning, Over-sampling methods, Under-sampling methods, Classifier performance, Evaluation metrics
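The simplest of the balancing methods surveyed here, random oversampling, can be sketched as follows; this is an illustrative baseline, not any specific method from the study:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority examples until class counts match."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

# An 8:2 imbalance becomes 8:8 after oversampling.
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
Xb, yb = random_oversample(X, y)
```

Random undersampling is the mirror image: instead of duplicating minority examples, it discards majority examples until the counts match.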


2017 ◽  
Vol 42 (2) ◽  
pp. 149-176 ◽  
Author(s):  
Szymon Wojciechowski ◽  
Szymon Wilk

Abstract In this paper we describe the results of an experimental study in which we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio, and the distribution of specific types of examples (safe, borderline, rare, and outliers) in the minority class. The results revealed that the latter factor was the most critical one and that it exacerbated other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors - 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods - SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.
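The safe/borderline/rare/outlier distinction is commonly operationalized by counting same-class examples among a minority point's k nearest neighbours. A sketch using the conventional thresholds for k = 5 (the thresholds are the usual ones from this taxonomy, assumed rather than quoted from the paper):

```python
def minority_example_type(same_class_neighbors, k=5):
    """Label a minority example by how many of its k nearest neighbours
    share its class (conventional thresholds for k = 5)."""
    assert 0 <= same_class_neighbors <= k
    if same_class_neighbors >= 4:
        return "safe"          # deep inside the minority region
    if same_class_neighbors >= 2:
        return "borderline"    # near the class boundary
    if same_class_neighbors == 1:
        return "rare"          # small isolated group
    return "outlier"           # surrounded entirely by the majority class
```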


Author(s):  
Bo Huang ◽  
Yimin Zhu ◽  
Zhongzhen Wang ◽  
Zhijun Fang

Class-imbalance learning is one of the most significant research topics in data mining and machine learning. The imbalance problem means that one class has many more samples than the others. To deal with the issues of low classification accuracy and high time complexity, this paper proposes a novel imbalanced-data classification algorithm based on clustering and SVM. The algorithm performs under-sampling of the majority samples based on the distribution characteristics of the minority samples. First, specific clusters are detected by cluster analysis of the minority class. Second, a cluster boundary strategy is proposed to eliminate the bad influence of noise samples. To construct a balanced data set, this paper proposes three principles for under-sampling the majority samples according to the characteristics of the samples in each cluster. Finally, the optimal classification model is obtained from a linear combination of hybrid-kernel SVMs. Experiments on data sets from the UCI and KEEL databases show that our algorithm effectively decreases the interference of noise samples. Compared with SMOTE and Fast-CBUS, the proposed algorithm not only reduces the feature dimension but also generally improves the precision of the minority classes under different labeled-sample rates.
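A one-dimensional toy sketch of the general idea of cluster-guided undersampling follows; the paper's exact boundary strategy and its three principles are not reproduced here, and the radius rule below is an assumption for illustration only:

```python
def undersample_overlap(majority, minority_centers, radius):
    """Drop majority points falling within a radius of any minority-cluster
    center; such overlapping points behave like noise for the classifier."""
    def overlaps(x):
        return any(abs(x - c) <= radius for c in minority_centers)
    return [x for x in majority if not overlaps(x)]

# Majority points at 0.1 and 0.5 overlap the minority cluster around 0.0
# and are removed; the distant points are kept for training.
kept = undersample_overlap([0.1, 0.5, 2.0, 3.0], minority_centers=[0.0], radius=1.0)
```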


2019 ◽  
Vol 2019 ◽  
pp. 1-13 ◽  
Author(s):  
Wenhao Xie ◽  
Gongqian Liang ◽  
Zhonghui Dong ◽  
Baoyu Tan ◽  
Baosheng Zhang

In imbalanced data, at least one class is heavily outnumbered by the other classes. Imbalanced data sets exist widely in the real world, and their classification has become one of the hottest issues in the field of data mining. At present, classification solutions for imbalanced data sets operate mainly at the algorithm level and the data level. At the data level, both oversampling and undersampling strategies are used to restore data balance via data reconstruction. SMOTE and Random-SMOTE are two classic oversampling algorithms, but they still possess drawbacks such as blind interpolation and fuzzy class boundaries. In this paper, an improved oversampling algorithm based on a sample-selection strategy for imbalanced data classification is proposed. Building on the Random-SMOTE algorithm, the support vectors (SV) are extracted and treated as the parent samples from which new examples are synthesized for the minority class, in order to balance the data. Lastly, the imbalanced data sets are classified with the SVM classification algorithm. The F-measure value, G-mean value, ROC curve, and AUC value are selected as performance evaluation indexes. Experimental results show that this improved algorithm demonstrates good classification performance on imbalanced data sets.
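The core SMOTE interpolation step that these variants build on can be sketched in a few lines; this is a generic illustration, and the support-vector parent selection of the proposed algorithm is not shown:

```python
import random

def smote_point(parent, neighbor, rng=None):
    """Synthesize one minority example on the line segment between a parent
    sample and one of its minority-class neighbours (the core SMOTE step)."""
    rng = rng or random.Random(0)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(parent, neighbor)]

# A synthetic point always lies between the two parents, coordinate-wise.
new = smote_point([1.0, 2.0], [3.0, 6.0])
```

Choosing support vectors as parents, as proposed here, concentrates the synthetic points near the decision boundary instead of interpolating blindly across the minority region.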


2014 ◽  
Vol 989-994 ◽  
pp. 1756-1761 ◽  
Author(s):  
Wei Duan ◽  
Liang Jing ◽  
Xiang Yang Lu

As a supervised classification algorithm, the Support Vector Machine (SVM) excels at small-sample, nonlinear, and high-dimensional classification problems. However, SVM is inefficient for imbalanced data set classification. Therefore, a cost-sensitive SVM (CSSVM) should be designed for imbalanced data sets. This paper proposes a method that constructs a CSSVM based on information entropy: the information entropies of the different classes of the data set are used to determine the values of the penalty factors of the CSSVM.
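The entropy computation can be sketched as follows; the mapping from entropy terms to penalty factors is a plausible illustration of the idea, not necessarily the paper's exact formula:

```python
import math

def class_entropy_terms(counts):
    """Per-class contribution -p*log2(p) to the data-set entropy.  In a
    binary imbalanced set the minority class contributes the larger term,
    so scaling the per-class penalty factors C+ and C- by these terms gives
    the minority class the stronger penalty in a cost-sensitive SVM."""
    total = sum(counts)
    return [-(c / total) * math.log2(c / total) for c in counts]

# 90:10 imbalance: the minority class gets the larger entropy term.
terms = class_entropy_terms([90, 10])
```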

