Impact of class-imbalance on multi-class high-dimensional class prediction

Rok Blagus; Lara Lusa

doi:10.51936/grxm1445

Advances in Methodology and Statistics ◽

10.51936/grxm1445 ◽

2012 ◽

Vol 9 (1) ◽

Author(s):

Rok Blagus ◽

Lara Lusa

Keyword(s):

Sample Size ◽

Binary Classification ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

High Dimensional ◽

Class Imbalance Problem ◽

Class Prediction ◽

Linear Discriminant ◽

Imbalance Problem

The goal of multi-class supervised classification is to develop a rule that accurately predicts the class membership of new samples when the number of classes is larger than two. In this paper we consider high-dimensional class-imbalanced data: the number of variables greatly exceeds the number of samples, and the number of samples in each class is not equal. We focus on Friedman's one-versus-one approach for three-class problems and show how its class probabilities depend on the class probabilities from the binary classification sub-problems. We further explore its performance using diagonal linear discriminant analysis (DLDA) as a base classifier and compare it with multi-class DLDA on simulated and real data. Our results show that class imbalance has a significant effect on the classification results: the classification is biased towards the majority class, as in two-class problems, and the problem is magnified when the number of variables is large. The amount of bias also depends, jointly, on the magnitude of the differences between the classes and on the sample size: the bias diminishes when the difference between the classes is larger or the sample size is increased. Variable selection also plays an important role in the class-imbalance problem, and the most effective strategy depends on the type of differences that exist between classes. DLDA seems to be among the classifiers least sensitive to class imbalance, and its use is recommended also for multi-class problems. Whenever possible, experiments should be planned using balanced data in order to avoid the class-imbalance problem.
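
To make the one-versus-one idea in the abstract above concrete, the following is a minimal sketch and not the authors' implementation. It assumes a textbook form of DLDA (class means, pooled per-variable variances, class priors omitted) and max-wins voting over the pairwise classifiers, with ties broken by class order. The function names `fit_dlda`, `predict_dlda`, and `predict_one_vs_one` are hypothetical.

```python
from itertools import combinations

def fit_dlda(X, y):
    """Estimate class means and pooled per-variable variances."""
    classes = sorted(set(y))
    p = len(X[0])
    means = {}
    for k in classes:
        rows = [x for x, label in zip(X, y) if label == k]
        means[k] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    # Pooled within-class variance: one value per variable, covariances ignored.
    dof = len(X) - len(classes)
    var = [sum((x[j] - means[label][j]) ** 2 for x, label in zip(X, y)) / dof
           for j in range(p)]
    return means, var

def predict_dlda(model, x):
    """Assign x to the class whose mean is nearest in variance-scaled distance."""
    means, var = model
    def score(k):
        return sum((x[j] - means[k][j]) ** 2 / var[j] for j in range(len(x)))
    return min(means, key=score)

def predict_one_vs_one(X, y, x_new):
    """Max-wins voting: each pairwise DLDA casts one vote for x_new."""
    classes = sorted(set(y))
    votes = {k: 0 for k in classes}
    for a, b in combinations(classes, 2):
        pair = [(x, label) for x, label in zip(X, y) if label in (a, b)]
        model = fit_dlda([x for x, _ in pair], [label for _, label in pair])
        votes[predict_dlda(model, x_new)] += 1
    # Assumption: ties are broken by class order (first maximum wins).
    return max(votes, key=votes.get)

# Toy three-class data (illustrative values only).
X = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [5.2, 4.9], [-5.0, 5.0], [-5.2, 5.2]]
y = ["a", "a", "b", "b", "c", "c"]
print(predict_one_vs_one(X, y, [4.8, 5.3]))
```

Because every pairwise model is fitted only on the two classes involved, the class sizes within each pair drive that sub-problem's behaviour, which is one way to see why imbalance in the binary sub-problems can carry over to the multi-class prediction.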

The Class-Imbalance Problem for High-Dimensional Class Prediction

2012 11th International Conference on Machine Learning and Applications ◽

10.1109/icmla.2012.223 ◽

2012 ◽

Cited By ~ 3

Author(s):

Lara Lusa ◽

Rok Blagus

Keyword(s):

Class Imbalance ◽

High Dimensional ◽

Class Imbalance Problem ◽

Class Prediction ◽

Imbalance Problem

Tackling Class Imbalance Problem in Binary Classification using Augmented Neighborhood Cleaning Algorithm

Lecture Notes in Electrical Engineering - Information Science and Applications ◽

10.1007/978-3-662-46578-3_98 ◽

2015 ◽

pp. 827-834 ◽

Cited By ~ 1

Author(s):

Nadyah Obaid Al Abdouli ◽

Zeyar Aung ◽

Wei Lee Woon ◽

Davor Svetinovic

Keyword(s):

Binary Classification ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Imbalance Problem

Credibility Based Imbalance Boosting Method for Software Defect Proneness Prediction

Applied Sciences ◽

10.3390/app10228059 ◽

2020 ◽

Vol 10 (22) ◽

pp. 8059

Author(s):

Haonan Tong ◽

Shihai Wang ◽

Guangling Li

Keyword(s):

Class Imbalance ◽

Area Under The Curve ◽

Imbalanced Data ◽

Class Imbalance Problem ◽

Synthetic Sample ◽

Promising Alternative ◽

Imbalance Problem ◽

Software Defect ◽

Boosting Method ◽

High Credibility

Imbalanced data are a major factor in degrading the performance of software defect models. Software defect datasets are imbalanced by nature, i.e., non-defect-prone modules far outnumber defect-prone ones, which biases classifiers towards the majority class samples. In this paper, we propose a novel credibility-based imbalance boosting (CIB) method to address the class-imbalance problem in software defect proneness prediction. The method measures the credibility of synthetic samples based on their distribution by introducing a credit factor for every synthetic sample, and proposes a weight updating scheme that makes the base classifiers focus on real samples and on synthetic samples with high credibility. Experiments are performed on 11 NASA datasets and nine PROMISE datasets, comparing CIB with MAHAKIL, AdaC2, AdaBoost, SMOTE, RUS, and no sampling in terms of four performance measures: area under the curve (AUC), F1, AGF, and Matthews correlation coefficient (MCC). The Wilcoxon signed-rank test and Cliff’s δ are used to test statistical significance and to calculate effect size, respectively. The experimental results show that CIB is a more promising alternative for addressing the class-imbalance problem in software defect proneness prediction than previous methods.
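
Two of the quantities named in the abstract above, the Matthews correlation coefficient and Cliff’s δ, have standard textbook definitions that can be sketched in a few lines. This is an illustrative sketch only; the function names are hypothetical, the convention of returning 0.0 for MCC when the denominator is zero is an assumption, and none of it is taken from the paper's code.

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# A classifier that always predicts the majority class on a 10:90 split
# reaches 90% accuracy, yet its MCC is 0.
print(mcc(tp=0, fp=0, fn=10, tn=90))
```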

CoGBUS- Center of Gravity based under Sampling Method for Imbalanced Data Classification

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2077.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 2463-2468

Keyword(s):

Learning Community ◽

Sampling Method ◽

Class Imbalance ◽

Imbalanced Data ◽

Center Of Gravity ◽

Classification Algorithms ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Imbalanced Data Classification ◽

Under Sampling

Learning from class-imbalanced data has become a challenging issue in the machine learning community, as all classification algorithms are designed to work for balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques, undersampling and oversampling, are the more flexible and versatile. This paper introduces a new undersampling concept based on the Center of Gravity principle, which helps to reduce the excess instances of the majority class. This work is suited to binary-class problems. The proposed technique, CoGBUS, overcomes the class imbalance problem and yields the best results in the study. We use F-Score, GMean and ROC to evaluate the performance of the method.
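
For orientation, the simplest form of undersampling, plain random undersampling, can be sketched as follows. This is not CoGBUS, whose Center of Gravity selection rule is not described in the abstract; it only shows the general idea of discarding excess majority-class instances. The function name `random_undersample` and the `seed` parameter are hypothetical.

```python
import random

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, label in zip(X, y):
        by_class.setdefault(label, []).append(x)
    n_min = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Sample without replacement so no instance is duplicated.
        for x in rng.sample(rows, n_min):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out

# Toy data: 10 majority instances (label 0) and 2 minority instances (label 1).
X = [[i] for i in range(12)]
y = [0] * 10 + [1] * 2
X_bal, y_bal = random_undersample(X, y)
print(y_bal.count(0), y_bal.count(1))
```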

Classifying Imbalanced Data Sets by a Novel RE-Sample and Cost-Sensitive Stacked Generalization Method

Mathematical Problems in Engineering ◽

10.1155/2018/5036710 ◽

2018 ◽

Vol 2018 ◽

pp. 1-13 ◽

Cited By ~ 8

Author(s):

Jianhong Yan ◽

Suqing Han

Keyword(s):

Learning Community ◽

Class Imbalance ◽

Imbalanced Data ◽

Data Sets ◽

Class Imbalance Problem ◽

Imbalanced Data Sets ◽

Imbalance Data ◽

Imbalance Problem ◽

Stacked Generalization ◽

Model Generalization

Learning with imbalanced data sets is considered one of the key topics in the machine learning community. Stacking ensemble is an efficient algorithm for normal balanced data sets. However, stacking ensemble has seldom been applied to imbalanced data. In this paper, we propose a novel RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on 2-layer learning models. The first step is Level 0 model generalization, including data preprocessing and base model training. The second step is Level 1 model generalization, involving a cost-sensitive classifier and a logistic regression algorithm. In the learning phase, preprocessing techniques can be embedded in imbalanced data learning methods. In the cost-sensitive algorithm, the cost matrix is combined with both data characteristics and algorithms. In the RECSG method, the ensemble algorithm is combined with imbalanced data techniques. According to the experimental results obtained with 17 public imbalanced data sets, as indicated by various evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method showed better classification performance than other ensemble and single algorithms. The proposed method is especially efficient when the performance of the base classifier is low. All of this demonstrates that the proposed method can be applied to the class imbalance problem.
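
The role a cost matrix plays in cost-sensitive classification can be illustrated with the generic minimum-expected-cost decision rule. This is a sketch of that general rule under assumed inputs, not the RECSG method itself; the function name `min_cost_class` and the example probabilities and costs are hypothetical.

```python
def min_cost_class(probs, cost):
    """Pick the prediction with the lowest expected misclassification cost.

    probs[i]   : estimated probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    """
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=expected.__getitem__)

# Class 0 is the majority class, class 1 the minority class.
probs = [0.8, 0.2]
equal_costs = [[0, 1], [1, 0]]
# Assumed costs: missing a minority sample is ten times as costly.
skewed_costs = [[0, 1], [10, 0]]
print(min_cost_class(probs, equal_costs), min_cost_class(probs, skewed_costs))
```

With equal costs the rule follows the larger probability and predicts the majority class; once missing a minority sample is made more expensive, the same probabilities lead to a minority-class prediction.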

CLASSIFICATION OF IMBALANCED DATA: A REVIEW

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001409007326 ◽

2009 ◽

Vol 23 (04) ◽

pp. 687-719 ◽

Cited By ~ 534

Author(s):

YANMIN SUN ◽

ANDREW K. C. WONG ◽

MOHAMED S. KAMEL

Keyword(s):

Learning Algorithms ◽

Class Imbalance ◽

Imbalanced Data ◽

Class Imbalance Problem ◽

Class Distribution ◽

Imbalance Problem ◽

Misclassification Costs ◽

Imbalanced Class Distribution ◽

Classifier Learning

Classification of data with an imbalanced class distribution suffers from a significant drop in the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This paper provides a review of the classification of imbalanced data regarding: the application domains; the nature of the problem; the learning difficulties with standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.

Exploiting Correlation Subspace to Predict Heterogeneous Cross-Project Defects

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194016710017 ◽

2016 ◽

Vol 26 (09n10) ◽

pp. 1571-1580 ◽

Cited By ~ 6

Author(s):

Ming Cheng ◽

Guoqing Wu ◽

Hongyan Wan ◽

Guoan You ◽

Mengting Yuan ◽

...

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Feature Space ◽

Support Vector ◽

Class Imbalance Problem ◽

Classifier Design ◽

Imbalance Problem ◽

Project Data ◽

The Impact ◽

Cross Project

Cross-project defect prediction trains a prediction model using historical data from source projects and applies the model to target projects. Most previous efforts assumed that the cross-project data share the same metric set, meaning that both the metrics used and the size of the metric set are the same. However, this assumption may not hold in practical scenarios. In addition, software defect datasets suffer from the class-imbalance problem, which makes it harder for the learner to predict defects. In this paper, we advance canonical correlation analysis by deriving a joint feature space for associating cross-project data. We also propose a novel support vector machine algorithm which incorporates the correlation transfer information into classifier design for cross-project prediction. Moreover, we take different misclassification costs into consideration to incline the classifier towards classifying a module as defective, alleviating the impact of imbalanced data. The experimental results show that our method is more effective than state-of-the-art methods.

Tackling the Imbalanced Data in Software Maintainability Prediction Using Ensembles for Class Imbalance Problem

Advances in Interdisciplinary Research in Engineering and Business Management - Asset Analytics ◽

10.1007/978-981-16-0037-1_31 ◽

2021 ◽

pp. 391-399

Author(s):

Ruchika Malhotra ◽

Kusum Lata

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Software Maintainability

KGEARSRG: Kernel Graph Embedding on Attributed Relational SIFT-Based Regions Graph

Machine Learning and Knowledge Extraction ◽

10.3390/make1030055 ◽

2019 ◽

Vol 1 (3) ◽

pp. 962-973 ◽

Cited By ~ 1

Author(s):

Mario Manzo

Keyword(s):

Binary Classification ◽

Class Imbalance ◽

Graph Embedding ◽

Support Vector ◽

Class Imbalance Problem ◽

Scale Invariant ◽

Imbalance Problem ◽

Series Of Experiments ◽

Imbalanced Classes ◽

Scale Invariant Feature

In real-world applications, binary classification is often affected by imbalanced classes. In this paper, a new methodology to solve the class imbalance problem that occurs in image classification is proposed. A digital image is described through a novel vector-based representation called Kernel Graph Embedding on Attributed Relational Scale-Invariant Feature Transform-based Regions Graph (KGEARSRG). Classification is then performed with a procedure based on support vector machines (SVMs). The methodology is evaluated through a series of experiments on images from an art painting dataset, affected by varying imbalance percentages. Experimental results show that the proposed approach consistently outperforms the competitors.

Research on expansion and classification of imbalanced data based on SMOTE algorithm

Scientific Reports ◽

10.1038/s41598-021-03430-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Shujuan Wang ◽

Yuntao Dai ◽

Jihong Shen ◽

Jingxue Xuan

Keyword(s):

Big Data ◽

Class Imbalance ◽

Imbalanced Data ◽

Original Data ◽

Classification Performance ◽

Parameter Selection ◽

Sample Collection ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Sample Points

With the development of artificial intelligence, big data classification technology provides valuable help for auxiliary medical diagnosis research. However, because conditions differ across sample collections, medical big data are often imbalanced. The class-imbalance problem has been reported as a serious obstacle to the classification performance of many standard learning algorithms. The SMOTE algorithm can be used to generate sample points randomly to improve the imbalance rate, but its application is affected by the marginalization of generated samples and by blindness in parameter selection. Focusing on this problem, an improved SMOTE algorithm based on the normal distribution is proposed in this paper, so that new sample points are distributed closer to the center of the minority samples with higher probability, avoiding the marginalization of the expanded data. Experiments show that the classification effect is better when the proposed algorithm, rather than the original SMOTE algorithm, is used to expand the imbalanced Pima, WDBC, WPBC, Ionosphere and Breast-cancer-wisconsin datasets. In addition, the parameter selection of the proposed algorithm is analyzed, and it is found that, in our designed experiments, the classification effect is best when the selected parameters best maintain the distribution characteristics of the original data.
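
As background for the abstract above, the interpolation step of the original SMOTE algorithm can be sketched as follows: a synthetic point is placed at a uniformly random position on the segment between a minority sample and one of its nearest minority neighbours. This sketch covers the original algorithm only; the normal-distribution variant proposed in the paper is not reproduced because the abstract does not give its formula. The function names and the `k` and `seed` parameters are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points (needs at least 2 samples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself.
        others = [m for m in minority if m is not x]
        neighbours = sorted(others, key=lambda m: sq_dist(x, m))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # uniform on [0, 1), as in the original SMOTE
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nn)])
    return synthetic

# Toy minority class with three points (illustrative values only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=5)
print(len(new_points))
```

Since every synthetic point is a convex combination of two minority samples, it always lies between them. Where on the segment the point falls is governed entirely by how `gap` is drawn, which is the part of the procedure that the abstract says the improved algorithm changes.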
