Credibility Based Imbalance Boosting Method for Software Defect Proneness Prediction

Haonan Tong; Shihai Wang; Guangling Li

doi:10.3390/app10228059

Credibility Based Imbalance Boosting Method for Software Defect Proneness Prediction

Applied Sciences ◽

10.3390/app10228059 ◽

2020 ◽

Vol 10 (22) ◽

pp. 8059

Author(s):

Haonan Tong ◽

Shihai Wang ◽

Guangling Li

Keyword(s):

Class Imbalance ◽

Area Under The Curve ◽

Imbalanced Data ◽

Class Imbalance Problem ◽

Synthetic Sample ◽

Promising Alternative ◽

Imbalance Problem ◽

Software Defect ◽

Boosting Method ◽

High Credibility

Imbalanced data are a major factor for degrading the performance of software defect models. Software defect dataset is imbalanced in nature, i.e., the number of non-defect-prone modules is far more than that of defect-prone ones, which results in the bias of classifiers on the majority class samples. In this paper, we propose a novel credibility-based imbalance boosting (CIB) method in order to address the class-imbalance problem in software defect proneness prediction. The method measures the credibility of synthetic samples based on their distribution by introducing a credit factor to every synthetic sample, and proposes a weight updating scheme to make the base classifiers focus on synthetic samples with high credibility and real samples. Experiments are performed on 11 NASA datasets and nine PROMISE datasets by comparing CIB with MAHAKIL, AdaC2, AdaBoost, SMOTE, RUS, No sampling method in terms of four performance measures, i.e., area under the curve (AUC), F1, AGF, and Matthews correlation coefficient (MCC). Wilcoxon sign-ranked test and Cliff’s δ are separately used to perform statistical test and calculate effect size. The experimental results show that CIB is a more promising alternative for addressing the class-imbalance problem in software defect-prone prediction as compared with previous methods.

Download Full-text

Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem

Information Systems ◽

10.1016/j.is.2015.02.006 ◽

2015 ◽

Vol 51 ◽

pp. 62-71 ◽

Cited By ~ 56

Author(s):

Michael J. Siers ◽

Md Zahidul Islam

Keyword(s):

Class Imbalance ◽

Defect Prediction ◽

Software Defect Prediction ◽

Potential Solution ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Software Defect ◽

Decision Forest

Download Full-text

A New Big Data Model Using Distributed Cluster-Based Resampling for Class-Imbalance Problem

Applied Computer Systems ◽

10.2478/acss-2019-0013 ◽

2019 ◽

Vol 24 (2) ◽

pp. 104-110

Author(s):

Duygu Sinanc Terzi ◽

Seref Sagiroglu

Keyword(s):

Big Data ◽

Class Imbalance ◽

Area Under The Curve ◽

Data Sets ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

The Common ◽

Public Datasets ◽

Distributed Cluster

Abstract The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims at modifying the existing dataset to increase the classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy has been designed to present the success of the model on data sets with different imbalanced ratios. The second strategy has been designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed other imbalanced big data solutions in the literature and increased area under the curve values between 10 % and 24 % through the case study.

Download Full-text

CoGBUS- Center of Gravity based under Sampling Method for Imbalanced Data Classification

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2077.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 2463-2468

Keyword(s):

Learning Community ◽

Sampling Method ◽

Class Imbalance ◽

Imbalanced Data ◽

Center Of Gravity ◽

Classification Algorithms ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Imbalanced Data Classification ◽

Under Sampling

Learning of class imbalanced data becomes a challenging issue in the machine learning community as all classification algorithms are designed to work for balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques- undersampling and oversampling are more flexible and versatile. This paper introduces a new concept for undersampling based on Center of Gravity principle which helps to reduce the excess instances of majority class. This work is suited for binary class problems. The proposed technique –CoGBUS- overcomes the class imbalance problem and brings best results in the study. We take F-Score, GMean and ROC for the performance evaluation of the method.

Download Full-text

COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction

Information and Software Technology ◽

10.1016/j.infsof.2020.106432 ◽

2021 ◽

Vol 129 ◽

pp. 106432 ◽

Cited By ~ 1

Author(s):

Shuo Feng ◽

Jacky Keung ◽

Xiao Yu ◽

Yan Xiao ◽

Kwabena Ebo Bennin ◽

...

Keyword(s):

Class Imbalance ◽

Defect Prediction ◽

Software Defect Prediction ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Software Defect

Download Full-text

Classifying Imbalanced Data Sets by a Novel RE-Sample and Cost-Sensitive Stacked Generalization Method

Mathematical Problems in Engineering ◽

10.1155/2018/5036710 ◽

2018 ◽

Vol 2018 ◽

pp. 1-13 ◽

Cited By ~ 8

Author(s):

Jianhong Yan ◽

Suqing Han

Keyword(s):

Learning Community ◽

Class Imbalance ◽

Imbalanced Data ◽

Data Sets ◽

Class Imbalance Problem ◽

Imbalanced Data Sets ◽

Imbalance Data ◽

Imbalance Problem ◽

Stacked Generalization ◽

Model Generalization

Learning with imbalanced data sets is considered as one of the key topics in machine learning community. Stacking ensemble is an efficient algorithm for normal balance data sets. However, stacking ensemble was seldom applied in imbalance data. In this paper, we proposed a novel RE-sample and Cost-Sensitive Stacked Generalization (RECSG) method based on 2-layer learning models. The first step is Level 0 model generalization including data preprocessing and base model training. The second step is Level 1 model generalization involving cost-sensitive classifier and logistic regression algorithm. In the learning phase, preprocessing techniques can be embedded in imbalance data learning methods. In the cost-sensitive algorithm, cost matrix is combined with both data characters and algorithms. In the RECSG method, ensemble algorithm is combined with imbalance data techniques. According to the experiment results obtained with 17 public imbalanced data sets, as indicated by various evaluation metrics (AUC, GeoMean, and AGeoMean), the proposed method showed the better classification performances than other ensemble and single algorithms. The proposed method is especially more efficient when the performance of base classifier is low. All these demonstrated that the proposed method could be applied in the class imbalance problem.

Download Full-text

CLASSIFICATION OF IMBALANCED DATA: A REVIEW

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001409007326 ◽

2009 ◽

Vol 23 (04) ◽

pp. 687-719 ◽

Cited By ~ 534

Author(s):

YANMIN SUN ◽

ANDREW K. C. WONG ◽

MOHAMED S. KAMEL

Keyword(s):

Learning Algorithms ◽

Class Imbalance ◽

Imbalanced Data ◽

Class Imbalance Problem ◽

Class Distribution ◽

Imbalance Problem ◽

Misclassification Costs ◽

Imbalanced Class Distribution ◽

Classifier Learning

Classification of data with imbalanced class distribution has encountered a significant drawback of the performance attainable by most standard classifier learning algorithms which assume a relatively balanced class distribution and equal misclassification costs. This paper provides a review of the classification of imbalanced data regarding: the application domains; the nature of the problem; the learning difficulties with standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.

Download Full-text

Exploiting Correlation Subspace to Predict Heterogeneous Cross-Project Defects

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194016710017 ◽

2016 ◽

Vol 26 (09n10) ◽

pp. 1571-1580 ◽

Cited By ~ 6

Author(s):

Ming Cheng ◽

Guoqing Wu ◽

Hongyan Wan ◽

Guoan You ◽

Mengting Yuan ◽

...

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Feature Space ◽

Support Vector ◽

Class Imbalance Problem ◽

Classifier Design ◽

Imbalance Problem ◽

Project Data ◽

The Impact ◽

Cross Project

Cross-project defect prediction trains a prediction model using historical data from source projects and applies the model to target projects. Most previous efforts assumed the cross-project data have the same metrics set, which means the metrics used and the size of metrics set are the same. However, this assumption may not hold in practical scenarios. In addition, software defect datasets have the class-imbalance problem which increases the difficulty for the learner to predict defects. In this paper, we advance canonical correlation analysis by deriving a joint feature space for associating cross-project data. We also propose a novel support vector machine algorithm which incorporates the correlation transfer information into classifier design for cross-project prediction. Moreover, we take different misclassification costs into consideration to make the classification inclining to classify a module as a defective one, alleviating the impact of imbalanced data. The experimental results show that our method is more effective compared to state-of-the-art methods.

Download Full-text

Impact of class-imbalance on multi-class high-dimensional class prediction

Advances in Methodology and Statistics ◽

10.51936/grxm1445 ◽

2012 ◽

Vol 9 (1) ◽

Author(s):

Rok Blagus ◽

Lara Lusa

Keyword(s):

Sample Size ◽

Binary Classification ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

High Dimensional ◽

Class Imbalance Problem ◽

Class Prediction ◽

Linear Discriminant ◽

Imbalance Problem

The goal of multi-class supervised classification is to develop a rule that accurately predicts the class membership of new samples when the number of classes is larger than two. In this paper we consider high-dimensional class-imbalanced data: the number of variables greatly exceeds the number of samples and the number of samples in each class is not equal. We focus on Friedman's one-versus-one approach for three-class problems and show how its class probabilities depend on the class probabilities from the binary classification sub-problems. We further explore its performance using diagonal linear discriminant analysis (DLDA) as a base classifier and compare its performance with multi-class DLDA, using simulated and real data. Our results show that the class-imbalance has a significant effect on the classification results: the classification is biased towards the majority class as in the two-class problems and the problem is magnified when the number of variables is large. The amount of the bias depends also, jointly, on the magnitude of the differences between the classes and on the sample size: the bias diminishes when the difference between the classes is larger or the sample size is increased. Also variable selection plays an important role in the class-imbalance problem and the most effective strategy depends on the type of differences that exist between classes. DLDA seems to be among the least sensible classifiers to class-imbalance and its use is recommended also for multi-class problems. Whenever possible the experiments should be planned using balanced data in order to avoid the class-imbalance problem.

Download Full-text

Feature Clustering and Ensemble Learning Based Approach for Software Defect Prediction

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999201109201259 ◽

2020 ◽

Vol 13 ◽

Author(s):

R. Srivastava ◽

Aman Kumar Jain

Keyword(s):

Clustering Algorithm ◽

Class Imbalance ◽

Software Defect Prediction ◽

Class Imbalance Problem ◽

Feature Clustering ◽

Significance Level ◽

Software Products ◽

Imbalance Problem ◽

Software Defect ◽

Software Modules

Objective:: Defects in delivered software products not only have financial implications but also blemish the reputation of the organisation and lead to wastage of time and human resource. This paper aims to detect defects in software modules. Methods:: Our approach sequentially combines SMOTE algorithm to deal with class imbalance problem, K - means clustering algorithm to obtain a set of key features based on inter-class and intra-class coefficient of correlation and ensemble modelling to predict defects in software modules. After cautious examination, an ensemble framework of XGBoost, Decision Tree and Random Forest is used for prediction of software defects owing to numerous merits of ensembling approach. Results:: We have used five open-source datasets from NASA Promise Repository for Software Engineering. The result obtained from our approach has been compared with that of individual algorithms used in ensemble. A confidence interval for the accuracy of our approach with respect to performance evaluation metrics namely Accuracy, Precision, Recall, F1 score and AUC score has also been constructed at a significance level of 0.01. Conclusion:: Results have been depicted pictographically.

Download Full-text

Tackling the Imbalanced Data in Software Maintainability Prediction Using Ensembles for Class Imbalance Problem

Advances in Interdisciplinary Research in Engineering and Business Management - Asset Analytics ◽

10.1007/978-981-16-0037-1_31 ◽

2021 ◽

pp. 391-399

Author(s):

Ruchika Malhotra ◽

Kusum Lata

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Software Maintainability

Download Full-text