A New Framework Consisted of Data Preprocessing and Classifier Modelling for Software Defect Prediction

This article describes how classification methods on software defect prediction is widely researched due to the need to increase the software quality and decrease testing efforts. However, findings of past researches done on this issue has not shown any classifier which proves to be superior to the other. Additionally, there is a lack of research that studies the effects and accuracy of genetic programming on software defect prediction. To find solutions for this problem, a comparative software defect prediction experiment between genetic programming and neural networks are performed on four datasets from the NASA Metrics Data repository. Generally, an interesting degree of accuracy is detected, which shows how the metric-based classification is useful. Nevertheless, this article specifies that the application and usage of genetic programming is highly recommended due to the detailed analysis it provides, as well as an important feature in this classification method which allows the viewing of each attributes impact in the dataset.

Download Full-text

An effective rank approach to software defect prediction using software metrics

2016 10th International Conference on Intelligent Systems and Control (ISCO) ◽

10.1109/isco.2016.7727030 ◽

2016 ◽

Author(s):

P Lakshmi ◽

T. Latha Maheswari

Keyword(s):

Software Metrics ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Effective Rank

Download Full-text

Performance Analysis of Feature Selection Methods in Software Defect Prediction: A Search Method Approach

Applied Sciences ◽

10.3390/app9132764 ◽

2019 ◽

Vol 9 (13) ◽

pp. 2764 ◽

Cited By ~ 8

Author(s):

Abdullateef Oluwagbemiga Balogun ◽

Shuib Basri ◽

Said Jadid Abdulkadir ◽

Ahmad Sobri Hashim

Keyword(s):

Software Metrics ◽

Prediction Models ◽

Predictive Performance ◽

Search Method ◽

Feature Subset Selection ◽

Defect Prediction ◽

Feature Subset ◽

Software Defect Prediction ◽

Software Defect

Software Defect Prediction (SDP) models are built using software metrics derived from software systems. The quality of SDP models depends largely on the quality of software metrics (dataset) used to build the SDP models. High dimensionality is one of the data quality problems that affect the performance of SDP models. Feature selection (FS) is a proven method for addressing the dimensionality problem. However, the choice of FS method for SDP is still a problem, as most of the empirical studies on FS methods for SDP produce contradictory and inconsistent quality outcomes. Those FS methods behave differently due to different underlining computational characteristics. This could be due to the choices of search methods used in FS because the impact of FS depends on the choice of search method. It is hence imperative to comparatively analyze the FS methods performance based on different search methods in SDP. In this paper, four filter feature ranking (FFR) and fourteen filter feature subset selection (FSS) methods were evaluated using four different classifiers over five software defect datasets obtained from the National Aeronautics and Space Administration (NASA) repository. The experimental analysis showed that the application of FS improves the predictive performance of classifiers and the performance of FS methods can vary across datasets and classifiers. In the FFR methods, Information Gain demonstrated the greatest improvements in the performance of the prediction models. In FSS methods, Consistency Feature Subset Selection based on Best First Search had the best influence on the prediction models. However, prediction models based on FFR proved to be more stable than those based on FSS methods. Hence, we conclude that FS methods improve the performance of SDP models, and that there is no single best FS method, as their performance varied according to datasets and the choice of the prediction model. However, we recommend the use of FFR methods as the prediction models based on FFR are more stable in terms of predictive performance.

Download Full-text

Software Defect Prediction Using Propositionalization Based Data Preprocessing: An Empirical Study

2018 2nd International Conference on Data Science and Business Analytics (ICDSBA) ◽

10.1109/icdsba.2018.00021 ◽

2018 ◽

Author(s):

CholMyong Pak ◽

Tiantian Wang ◽

Xiaohong Su

Keyword(s):

Empirical Study ◽

Data Preprocessing ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Download Full-text

The Use of Ensemble-Based Data Preprocessing Techniques for Software Defect Prediction

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194014400105 ◽

2014 ◽

Vol 24 (09) ◽

pp. 1229-1253 ◽

Cited By ~ 3

Author(s):

Kehan Gao ◽

Taghi M. Khoshgoftaar ◽

Amri Napolitano

Keyword(s):

Feature Selection ◽

Prediction Models ◽

Measurement Data ◽

Class Imbalance ◽

Data Preprocessing ◽

High Dimensionality ◽

Training Dataset ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect

Software defect prediction models that use software metrics such as code-level measurements and defect data to build classification models are useful tools for identifying potentially-problematic program modules. Effectiveness of detecting such modules is affected by the software measurements used, making data preprocessing an important step during software quality prediction. Generally, there are two problems affecting software measurement data: high dimensionality (where a training dataset has an extremely large number of independent attributes, or features) and class imbalance (where a training dataset has one class with relatively many more members than the other class). In this paper, we present a novel form of ensemble learning based on boosting that incorporates data sampling to alleviate class imbalance and feature (software metric) selection to address high dimensionality. As we adopt two different sampling methods (Random Undersampling (RUS) and Synthetic Minority Oversampling (SMOTE)) in the technique, we have two forms of our new ensemble-based approach: selectRUSBoost and selectSMOTEBoost. To evaluate the effectiveness of these new techniques, we apply them to two groups of datasets from two real-world software systems. In the experiments, four learners and nine feature selection techniques are employed to build our models. We also consider versions of the technique which do not incorporate feature selection, and compare all four techniques (the two different ensemble-based approaches which utilize feature selection and the two versions which use sampling only). The experimental results demonstrate that selectRUSBoost is generally more effective in improving defect prediction performance than selectSMOTEBoost, and that the techniques with feature selection do help for getting better prediction than the techniques without feature selection.

Download Full-text

Combining Data Preprocessing Methods With Imputation Techniques for Software Defect Prediction

International Journal of Open Source Software and Processes ◽

10.4018/ijossp.2018010101 ◽

2018 ◽

Vol 9 (1) ◽

pp. 1-19 ◽

Cited By ~ 1

Author(s):

Misha Kakkar ◽

Sarika Jain ◽

Abhay Bansal ◽

P.S. Grover

Keyword(s):

Linear Regression ◽

Missing Values ◽

Data Preprocessing ◽

Machine Learning Algorithms ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Combining Data ◽

Feature Selector ◽

Traditional Performance

Software Defect Prediction (SDP) models are used to predict, whether software is clean or buggy using the historical data collected from various software repositories. The data collected from such repositories may contain some missing values. In order to estimate missing values, imputation techniques are used, which utilizes the complete observed values in the dataset. The objective of this study is to identify the best-suited imputation technique for handling missing values in SDP dataset. In addition to identifying the imputation technique, the authors have investigated for the most appropriate combination of imputation technique and data preprocessing method for building SDP model. In this study, four combinations of imputation technique and data preprocessing methods are examined using the improved NASA datasets. These combinations are used along with five different machine-learning algorithms to develop models. The performance of these SDP models are then compared using traditional performance indicators. Experiment results show that among different imputation techniques, linear regression gives the most accurate imputed value. The combination of linear regression with correlation based feature selector outperforms all other combinations. To validate the significance of data preprocessing methods with imputation the findings are applied to open source projects. It was concluded that the result is in consistency with the above conclusion.

Download Full-text

Software Defect Prediction Using Software Metrics with Naïve Bayes and Rule Mining Association Methods

2019 5th International Conference on Science and Technology (ICST) ◽

10.1109/icst47872.2019.9166448 ◽

2019 ◽

Author(s):

Fernando Maruli Tua ◽

Wikan Danar Sunindyo

Keyword(s):

Software Metrics ◽

Naive Bayes ◽

Naïve Bayes ◽

Defect Prediction ◽

Software Defect Prediction ◽

Rule Mining ◽

Software Defect

Download Full-text

Combining Data Preprocessing Methods With Imputation Techniques for Software Defect Prediction

Research Anthology on Recent Trends, Tools, and Implications of Computer Programming ◽

10.4018/978-1-7998-3016-0.ch081 ◽

2021 ◽

pp. 1792-1811

Author(s):

Misha Kakkar ◽

Sarika Jain ◽

Abhay Bansal ◽

P.S. Grover

Keyword(s):

Linear Regression ◽

Missing Values ◽

Data Preprocessing ◽

Machine Learning Algorithms ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Combining Data ◽

Feature Selector ◽

Traditional Performance

Software Defect Prediction (SDP) models are used to predict, whether software is clean or buggy using the historical data collected from various software repositories. The data collected from such repositories may contain some missing values. In order to estimate missing values, imputation techniques are used, which utilizes the complete observed values in the dataset. The objective of this study is to identify the best-suited imputation technique for handling missing values in SDP dataset. In addition to identifying the imputation technique, the authors have investigated for the most appropriate combination of imputation technique and data preprocessing method for building SDP model. In this study, four combinations of imputation technique and data preprocessing methods are examined using the improved NASA datasets. These combinations are used along with five different machine-learning algorithms to develop models. The performance of these SDP models are then compared using traditional performance indicators. Experiment results show that among different imputation techniques, linear regression gives the most accurate imputed value. The combination of linear regression with correlation based feature selector outperforms all other combinations. To validate the significance of data preprocessing methods with imputation the findings are applied to open source projects. It was concluded that the result is in consistency with the above conclusion.

Download Full-text

A Study on Software Metrics based Software Defect Prediction using Data Mining and Machine Learning Techniques

International Journal of Database Theory and Application ◽

10.14257/ijdta.2015.8.3.15 ◽

2015 ◽

Vol 8 (3) ◽

pp. 179-190 ◽

Cited By ~ 15

Author(s):

Manjula.C.M. Prasad ◽

Lilly Florence Florence ◽

Arti Arya3

Keyword(s):

Machine Learning ◽

Data Mining ◽

Software Metrics ◽

Machine Learning Techniques ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Learning Techniques ◽

Using Data

Download Full-text

A New Data Mining-Based Framework to Test Case Prioritization Using Software Defect Prediction

International Journal of Open Source Software and Processes ◽

10.4018/ijossp.2017010102 ◽

2017 ◽

Vol 8 (1) ◽

pp. 21-41

Author(s):

Emad Alsukhni ◽

Ahmad A. Saifan ◽

Hanadi Alawneh

Keyword(s):

Data Mining ◽

Software Metrics ◽

Fault Prediction ◽

Test Case ◽

Test Cases ◽

Test Case Prioritization ◽

Software Defect Prediction ◽

Average Percentage ◽

Software Defect ◽

New Framework

Test cases do not have the same importance when used to detect faults in software; therefore, it is more efficient to test the system with the test cases that have the ability to detect the faults. This research proposes a new framework that combines data mining techniques to prioritize the test cases. It enhances fault prediction and detection using two different techniques: 1) the data mining regression classifier that depends on software metrics to predict defective modules, and 2) the k-means clustering technique that is used to select and prioritize test cases to identify the fault early. Our approach of test case prioritization yields good results in comparison with other studies. The authors used the Average Percentage of Faults Detection (APFD) metric to evaluate the proposed framework, which results in 19.9% for all system modules and 25.7% for defective ones. Our results give us an indication that it is effective to start the testing process with the most defective modules instead of testing all modules arbitrary arbitrarily.

Download Full-text