Combining Synthetic Minority Oversampling Technique and Subset Feature Selection Technique For Class Imbalance Problem

Author(s):  
Pawan Lachheta ◽  
Seema Bawa
2016 ◽  
Vol 2016 ◽  
pp. 1-9 ◽  
Author(s):  
Jing Bian ◽  
Xin-guang Peng ◽  
Ying Wang ◽  
Hai Zhang

In the era of big data, feature selection is an essential process in machine learning. Although the class imbalance problem has recently attracted a great deal of attention, little effort has been undertaken to develop feature selection techniques. In addition, most applications involving feature selection focus on classification accuracy but not cost, although costs are important. To cope with imbalance problems, we developed a cost-sensitive feature selection algorithm that adds the cost-based evaluation function of a filter feature selection using a chaos genetic algorithm, referred to as CSFSG. The evaluation function considers both feature-acquiring costs (test costs) and misclassification costs in the field of network security, thereby weakening the influence of many instances from the majority of classes in large-scale datasets. The CSFSG algorithm reduces the total cost of feature selection and trades off both factors. The behavior of the CSFSG algorithm is tested on a large-scale dataset of network security, using two kinds of classifiers: C4.5 andk-nearest neighbor (KNN). The results of the experimental research show that the approach is efficient and able to effectively improve classification accuracy and to decrease classification time. In addition, the results of our method are more promising than the results of other cost-sensitive feature selection algorithms.


Author(s):  
Hartono Hartono ◽  
Erianto Ongko ◽  
Yeni Risyani

<span>In the classification process that contains class imbalance problems. In addition to the uneven distribution of instances which causes poor performance, overlapping problems also cause performance degradation. This paper proposes a method that combining feature selection and hybrid approach redefinition (HAR) method in handling class imbalance and overlapping for multi-class imbalanced. HAR was a hybrid ensembles method in handling class imbalance problem. The main contribution of this work is to produce a new method that can overcome the problem of class imbalance and overlapping in the multi-class imbalance problem.  This method must be able to give better results in terms of classifier performance and overlap degrees in multi-class problems. This is achieved by improving an ensemble learning algorithm and a preprocessing technique in HAR <span>using minimizing overlapping selection under SMOTE (MOSS). MOSS was known as a very popular feature selection method in handling overlapping. To validate the accuracy of the proposed method, this research use augmented R-Value, Mean AUC, Mean F-Measure, Mean G-Mean, and Mean Precision. The performance of the model is evaluated against the hybrid method (MBP+CGE) as a popular method in handling class imbalance and overlapping for multi-class imbalanced. It is found that the proposed method is superior when subjected to classifier performance as indicate with better Mean AUC, F-Measure, G-Mean, and precision.</span></span>


Author(s):  
Kewen Li ◽  
Mingxiao Yu ◽  
Lu Liu ◽  
Timing Li ◽  
Jiannan Zhai

The class imbalance problem has negative effects on the performance of feature selection in imbalanced data. Traditional feature selection algorithms always study on the balanced class distribution of the data and improve the overall classification accuracy for the optimization goal, which tends to be overwhelmed by the large classes, ignoring the small ones. This paper proposes a novel feature selection method based on the weighted mutual information (WMI) for the imbalanced data, defined as WMI algorithm. The WMI algorithm assigns different weights to the samples based on the fuzzy c-means (FCM) clustering algorithm and then calculates the mutual information based on the weight of each sample. This paper used the AUC as the evaluation criterion of the selected feature. At last, four unbalanced datasets from NASA software defect datasets are used to validate the proposed approach. Experimental results show that the proposed method achieves higher prediction accuracy of both minority class and majority class.


Healthcare ◽  
2021 ◽  
Vol 9 (3) ◽  
pp. 260
Author(s):  
Christos Kokkotis ◽  
Serafeim Moustakidis ◽  
Vasilios Baltzopoulos ◽  
Giannis Giakas ◽  
Dimitrios Tsaopoulos

Knee osteoarthritis (KOA) is a multifactorial disease which is responsible for more than 80% of the osteoarthritis disease’s total burden. KOA is heterogeneous in terms of rates of progression with several different phenotypes and a large number of risk factors, which often interact with each other. A number of modifiable and non-modifiable systemic and mechanical parameters along with comorbidities as well as pain-related factors contribute to the development of KOA. Although models exist to predict the onset of the disease or discriminate between asymptotic and OA patients, there are just a few studies in the recent literature that focused on the identification of risk factors associated with KOA progression. This paper contributes to the identification of risk factors for KOA progression via a robust feature selection (FS) methodology that overcomes two crucial challenges: (i) the observed high dimensionality and heterogeneity of the available data that are obtained from the Osteoarthritis Initiative (OAI) database and (ii) a severe class imbalance problem posed by the fact that the KOA progressors class is significantly smaller than the non-progressors’ class. The proposed feature selection methodology relies on a combination of evolutionary algorithms and machine learning (ML) models, leading to the selection of a relatively small feature subset of 35 risk factors that generalizes well on the whole dataset (mean accuracy of 71.25%). We investigated the effectiveness of the proposed approach in a comparative analysis with well-known FS techniques with respect to metrics related to both prediction accuracy and generalization capability. The impact of the selected risk factors on the prediction output was further investigated using SHapley Additive exPlanations (SHAP). The proposed FS methodology may contribute to the development of new, efficient risk stratification strategies and identification of risk phenotypes of each KOA patient to enable appropriate interventions.


Sign in / Sign up

Export Citation Format

Share Document