Combining Synthetic Minority Oversampling Technique and Subset Feature Selection Technique For Class Imbalance Problem

In the era of big data, feature selection is an essential process in machine learning. Although the class imbalance problem has recently attracted a great deal of attention, little effort has gone into developing feature selection techniques for it. In addition, most applications involving feature selection focus on classification accuracy but not cost, although costs are important. To cope with imbalance problems, we developed a cost-sensitive feature selection algorithm, referred to as CSFSG, that adds a cost-based evaluation function to filter feature selection using a chaos genetic algorithm. The evaluation function considers both feature-acquiring costs (test costs) and misclassification costs in the field of network security, thereby weakening the influence of the many instances from the majority classes in large-scale datasets. The CSFSG algorithm reduces the total cost of feature selection and trades off both factors. The behavior of the CSFSG algorithm is tested on a large-scale network security dataset, using two kinds of classifiers: C4.5 and k-nearest neighbor (KNN). Experimental results show that the approach is efficient, improves classification accuracy, and decreases classification time. In addition, the results of our method are more promising than those of other cost-sensitive feature selection algorithms.
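The abstract does not give the form of the CSFSG evaluation function, so the following is only a minimal sketch of the general idea of scoring a feature subset by test cost plus misclassification cost. The additive form, the function name `average_total_cost`, the feature names, and all cost values are assumptions made for illustration, not the authors' definition.

```python
def average_total_cost(selected, test_costs, y_true, y_pred, misclass_cost):
    """Per-instance cost of a feature subset (assumed additive form)."""
    # Test cost: every selected feature has to be acquired for each instance.
    acquisition = sum(test_costs[f] for f in selected)
    # Misclassification cost: looked up by (true label, predicted label).
    penalty = sum(misclass_cost[(t, p)] for t, p in zip(y_true, y_pred) if t != p)
    return acquisition + penalty / len(y_true)


# Hypothetical feature costs and cost matrix; label 1 is the rare attack class.
test_costs = {"duration": 1.0, "payload_entropy": 2.0}
misclass_cost = {(1, 0): 20.0, (0, 1): 1.0}
y_true = [0, 0, 0, 1]

# Predictions are hard-coded here; in practice they would come from a classifier
# trained on the selected features.
cheap = average_total_cost(["duration"], test_costs, y_true, [0, 0, 0, 0], misclass_cost)
rich = average_total_cost(["duration", "payload_entropy"], test_costs, y_true, [0, 0, 0, 1], misclass_cost)
print(cheap, rich)
```

With these made-up numbers the cheaper subset misses the single minority instance and ends up costing 6.0 per instance, against 3.0 for the larger subset. Accuracy alone would rate the first subset at 75% and hide that difference.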


Combining feature selection and hybrid approach redefinition in handling class imbalance and overlapping for multi-class imbalanced

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v21.i3.pp1513-1522 ◽

2021 ◽

Vol 21 (3) ◽

pp. 1513

Author(s):

Hartono Hartono ◽

Erianto Ongko ◽

Yeni Risyani

Keyword(s):

Feature Selection ◽

Learning Algorithm ◽

Hybrid Approach ◽

Feature Selection Method ◽

Class Imbalance ◽

Poor Performance ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Classifier Performance ◽

F Measure

Class imbalance degrades classification: beyond the uneven distribution of instances, which causes poor performance, overlap between classes also degrades performance. This paper proposes a method that combines feature selection with the hybrid approach redefinition (HAR) method to handle class imbalance and overlapping in multi-class imbalanced data. HAR is a hybrid ensemble method for handling the class imbalance problem. The main contribution of this work is a new method that can overcome both class imbalance and overlapping in the multi-class imbalance problem. The method is intended to give better results in terms of classifier performance and overlap degree in multi-class problems. This is achieved by improving the ensemble learning algorithm and the preprocessing technique in HAR using minimizing overlapping selection under SMOTE (MOSS). MOSS is a popular feature selection method for handling overlapping. To validate the proposed method, this research uses augmented R-Value, Mean AUC, Mean F-Measure, Mean G-Mean, and Mean Precision. The model is evaluated against the hybrid method (MBP+CGE), a popular method for handling class imbalance and overlapping in multi-class imbalanced data. The proposed method is found to be superior in classifier performance, as indicated by better Mean AUC, F-Measure, G-Mean, and precision.
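For readers unfamiliar with SMOTE, which MOSS builds on, the interpolation step can be sketched as follows. This is plain SMOTE under simplifying assumptions (numeric features, Euclidean distance, at least two minority points), not MOSS or HAR; the function name `smote` and its parameters are hypothetical.

```python
import random
from math import dist


def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority points by interpolating toward near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x, excluding x itself.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        # Pick a random point on the segment between x and the chosen neighbour.
        gap = rng.random()
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(minority, n_synthetic=5)
```

Every synthetic point lies on a line segment between two existing minority points. This is also why SMOTE can increase class overlap when minority and majority regions are close together.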


Feature Selection Method from Multiclass Text with Class Imbalance Problem

Journal of the Korean Institute of Industrial Engineers ◽

10.7232/jkiie.2019.45.2.093 ◽

2019 ◽

Vol 45 (2) ◽

pp. 93-100

Author(s):

Minji Seo ◽

Gilseung Ahn ◽

Sun Hur

Keyword(s):

Feature Selection ◽

Feature Selection Method ◽

Class Imbalance ◽

Selection Method ◽

Class Imbalance Problem ◽

Imbalance Problem


Combating the Small Sample Class Imbalance Problem Using Feature Selection

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2009.187 ◽

2010 ◽

Vol 22 (10) ◽

pp. 1388-1400 ◽

Cited By ~ 193

Author(s):

Mike Wasikowski ◽

Xue-wen Chen

Keyword(s):

Feature Selection ◽

Class Imbalance ◽

Small Sample ◽

Class Imbalance Problem ◽

Imbalance Problem


Feature Selection Method Based on Weighted Mutual Information for Imbalanced Data

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194018500341 ◽

2018 ◽

Vol 28 (08) ◽

pp. 1177-1194 ◽

Cited By ~ 2

Author(s):

Kewen Li ◽

Mingxiao Yu ◽

Lu Liu ◽

Timing Li ◽

Jiannan Zhai

Keyword(s):

Feature Selection ◽

Mutual Information ◽

Clustering Algorithm ◽

Feature Selection Method ◽

Class Imbalance ◽

Imbalanced Data ◽

Selection Method ◽

Class Imbalance Problem ◽

Minority Class ◽

Imbalance Problem

The class imbalance problem has negative effects on the performance of feature selection in imbalanced data. Traditional feature selection algorithms typically assume a balanced class distribution and take overall classification accuracy as the optimization goal, which tends to be dominated by the large classes while the small ones are ignored. This paper proposes a novel feature selection method for imbalanced data based on weighted mutual information (WMI), called the WMI algorithm. The WMI algorithm assigns different weights to the samples based on the fuzzy c-means (FCM) clustering algorithm and then calculates the mutual information based on the weight of each sample. The AUC is used as the evaluation criterion for the selected features. Finally, four imbalanced datasets from the NASA software defect datasets are used to validate the proposed approach. Experimental results show that the proposed method achieves higher prediction accuracy for both the minority class and the majority class.
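The effect of sample weights on mutual information can be sketched as below. The abstract derives the weights from FCM clustering but gives no formula, so this sketch substitutes inverse class frequency as a stand-in; that substitution, the function name, and the toy data are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from math import log2


def weighted_mutual_information(feature, labels, weights):
    """I(X; Y) in bits, with each sample contributing its weight instead of a count of 1."""
    total = float(sum(weights))
    joint, px, py = defaultdict(float), defaultdict(float), defaultdict(float)
    for x, y, w in zip(feature, labels, weights):
        p = w / total
        joint[(x, y)] += p
        px[x] += p
        py[y] += p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)


labels = [0] * 8 + [1] * 2          # 8 majority samples, 2 minority samples
feature = list(labels)              # toy discrete feature that separates the classes

uniform = [1.0] * len(labels)
class_counts = Counter(labels)
# Stand-in for the FCM-based weights: inverse class frequency.
balanced = [1.0 / class_counts[y] for y in labels]

mi_plain = weighted_mutual_information(feature, labels, uniform)
mi_weighted = weighted_mutual_information(feature, labels, balanced)
```

With uniform weights the score is capped by the entropy of the skewed label distribution, about 0.72 bits here. With balancing weights the same feature scores a full 1 bit, so a feature that isolates the minority class is no longer undervalued.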


Semi Supervised Under-Sampling: A Solution to the Class Imbalance Problem for Classification and Feature Selection

Transactions on Engineering Technologies ◽

10.1007/978-94-017-8832-8_44 ◽

2014 ◽

pp. 611-625 ◽

Cited By ~ 1

Author(s):

M. Mostafizur Rahman ◽

Darryl N. Davis

Keyword(s):

Feature Selection ◽

Class Imbalance ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Under Sampling


Identifying Robust Risk Factors for Knee Osteoarthritis Progression: An Evolutionary Machine Learning Approach

Healthcare ◽

10.3390/healthcare9030260 ◽

2021 ◽

Vol 9 (3) ◽

pp. 260

Author(s):

Christos Kokkotis ◽

Serafeim Moustakidis ◽

Vasilios Baltzopoulos ◽

Giannis Giakas ◽

Dimitrios Tsaopoulos

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Feature Selection ◽

Knee Osteoarthritis ◽

Class Imbalance ◽

Feature Subset ◽

Class Imbalance Problem ◽

Related Factors ◽

Imbalance Problem ◽

The Impact

Knee osteoarthritis (KOA) is a multifactorial disease which is responsible for more than 80% of the osteoarthritis disease’s total burden. KOA is heterogeneous in terms of rates of progression with several different phenotypes and a large number of risk factors, which often interact with each other. A number of modifiable and non-modifiable systemic and mechanical parameters along with comorbidities as well as pain-related factors contribute to the development of KOA. Although models exist to predict the onset of the disease or discriminate between asymptomatic and OA patients, only a few studies in the recent literature have focused on the identification of risk factors associated with KOA progression. This paper contributes to the identification of risk factors for KOA progression via a robust feature selection (FS) methodology that overcomes two crucial challenges: (i) the observed high dimensionality and heterogeneity of the available data that are obtained from the Osteoarthritis Initiative (OAI) database and (ii) a severe class imbalance problem posed by the fact that the KOA progressors class is significantly smaller than the non-progressors’ class. The proposed feature selection methodology relies on a combination of evolutionary algorithms and machine learning (ML) models, leading to the selection of a relatively small feature subset of 35 risk factors that generalizes well on the whole dataset (mean accuracy of 71.25%). We investigated the effectiveness of the proposed approach in a comparative analysis with well-known FS techniques with respect to metrics related to both prediction accuracy and generalization capability. The impact of the selected risk factors on the prediction output was further investigated using SHapley Additive exPlanations (SHAP). The proposed FS methodology may contribute to the development of new, efficient risk stratification strategies and identification of risk phenotypes of each KOA patient to enable appropriate interventions.
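SHAP attributes a prediction to features using Shapley values. As a rough illustration of that underlying quantity only, the sketch below computes exact Shapley values by enumerating feature orderings, which is feasible only for a handful of features. It does not use the SHAP library, and the feature names and the additive toy "model" are hypothetical, not taken from the paper.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    phi = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value(frozenset(present))
            present.add(f)
            # Marginal contribution of f given the features added before it.
            phi[f] += value(frozenset(present)) - before
    return {f: total / len(orderings) for f, total in phi.items()}


# Toy value function: model output when only the given features are "known".
# It is additive, so each feature's Shapley value equals its own contribution.
contribution = {"age": 2.0, "bmi": 1.0, "knee_pain": 0.0}


def toy_model(subset):
    return sum(contribution[f] for f in subset)


phi = shapley_values(list(contribution), toy_model)
```

The attributions always sum to the difference between the output with all features present and the output with none, which is what makes them convenient for ranking risk factors by impact.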
