COSM: Controlled Over-Sampling Method

2020 ◽  
Vol 8 (2) ◽  
pp. 42-51
Author(s):  
Gaetano Zazzaro

The class imbalance problem is widespread in Data Mining, and it can reduce the overall performance of a classification model. Many techniques have been proposed to overcome it, making it possible to train models that handle rare events. The methodology presented in this paper, called the Controlled Over-Sampling Method (COSM), includes a controller model that rejects new synthetic elements for which there is no certainty of belonging to the minority class. It combines the common Machine Learning holdout method with an oversampling algorithm, for example the classic SMOTE algorithm. The proposal explained and designed here represents a guideline for the application of oversampling algorithms, as well as a brief overview of techniques for overcoming the class imbalance problem in Data Mining.
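The controller idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact algorithm: the controller is assumed here to be a plain k-NN vote over the original data, and the `accept` threshold is a hypothetical parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_candidates(X_min, n_new, k=5):
    """Generate synthetic minority points by interpolating toward a
    random k-nearest minority neighbour (the classic SMOTE step)."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nn)
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

def controlled_oversample(X, y, minority=1, n_new=20, k=5, accept=0.6):
    """COSM-style loop (a sketch, not the paper's exact controller):
    keep a synthetic point only if the controller -- here a plain k-NN
    vote over the original data -- assigns it to the minority class
    with at least `accept` confidence."""
    cand = smote_candidates(X[y == minority], n_new, k)
    kept = []
    for c in cand:
        d = np.linalg.norm(X - c, axis=1)
        votes = y[np.argsort(d)[:k]]
        if np.mean(votes == minority) >= accept:  # controller accepts
            kept.append(c)
    return np.array(kept)
```

Rejected candidates are simply discarded, so the output may contain fewer than `n_new` points; that is the point of the control step.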

Author(s):  
Hartono Hartono ◽  
Erianto Ongko

Class imbalance is one of the main problems in classification because the number of samples in the majority class far exceeds the number of samples in the minority class. The class imbalance problem in a multi-class dataset is much more difficult to handle than in a two-class dataset, and it becomes even more complicated when accompanied by overlapping. One method that has proven reliable in dealing with this problem is the Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) method, a hybrid approach that combines sampling and classifier ensembles; in terms of diversity among classifiers, such hybrid approaches give better results. HAR-MI delivers excellent results in handling multi-class imbalance and uses SMOTE to increase the number of samples in the minority class. However, SMOTE has a weakness: on an extremely imbalanced dataset with a large number of attributes, it leads to over-fitting. To overcome this over-fitting, the Hybrid Sampling method was proposed. Combining HAR-MI with Hybrid Sampling increases the number of samples in the minority class and at the same time reduces the number of noise samples in the majority class. The preprocessing stage of HAR-MI uses the Minimizing Overlapping Selection under Hybrid Sampling (MOSHS) method, and the processing stage uses Different Contribution Sampling. The results obtained are compared with results using Neighbourhood-based undersampling. Overlapping and classifier performance are measured using the Augmented R-Value, the Matthews Correlation Coefficient (MCC), Precision, Recall, and F-Value. The results show that HAR-MI with Hybrid Sampling gives better results in terms of Augmented R-Value, Precision, Recall, and F-Value.
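The core hybrid idea, oversampling the minority while cleaning noisy majority points in the overlapping region, can be sketched as follows. This is an illustrative two-step procedure in the spirit of the abstract, not the authors' MOSHS / Different Contribution Sampling pipeline; the neighbourhood rule and `k` are assumptions.

```python
import numpy as np

def hybrid_sample(X, y, minority=1, k=3, seed=0):
    """Illustrative hybrid step: (1) drop majority points whose
    neighbourhood is dominated by the minority class (likely noise in
    the overlapping region), then (2) oversample the minority by
    interpolation until the classes are balanced."""
    rng = np.random.default_rng(seed)
    keep = np.ones(len(y), bool)
    for i in np.where(y != minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        if np.mean(y[nn] == minority) > 0.5:  # overlapping majority point
            keep[i] = False
    X, y = X[keep], y[keep]
    X_min = X[y == minority]
    need = (y != minority).sum() - (y == minority).sum()
    for _ in range(max(need, 0)):
        i, j = rng.integers(len(X_min), size=2)
        X = np.vstack([X, X_min[i] + rng.random() * (X_min[j] - X_min[i])])
        y = np.append(y, minority)
    return X, y
```

The two directions of the hybrid step are what distinguish it from plain SMOTE: the majority side shrinks while the minority side grows.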


Author(s):  
Sayan Surya Shaw ◽  
Shameem Ahmed ◽  
Samir Malakar ◽  
Laura Garcia-Hernandez ◽  
Ajith Abraham ◽  
...  

Abstract Many real-life datasets are imbalanced in nature, which implies that the number of samples present in one class (the minority class) is exceptionally small compared to the number of samples in the other class (the majority class). Hence, if we directly fit these datasets to a standard classifier for training, it often overlooks the minority class samples while estimating the class-separating hyperplane(s), and as a result misclassifies the minority class samples. To solve this problem, over the years, many researchers have followed different approaches. However, the selection of truly representative samples from the majority class is still considered an open research problem. A better solution to this problem would be helpful in many applications such as fraud detection, disease prediction, and text classification. Recent studies also show that it is necessary to analyze not only the disproportion between classes but also other difficulties rooted in the nature of the data, which calls for a more flexible, self-adaptable, computationally efficient, real-time method for selecting majority class samples without losing much important information. Keeping this in mind, we have proposed a hybrid model combining Particle Swarm Optimization (PSO), a popular swarm intelligence-based meta-heuristic algorithm, and the Ring Theory (RT)-based Evolutionary Algorithm (RTEA), a recently proposed physics-based meta-heuristic algorithm. We have named the algorithm RT-based PSO, or RTPSO for short. RTPSO can select the most representative samples from the majority class, as it takes advantage of the efficient exploration and exploitation phases of its parent algorithms to strengthen the search process. We have used the AdaBoost classifier to observe the final classification results of our model. The effectiveness of our proposed method has been evaluated on 15 standard real-life datasets having low to extreme imbalance ratios. The performance of RTPSO has been compared with PSO, RTEA, and other standard undersampling methods. The obtained results demonstrate the superiority of RTPSO over the state-of-the-art class imbalance problem-solvers considered here for comparison. The source code of this work is available at https://github.com/Sayansurya/RTPSO_Class_imbalance.
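The idea of casting majority-sample selection as a swarm search can be sketched with plain binary PSO (the RT-based hybridization of the paper is not reproduced here, and the nearest-centroid fitness below is an assumption chosen for brevity).

```python
import numpy as np

def gmean_fitness(mask, X_maj, X_min, X_val, y_val):
    """Fitness of a majority-subset mask: G-mean of a nearest-centroid
    classifier trained on (selected majority + all minority) samples."""
    sel = mask.astype(bool)
    if not sel.any():
        return 0.0
    c0, c1 = X_maj[sel].mean(axis=0), X_min.mean(axis=0)
    pred = (np.linalg.norm(X_val - c1, axis=1)
            < np.linalg.norm(X_val - c0, axis=1)).astype(int)
    acc0 = (pred[y_val == 0] == 0).mean()
    acc1 = (pred[y_val == 1] == 1).mean()
    return np.sqrt(acc0 * acc1)

def binary_pso_undersample(X_maj, X_min, X_val, y_val,
                           n_particles=10, iters=20, seed=0):
    """Each particle is a bit mask over majority samples; velocities
    follow the usual PSO update and are squashed through a sigmoid to
    flip bits (standard binary PSO)."""
    rng = np.random.default_rng(seed)
    n = len(X_maj)
    pos = (rng.random((n_particles, n)) < 0.5).astype(float)
    vel = rng.normal(0, 1, (n_particles, n))
    pbest = pos.copy()
    pfit = np.array([gmean_fitness(p, X_maj, X_min, X_val, y_val) for p in pos])
    gbest, gfit = pbest[pfit.argmax()].copy(), pfit.max()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, n))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = (rng.random((n_particles, n)) < 1 / (1 + np.exp(-vel))).astype(float)
        fit = np.array([gmean_fitness(p, X_maj, X_min, X_val, y_val) for p in pos])
        better = fit > pfit
        pbest[better], pfit[better] = pos[better], fit[better]
        if fit.max() > gfit:
            gfit, gbest = fit.max(), pos[fit.argmax()].copy()
    return gbest.astype(bool)  # mask of retained majority samples
```

The returned mask plays the role of the "most representative samples" selection; in the paper the retained set would then feed an AdaBoost classifier.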


2021 ◽  
Vol 9 (1) ◽  
pp. 25
Author(s):  
Maulida Ayu Fitriani ◽  
Dany Candra Febrianto

Direct marketing is an effort made by a bank to increase sales of its products and services, but the bank sometimes has to contact a customer or prospective customer more than once to ascertain whether they are willing to subscribe to a product or service. To overcome this ineffective process, several data mining methods have been proposed. This study compares data mining methods such as Naïve Bayes, K-NN, Random Forest, SVM, J48, and AdaBoost J48, applying the SMOTE pre-processing technique before classification to eliminate the class imbalance problem in the Bank Marketing dataset. The SMOTE + Random Forest method produced the highest accuracy in this study, 92.61%.
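The SMOTE + Random Forest pipeline can be sketched as below. The data here is a synthetic stand-in for the Bank Marketing dataset, and SMOTE is a minimal NumPy re-implementation rather than a library call; note that oversampling is applied to the training split only, so the test split stays untouched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def smote_balance(X, y, minority=1, k=5, seed=0):
    """SMOTE-style balancing: interpolate between minority points and
    their nearest minority neighbours until the classes are equal."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    need = (y != minority).sum() - len(X_min)
    if need <= 0:
        return X, y
    synth = []
    for _ in range(need):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])
        synth.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return (np.vstack([X, np.array(synth)]),
            np.concatenate([y, np.full(need, minority)]))

# toy imbalanced problem standing in for the Bank Marketing data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(2, 1, (20, 4))])
y = np.array([0] * 200 + [1] * 20)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = smote_balance(X_tr, y_tr)   # oversample the training split only
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
acc = clf.score(X_te, y_te)
```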


Author(s):  
Ruchika Malhotra ◽  
Kusum Lata

To facilitate software maintenance and save maintenance cost, numerous machine learning (ML) techniques have been studied to predict the maintainability of software modules or classes. The research community has put considerable effort into developing software maintainability prediction (SMP) models by relating software metrics to the maintainability of modules or classes. When software classes demanding high maintainability effort (HME) are fewer than low maintainability effort (LME) classes, the situation leads to imbalanced datasets for training the SMP models. The imbalanced class distribution in SMP datasets can be a dilemma for various ML techniques because, in the case of an imbalanced dataset, minority class instances are either misclassified by the ML techniques or discarded as noise. Recent developments in predictive modeling have ascertained that ensemble techniques can boost the performance of ML techniques by collating their predictions. Ensembles by themselves, however, do little to solve the class-imbalance problem. Aggregating ensemble techniques with techniques that handle the class-imbalance problem (e.g., data resampling) has therefore led to several proposals in research. This paper evaluates the performance of ensembles for class imbalance in the domain of SMP. Ensembles for the class-imbalance problem (ECIP) are modifications of ensembles that pre-process the imbalanced data using data resampling before the learning process. This study experimentally compares the performance of several ECIP using the performance metrics Balance and g-Mean over eight Apache software datasets. The results of the study advocate that, for imbalanced datasets, ECIP improve the performance of SMP models compared to classic ensembles.
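The ECIP pattern, resample before each member learns, then vote, can be sketched in miniature. This is a generic balanced-bagging illustration of the idea, not any specific algorithm evaluated in the paper; the member count and base learner are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BalancedBaggingSketch:
    """ECIP in miniature: each ensemble member undersamples the majority
    to the minority size before fitting, then members vote."""
    def __init__(self, n_members=11, seed=0):
        self.n_members = n_members
        self.rng = np.random.default_rng(seed)
        self.members = []

    def fit(self, X, y):
        min_cls = np.argmin(np.bincount(y))
        idx_min = np.where(y == min_cls)[0]
        idx_maj = np.where(y != min_cls)[0]
        for _ in range(self.n_members):
            # per-member resampling step -- the defining feature of ECIP
            pick = self.rng.choice(idx_maj, size=len(idx_min), replace=False)
            idx = np.concatenate([idx_min, pick])
            self.members.append(
                DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.members])
        return (votes.mean(axis=0) > 0.5).astype(int)
```

Because every member sees all minority instances but a different majority subset, the ensemble keeps diversity while no member is swamped by the majority class.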


2019 ◽  
Vol 24 (2) ◽  
pp. 104-110
Author(s):  
Duygu Sinanc Terzi ◽  
Seref Sagiroglu

Abstract The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims at modifying the existing dataset to increase classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy was designed to present the success of the model on datasets with different imbalance ratios. The second strategy was designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed other imbalanced big data solutions in the literature and increased area under the curve values by between 10% and 24% in the case study.
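The cluster-based resampling principle behind such designs can be shown on a single node (the MapReduce distribution of the paper is not reproduced; the cluster count and proportional quota rule are assumptions): cluster the majority class, then draw from every cluster in proportion to its size so the reduced majority still covers the original regions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_keep, n_clusters=5, seed=0):
    """Cluster the majority class, then sample from each cluster
    proportionally to its size, preserving coverage of all regions."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_maj)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # at least one sample per cluster, otherwise proportional quota
        quota = max(1, round(n_keep * len(idx) / len(X_maj)))
        keep.extend(rng.choice(idx, size=min(quota, len(idx)), replace=False))
    return X_maj[keep]
```

In a MapReduce setting each cluster's sampling step is an independent task, which is what makes this family of designs attractive for big data.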


2019 ◽  
Vol 8 (2) ◽  
pp. 2463-2468

Learning from class-imbalanced data is a challenging issue in the machine learning community, as classification algorithms are generally designed for balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques, undersampling and oversampling, are the more flexible and versatile. This paper introduces a new concept for undersampling based on the Center of Gravity principle, which helps to reduce the excess instances of the majority class. This work is suited to binary class problems. The proposed technique, CoGBUS, overcomes the class imbalance problem and achieves the best results in the study. F-Score, G-Mean, and ROC are used for the performance evaluation of the method.
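One plausible reading of centre-of-gravity undersampling can be sketched as follows; the paper's exact selection rule may differ, so treat this purely as an illustration of the principle of ranking majority instances by their distance to the class centroid.

```python
import numpy as np

def cog_undersample(X_maj, n_keep):
    """Rank majority points by distance to the class centre of gravity
    and keep the n_keep closest, discarding outlying 'excess' instances.
    (An illustrative interpretation, not necessarily the CoGBUS rule.)"""
    centroid = X_maj.mean(axis=0)          # centre of gravity of the class
    d = np.linalg.norm(X_maj - centroid, axis=1)
    return X_maj[np.argsort(d)[:n_keep]]
```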


Classification is a major task in Machine Learning generally, and a specific obstacle when tackling the class imbalance problem. A dataset is said to be imbalanced if the class we are interested in falls to the minority class and appears scanty compared to the majority class; the minority class is also known as the positive class, while the majority class is also known as the negative class. Class imbalance has been a major bottleneck for Machine Learning scientists, as it often leads to using the wrong model for a given purpose. This survey will lead researchers to choose the right model and the best strategies to handle imbalanced datasets in the course of tackling machine learning problems. Proper handling of a class-imbalanced dataset can lead to accurate and good results. Handling class-imbalanced data in a conventional manner, especially when the level of imbalance is high, may lead to the accuracy paradox (realizing, say, 99% accuracy during evaluation when the class distribution is highly imbalanced); hence an imbalanced class distribution requires special consideration. For this purpose, we deal extensively with handling and solving the imbalanced class problem in machine learning through the Data Sampling approach, the Cost-Sensitive Learning approach, and the Ensemble approach.
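Of the three approaches the survey names, cost-sensitive learning is the quickest to demonstrate: penalize minority misclassifications more heavily instead of resampling the data. A minimal sketch on synthetic data, using scikit-learn's inverse-frequency class weights as the cost mechanism:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# imbalanced toy data: 200 negatives vs 15 positives
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(1.5, 1, (15, 2))])
y = np.array([0] * 200 + [1] * 15)

plain = LogisticRegression().fit(X, y)
# cost-sensitive learning: misclassifying the minority costs more,
# expressed here through inverse-frequency class weights
costed = LogisticRegression(class_weight="balanced").fit(X, y)

recall_plain = (plain.predict(X)[y == 1] == 1).mean()
recall_cost = (costed.predict(X)[y == 1] == 1).mean()
```

The weighted model trades some overall accuracy for a much better minority recall, which is exactly the trade-off that makes raw accuracy a paradoxical metric on imbalanced data.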


2018 ◽  
Author(s):  
Rodrigo Moraes ◽  
João Francisco Valiati ◽  
Wilson Pires Gavião Neto

Many people make their opinions available on the Internet nowadays, and researchers have been proposing methods to automate the task of classifying textual reviews as positive or negative. Standard supervised learning techniques have been adopted to accomplish this task. In practice, positive reviews are abundant in comparison to negative ones. This context poses challenges to learning-based methods, and data undersampling/oversampling are popular preprocessing techniques for overcoming the problem. A combination of sampling techniques and learning methods, such as Artificial Neural Networks (ANN) or Support Vector Machines (SVM), has been successfully adopted as a classification approach in many areas, yet the sentiment classification literature has not explored ANN in studies that involve sampling methods to balance data. Even the performance of SVM, which is widely used as a sentiment learner, has rarely been addressed in the context of a preceding sampling method. This paper addresses document-level sentiment analysis with unbalanced data and focuses on empirically assessing the performance of ANN in the context of undersampling the (majority) set of positive reviews. We adopted the performance of SVM as a baseline, since some studies have indicated that SVM is less subject to the class imbalance problem. Results are produced in terms of a traditional bag-of-words model with popular feature selection and weighting methods. Our experiments indicated that SVM is more stable than ANN in highly unbalanced (80%) data scenarios. However, despite the information discarded by random undersampling, ANN outperform SVM or produce comparable results.
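The experimental setup, random undersampling of the majority followed by training both learner families, can be sketched as below. The feature matrix is a random stand-in for a bag-of-words sentiment matrix (80% positive), and the model hyperparameters are assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def random_undersample(X, y, seed=0):
    """Balance the data by randomly discarding majority samples (the
    preprocessing step paired with the ANN and SVM learners)."""
    rng = np.random.default_rng(seed)
    min_cls = np.argmin(np.bincount(y))
    idx_min = np.where(y == min_cls)[0]
    idx_maj = rng.choice(np.where(y != min_cls)[0], len(idx_min), replace=False)
    idx = np.concatenate([idx_min, idx_maj])
    return X[idx], y[idx]

# stand-in for a bag-of-words matrix: 80% positive reviews (class 0)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (160, 10)), rng.normal(1, 1, (40, 10))])
y = np.array([0] * 160 + [1] * 40)
X_b, y_b = random_undersample(X, y)

svm = SVC().fit(X_b, y_b)
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_b, y_b)
```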


2017 ◽  
Vol 26 (03) ◽  
pp. 1750009 ◽  
Author(s):  
Dionisios N. Sotiropoulos ◽  
George A. Tsihrintzis

This paper focuses on a special category of machine learning problems arising in cases where the set of available training instances is significantly biased towards a particular class of patterns. Our work addresses the so-called Class Imbalance Problem through the utilization of an Artificial Immune System (AIS)-based classification algorithm which encodes the inherent ability of the Adaptive Immune System to mediate the exceptionally imbalanced “self” / “non-self” discrimination process. From a computational point of view, this process constitutes an extremely imbalanced pattern classification task, since the vast majority of molecular patterns pertain to the “non-self” space. Our work focuses on investigating the effect of the class imbalance problem on the AIS-based classification algorithm by assessing its relative ability to deal with extremely skewed datasets when compared against two state-of-the-art machine learning paradigms, namely Support Vector Machines (SVMs) and Multi-Layer Perceptrons (MLPs). To this end, we conducted a series of experiments on a music-related dataset where a small fraction of positive samples was to be recognized against the vast volume of negative samples. The results obtained indicate that the bio-inspired classifier outperforms SVMs in detecting patterns from the minority class, while its performance on the same task is competently close to that exhibited by MLPs. Our findings suggest that the AIS-based classifier relies on its intrinsic resampling and class-balancing functionality to address the class imbalance problem.
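The "self"/"non-self" discrimination the abstract invokes is commonly illustrated with negative selection, the classic AIS mechanism: random detectors survive only if they do not match any "self" sample. This is a generic sketch of that principle, not the specific algorithm evaluated in the paper; the detector count, radius, and sampling box are all assumptions.

```python
import numpy as np

def train_detectors(X_self, n_detectors=200, radius=1.0, seed=0):
    """Negative selection: random candidate detectors are kept only if
    they do NOT lie within `radius` of any 'self' (majority) sample."""
    rng = np.random.default_rng(seed)
    lo, hi = X_self.min(0) - 4, X_self.max(0) + 4
    det = []
    while len(det) < n_detectors:
        c = rng.uniform(lo, hi)
        if np.linalg.norm(X_self - c, axis=1).min() > radius:
            det.append(c)
    return np.array(det)

def is_nonself(X, detectors, radius=1.0):
    """A point is flagged 'non-self' (minority) if any detector covers it."""
    d = np.linalg.norm(X[:, None, :] - detectors[None, :, :], axis=2)
    return d.min(axis=1) <= radius
```

Because detectors are forbidden from covering "self" space, the imbalance is handled structurally: the abundant class defines the region detectors avoid, and everything else is minority by construction.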

