COSM: Controlled Over-Sampling Method

2020 ◽  
Vol 8 (2) ◽  
pp. 42-51
Author(s):  
Gaetano Zazzaro

The class imbalance problem is widespread in Data Mining, and it can reduce the overall performance of a classification model. Many techniques have been proposed to overcome it, making it possible to train models that handle rare events. The methodology presented in this paper, called the Controlled Over-Sampling Method (COSM), includes a controller model that rejects new synthetic elements for which there is no certainty of belonging to the minority class. It combines the common Machine Learning holdout method with an oversampling algorithm, for example the classic SMOTE algorithm. The proposal explained and designed here represents a guideline for the application of oversampling algorithms, as well as a brief overview of techniques for overcoming the class imbalance problem in Data Mining.
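The controller idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact algorithm: the controller is assumed here to be a plain k-NN vote over the original data, and the `accept` threshold is a hypothetical parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_candidates(X_min, n_new, k=5):
    """Generate synthetic minority points by interpolating toward a
    random k-nearest minority neighbour (the classic SMOTE step)."""
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nn)
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(out)

def controlled_oversample(X, y, minority=1, n_new=20, k=5, accept=0.6):
    """COSM-style loop (a sketch, not the paper's exact controller):
    keep a synthetic point only if the controller -- here a plain k-NN
    vote over the original data -- assigns it to the minority class
    with at least `accept` confidence."""
    cand = smote_candidates(X[y == minority], n_new, k)
    kept = []
    for c in cand:
        d = np.linalg.norm(X - c, axis=1)
        votes = y[np.argsort(d)[:k]]
        if np.mean(votes == minority) >= accept:  # controller accepts
            kept.append(c)
    return np.array(kept)
```

Rejected candidates are simply discarded, so the output may contain fewer than `n_new` points; that is the point of the control step.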

Author(s):  
Hartono Hartono ◽  
Erianto Ongko

Class imbalance is one of the main problems in classification because the number of samples in the majority class far exceeds the number of samples in the minority class. The class imbalance problem in a multi-class dataset is much more difficult to handle than in a two-class dataset, and it becomes even more complicated when accompanied by overlapping. One method that has proven reliable in dealing with this problem is the Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) method, a hybrid approach that combines sampling and classifier ensembles; in terms of diversity among classifiers, such hybrid approaches give better results. HAR-MI delivers excellent results in handling multi-class imbalance and uses SMOTE to increase the number of samples in the minority class. However, SMOTE has a weakness: on an extremely imbalanced dataset with a large number of attributes, it leads to over-fitting. To overcome this over-fitting, the Hybrid Sampling method was proposed. Combining HAR-MI with Hybrid Sampling increases the number of samples in the minority class and at the same time reduces the number of noise samples in the majority class. The preprocessing stage of HAR-MI uses the Minimizing Overlapping Selection under Hybrid Sampling (MOSHS) method, and the processing stage uses Different Contribution Sampling. The results obtained are compared with results using Neighbourhood-based undersampling. Overlapping and classifier performance are measured using the Augmented R-Value, the Matthews Correlation Coefficient (MCC), Precision, Recall, and F-Value. The results show that HAR-MI with Hybrid Sampling gives better results in terms of Augmented R-Value, Precision, Recall, and F-Value.
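The core hybrid idea, oversampling the minority while cleaning noisy majority points in the overlapping region, can be sketched as follows. This is an illustrative two-step procedure in the spirit of the abstract, not the authors' MOSHS / Different Contribution Sampling pipeline; the neighbourhood rule and `k` are assumptions.

```python
import numpy as np

def hybrid_sample(X, y, minority=1, k=3, seed=0):
    """Illustrative hybrid step: (1) drop majority points whose
    neighbourhood is dominated by the minority class (likely noise in
    the overlapping region), then (2) oversample the minority by
    interpolation until the classes are balanced."""
    rng = np.random.default_rng(seed)
    keep = np.ones(len(y), bool)
    for i in np.where(y != minority)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        if np.mean(y[nn] == minority) > 0.5:  # overlapping majority point
            keep[i] = False
    X, y = X[keep], y[keep]
    X_min = X[y == minority]
    need = (y != minority).sum() - (y == minority).sum()
    for _ in range(max(need, 0)):
        i, j = rng.integers(len(X_min), size=2)
        X = np.vstack([X, X_min[i] + rng.random() * (X_min[j] - X_min[i])])
        y = np.append(y, minority)
    return X, y
```

The two directions of the hybrid step are what distinguish it from plain SMOTE: the majority side shrinks while the minority side grows.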


Author(s):  
Sayan Surya Shaw ◽  
Shameem Ahmed ◽  
Samir Malakar ◽  
Laura Garcia-Hernandez ◽  
Ajith Abraham ◽  
...  

Abstract Many real-life datasets are imbalanced in nature, which implies that the number of samples present in one class (the minority class) is exceptionally small compared to the number of samples in the other class (the majority class). Hence, if we directly fit these datasets to a standard classifier for training, it often overlooks the minority class samples while estimating the class-separating hyperplane(s), and as a result misclassifies the minority class samples. To solve this problem, over the years, many researchers have followed different approaches. However, the selection of truly representative samples from the majority class is still considered an open research problem. A better solution to this problem would be helpful in many applications such as fraud detection, disease prediction, and text classification. Recent studies also show that it is necessary to analyze not only the disproportion between classes but also other difficulties rooted in the nature of the data, which calls for a more flexible, self-adaptable, computationally efficient, real-time method for selecting majority class samples without losing much important information. Keeping this in mind, we have proposed a hybrid model combining Particle Swarm Optimization (PSO), a popular swarm intelligence-based meta-heuristic algorithm, and the Ring Theory (RT)-based Evolutionary Algorithm (RTEA), a recently proposed physics-based meta-heuristic algorithm. We have named the algorithm RT-based PSO, or RTPSO for short. RTPSO can select the most representative samples from the majority class, as it takes advantage of the efficient exploration and exploitation phases of its parent algorithms to strengthen the search process. We have used the AdaBoost classifier to observe the final classification results of our model. The effectiveness of our proposed method has been evaluated on 15 standard real-life datasets having low to extreme imbalance ratios. The performance of RTPSO has been compared with PSO, RTEA, and other standard undersampling methods. The obtained results demonstrate the superiority of RTPSO over the state-of-the-art class imbalance problem-solvers considered here for comparison. The source code of this work is available at https://github.com/Sayansurya/RTPSO_Class_imbalance.
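The idea of casting majority-sample selection as a swarm search can be sketched with plain binary PSO (the RT-based hybridization of the paper is not reproduced here, and the nearest-centroid fitness below is an assumption chosen for brevity).

```python
import numpy as np

def gmean_fitness(mask, X_maj, X_min, X_val, y_val):
    """Fitness of a majority-subset mask: G-mean of a nearest-centroid
    classifier trained on (selected majority + all minority) samples."""
    sel = mask.astype(bool)
    if not sel.any():
        return 0.0
    c0, c1 = X_maj[sel].mean(axis=0), X_min.mean(axis=0)
    pred = (np.linalg.norm(X_val - c1, axis=1)
            < np.linalg.norm(X_val - c0, axis=1)).astype(int)
    acc0 = (pred[y_val == 0] == 0).mean()
    acc1 = (pred[y_val == 1] == 1).mean()
    return np.sqrt(acc0 * acc1)

def binary_pso_undersample(X_maj, X_min, X_val, y_val,
                           n_particles=10, iters=20, seed=0):
    """Each particle is a bit mask over majority samples; velocities
    follow the usual PSO update and are squashed through a sigmoid to
    flip bits (standard binary PSO)."""
    rng = np.random.default_rng(seed)
    n = len(X_maj)
    pos = (rng.random((n_particles, n)) < 0.5).astype(float)
    vel = rng.normal(0, 1, (n_particles, n))
    pbest = pos.copy()
    pfit = np.array([gmean_fitness(p, X_maj, X_min, X_val, y_val) for p in pos])
    gbest, gfit = pbest[pfit.argmax()].copy(), pfit.max()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, n))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = (rng.random((n_particles, n)) < 1 / (1 + np.exp(-vel))).astype(float)
        fit = np.array([gmean_fitness(p, X_maj, X_min, X_val, y_val) for p in pos])
        better = fit > pfit
        pbest[better], pfit[better] = pos[better], fit[better]
        if fit.max() > gfit:
            gfit, gbest = fit.max(), pos[fit.argmax()].copy()
    return gbest.astype(bool)  # mask of retained majority samples
```

The returned mask plays the role of the "most representative samples" selection; in the paper the retained set would then feed an AdaBoost classifier.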


2021 ◽  
Vol 9 (1) ◽  
pp. 25
Author(s):  
Maulida Ayu Fitriani ◽  
Dany Candra Febrianto

Direct marketing is an effort made by a bank to increase sales of its products and services, but the bank sometimes has to contact a customer or prospective customer more than once to ascertain whether they are willing to subscribe to a product or service. To overcome this ineffective process, several data mining methods have been proposed. This study compares data mining methods such as Naïve Bayes, K-NN, Random Forest, SVM, J48, and AdaBoost J48, applying the SMOTE pre-processing technique before classification to eliminate the class imbalance problem in the Bank Marketing dataset. The SMOTE + Random Forest method produced the highest accuracy in this study, 92.61%.
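The SMOTE + Random Forest pipeline can be sketched as below. The data here is a synthetic stand-in for the Bank Marketing dataset, and SMOTE is a minimal NumPy re-implementation rather than a library call; note that oversampling is applied to the training split only, so the test split stays untouched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def smote_balance(X, y, minority=1, k=5, seed=0):
    """SMOTE-style balancing: interpolate between minority points and
    their nearest minority neighbours until the classes are equal."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    need = (y != minority).sum() - len(X_min)
    if need <= 0:
        return X, y
    synth = []
    for _ in range(need):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        j = rng.choice(np.argsort(d)[1:k + 1])
        synth.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return (np.vstack([X, np.array(synth)]),
            np.concatenate([y, np.full(need, minority)]))

# toy imbalanced problem standing in for the Bank Marketing data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(2, 1, (20, 4))])
y = np.array([0] * 200 + [1] * 20)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_bal, y_bal = smote_balance(X_tr, y_tr)   # oversample the training split only
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
acc = clf.score(X_te, y_te)
```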


Author(s):  
Ruchika Malhotra ◽  
Kusum Lata

To facilitate software maintenance and save maintenance cost, numerous machine learning (ML) techniques have been studied to predict the maintainability of software modules or classes. The research community has put considerable effort into developing software maintainability prediction (SMP) models by relating software metrics to the maintainability of modules or classes. When software classes demanding high maintainability effort (HME) are fewer than low maintainability effort (LME) classes, the situation leads to imbalanced datasets for training the SMP models. The imbalanced class distribution in SMP datasets can be a dilemma for various ML techniques because, in the case of an imbalanced dataset, minority class instances are either misclassified by the ML techniques or discarded as noise. Recent developments in predictive modeling have ascertained that ensemble techniques can boost the performance of ML techniques by collating their predictions. Ensembles by themselves, however, do little to solve the class-imbalance problem. Aggregating ensemble techniques with techniques that handle the class-imbalance problem (e.g., data resampling) has therefore led to several proposals in research. This paper evaluates the performance of ensembles for class imbalance in the domain of SMP. Ensembles for the class-imbalance problem (ECIP) are modifications of ensembles that pre-process the imbalanced data using data resampling before the learning process. This study experimentally compares the performance of several ECIP using the performance metrics Balance and g-Mean over eight Apache software datasets. The results of the study advocate that, for imbalanced datasets, ECIP improve the performance of SMP models compared to classic ensembles.
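The ECIP pattern, resample before each member learns, then vote, can be sketched in miniature. This is a generic balanced-bagging illustration of the idea, not any specific algorithm evaluated in the paper; the member count and base learner are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BalancedBaggingSketch:
    """ECIP in miniature: each ensemble member undersamples the majority
    to the minority size before fitting, then members vote."""
    def __init__(self, n_members=11, seed=0):
        self.n_members = n_members
        self.rng = np.random.default_rng(seed)
        self.members = []

    def fit(self, X, y):
        min_cls = np.argmin(np.bincount(y))
        idx_min = np.where(y == min_cls)[0]
        idx_maj = np.where(y != min_cls)[0]
        for _ in range(self.n_members):
            # per-member resampling step -- the defining feature of ECIP
            pick = self.rng.choice(idx_maj, size=len(idx_min), replace=False)
            idx = np.concatenate([idx_min, pick])
            self.members.append(
                DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.members])
        return (votes.mean(axis=0) > 0.5).astype(int)
```

Because every member sees all minority instances but a different majority subset, the ensemble keeps diversity while no member is swamped by the majority class.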


2019 ◽  
Vol 24 (2) ◽  
pp. 104-110
Author(s):  
Duygu Sinanc Terzi ◽  
Seref Sagiroglu

Abstract The class imbalance problem, one of the common data irregularities, causes the development of under-represented models. To resolve this issue, the present study proposes a new cluster-based MapReduce design, entitled Distributed Cluster-based Resampling for Imbalanced Big Data (DIBID). The design aims at modifying the existing dataset to increase classification success. Within the study, DIBID has been implemented on public datasets under two strategies. The first strategy was designed to present the success of the model on datasets with different imbalance ratios. The second strategy was designed to compare the success of the model with other imbalanced big data solutions in the literature. According to the results, DIBID outperformed other imbalanced big data solutions in the literature and increased area under the curve values by between 10% and 24% in the case study.
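The cluster-based resampling principle behind such designs can be shown on a single node (the MapReduce distribution of the paper is not reproduced; the cluster count and proportional quota rule are assumptions): cluster the majority class, then draw from every cluster in proportion to its size so the reduced majority still covers the original regions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_keep, n_clusters=5, seed=0):
    """Cluster the majority class, then sample from each cluster
    proportionally to its size, preserving coverage of all regions."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_maj)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # at least one sample per cluster, otherwise proportional quota
        quota = max(1, round(n_keep * len(idx) / len(X_maj)))
        keep.extend(rng.choice(idx, size=min(quota, len(idx)), replace=False))
    return X_maj[keep]
```

In a MapReduce setting each cluster's sampling step is an independent task, which is what makes this family of designs attractive for big data.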


2019 ◽  
Vol 8 (2) ◽  
pp. 2463-2468

Learning from class-imbalanced data is a challenging issue in the machine learning community, as classification algorithms are generally designed for balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques, undersampling and oversampling, are the more flexible and versatile. This paper introduces a new concept for undersampling based on the Center of Gravity principle, which helps to reduce the excess instances of the majority class. This work is suited to binary class problems. The proposed technique, CoGBUS, overcomes the class imbalance problem and achieves the best results in the study. F-Score, G-Mean, and ROC are used for the performance evaluation of the method.
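One plausible reading of centre-of-gravity undersampling can be sketched as follows; the paper's exact selection rule may differ, so treat this purely as an illustration of the principle of ranking majority instances by their distance to the class centroid.

```python
import numpy as np

def cog_undersample(X_maj, n_keep):
    """Rank majority points by distance to the class centre of gravity
    and keep the n_keep closest, discarding outlying 'excess' instances.
    (An illustrative interpretation, not necessarily the CoGBUS rule.)"""
    centroid = X_maj.mean(axis=0)          # centre of gravity of the class
    d = np.linalg.norm(X_maj - centroid, axis=1)
    return X_maj[np.argsort(d)[:n_keep]]
```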


Classification is a major task in Machine Learning generally, and a specific obstacle when tackling the class imbalance problem. A dataset is said to be imbalanced if the class we are interested in falls to the minority class and appears scanty compared to the majority class; the minority class is also known as the positive class, while the majority class is also known as the negative class. Class imbalance has been a major bottleneck for Machine Learning scientists, as it often leads to using the wrong model for a given purpose. This survey will lead researchers to choose the right model and the best strategies to handle imbalanced datasets in the course of tackling machine learning problems. Proper handling of a class-imbalanced dataset can lead to accurate and good results. Handling class-imbalanced data in a conventional manner, especially when the level of imbalance is high, may lead to the accuracy paradox (realizing, say, 99% accuracy during evaluation when the class distribution is highly imbalanced); hence an imbalanced class distribution requires special consideration. For this purpose, we deal extensively with handling and solving the imbalanced class problem in machine learning through the Data Sampling approach, the Cost-Sensitive Learning approach, and the Ensemble approach.
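Of the three approaches the survey names, cost-sensitive learning is the quickest to demonstrate: penalize minority misclassifications more heavily instead of resampling the data. A minimal sketch on synthetic data, using scikit-learn's inverse-frequency class weights as the cost mechanism:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# imbalanced toy data: 200 negatives vs 15 positives
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(1.5, 1, (15, 2))])
y = np.array([0] * 200 + [1] * 15)

plain = LogisticRegression().fit(X, y)
# cost-sensitive learning: misclassifying the minority costs more,
# expressed here through inverse-frequency class weights
costed = LogisticRegression(class_weight="balanced").fit(X, y)

recall_plain = (plain.predict(X)[y == 1] == 1).mean()
recall_cost = (costed.predict(X)[y == 1] == 1).mean()
```

The weighted model trades some overall accuracy for a much better minority recall, which is exactly the trade-off that makes raw accuracy a paradoxical metric on imbalanced data.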


2018 ◽  
Author(s):  
Rodrigo Moraes ◽  
João Francisco Valiati ◽  
Wilson Pires Gavião Neto

Many people make their opinions available on the Internet nowadays, and researchers have been proposing methods to automate the task of classifying textual reviews as positive or negative. Standard supervised learning techniques have been adopted to accomplish this task. In practice, positive reviews are abundant in comparison to negative ones. This context poses challenges to learning-based methods, and data undersampling/oversampling are popular preprocessing techniques for overcoming the problem. A combination of sampling techniques and learning methods, such as Artificial Neural Networks (ANN) or Support Vector Machines (SVM), has been successfully adopted as a classification approach in many areas, yet the sentiment classification literature has not explored ANN in studies that involve sampling methods to balance data. Even the performance of SVM, which is widely used as a sentiment learner, has rarely been addressed in the context of a preceding sampling method. This paper addresses document-level sentiment analysis with unbalanced data and focuses on empirically assessing the performance of ANN in the context of undersampling the (majority) set of positive reviews. We adopted the performance of SVM as a baseline, since some studies have indicated that SVM is less subject to the class imbalance problem. Results are produced in terms of a traditional bag-of-words model with popular feature selection and weighting methods. Our experiments indicated that SVM is more stable than ANN in highly unbalanced (80%) data scenarios. However, despite the information discarded by random undersampling, ANN outperform SVM or produce comparable results.
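The experimental setup, random undersampling of the majority followed by training both learner families, can be sketched as below. The feature matrix is a random stand-in for a bag-of-words sentiment matrix (80% positive), and the model hyperparameters are assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def random_undersample(X, y, seed=0):
    """Balance the data by randomly discarding majority samples (the
    preprocessing step paired with the ANN and SVM learners)."""
    rng = np.random.default_rng(seed)
    min_cls = np.argmin(np.bincount(y))
    idx_min = np.where(y == min_cls)[0]
    idx_maj = rng.choice(np.where(y != min_cls)[0], len(idx_min), replace=False)
    idx = np.concatenate([idx_min, idx_maj])
    return X[idx], y[idx]

# stand-in for a bag-of-words matrix: 80% positive reviews (class 0)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (160, 10)), rng.normal(1, 1, (40, 10))])
y = np.array([0] * 160 + [1] * 40)
X_b, y_b = random_undersample(X, y)

svm = SVC().fit(X_b, y_b)
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_b, y_b)
```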


2017 ◽  
Vol 26 (03) ◽  
pp. 1750009 ◽  
Author(s):  
Dionisios N. Sotiropoulos ◽  
George A. Tsihrintzis

This paper focuses on a special category of machine learning problems arising in cases where the set of available training instances is significantly biased towards a particular class of patterns. Our work addresses the so-called Class Imbalance Problem through the utilization of an Artificial Immune System (AIS)-based classification algorithm which encodes the inherent ability of the Adaptive Immune System to mediate the exceptionally imbalanced “self” / “non-self” discrimination process. From a computational point of view, this process constitutes an extremely imbalanced pattern classification task, since the vast majority of molecular patterns pertain to the “non-self” space. Our work focuses on investigating the effect of the class imbalance problem on the AIS-based classification algorithm by assessing its relative ability to deal with extremely skewed datasets when compared against two state-of-the-art machine learning paradigms, namely Support Vector Machines (SVMs) and Multi-Layer Perceptrons (MLPs). To this end, we conducted a series of experiments on a music-related dataset where a small fraction of positive samples was to be recognized against the vast volume of negative samples. The results obtained indicate that the bio-inspired classifier outperforms SVMs in detecting patterns from the minority class, while its performance on the same task is competently close to that exhibited by MLPs. Our findings suggest that the AIS-based classifier relies on its intrinsic resampling and class-balancing functionality to address the class imbalance problem.
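The "self"/"non-self" discrimination the abstract invokes is commonly illustrated with negative selection, the classic AIS mechanism: random detectors survive only if they do not match any "self" sample. This is a generic sketch of that principle, not the specific algorithm evaluated in the paper; the detector count, radius, and sampling box are all assumptions.

```python
import numpy as np

def train_detectors(X_self, n_detectors=200, radius=1.0, seed=0):
    """Negative selection: random candidate detectors are kept only if
    they do NOT lie within `radius` of any 'self' (majority) sample."""
    rng = np.random.default_rng(seed)
    lo, hi = X_self.min(0) - 4, X_self.max(0) + 4
    det = []
    while len(det) < n_detectors:
        c = rng.uniform(lo, hi)
        if np.linalg.norm(X_self - c, axis=1).min() > radius:
            det.append(c)
    return np.array(det)

def is_nonself(X, detectors, radius=1.0):
    """A point is flagged 'non-self' (minority) if any detector covers it."""
    d = np.linalg.norm(X[:, None, :] - detectors[None, :, :], axis=2)
    return d.min(axis=1) <= radius
```

Because detectors are forbidden from covering "self" space, the imbalance is handled structurally: the abundant class defines the region detectors avoid, and everything else is minority by construction.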

