Probability Density Machine: A New Solution of Class Imbalance Learning

Scientific Programming ◽

10.1155/2021/7555587 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Ruihan Cheng ◽

Longfei Zhang ◽

Shiqi Wu ◽

Sen Xu ◽

Shang Gao ◽

...

Keyword(s):

Probability Density ◽

Predictive Model ◽

Data Distribution ◽

Class Imbalance ◽

Imbalanced Data ◽

Training Data ◽

Probability Density Estimation ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

The Impact

Class imbalance learning (CIL) is an important branch of machine learning because classification models generally struggle to learn from imbalanced data, while skewed data distributions occur frequently in real-world applications. In this paper, we introduce a novel CIL solution called the Probability Density Machine (PDM). First, in the context of the Gaussian Naive Bayes (GNB) predictive model, we analyze theoretically why an imbalanced data distribution degrades predictive performance, and we conclude that the impact of class imbalance is associated only with the prior probability and not with the conditional probability of the training data. Then, in this context, we show the rationality of several traditional CIL techniques and point out the drawback of combining GNB with them. Next, drawing on the idea of K-nearest neighbors probability density estimation (KNN-PDE), we propose the PDM, an improved GNB-based CIL algorithm. Finally, we conduct experiments on a large number of class-imbalanced data sets, on which the proposed PDM algorithm shows promising results.
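The claim that class imbalance acts through the prior rather than the class-conditional density can be illustrated with a minimal sketch. It assumes one-dimensional Gaussian class-conditional densities with hypothetical means and standard deviation, and it shows plain Bayes' rule only; it is not the PDM algorithm, and the function names are illustrative.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_minority(x, prior_minority, mean_min=1.0, mean_maj=-1.0, std=1.0):
    """P(minority | x) via Bayes' rule with Gaussian likelihoods.

    The means and std are hypothetical values chosen for illustration.
    """
    like_min = gaussian_pdf(x, mean_min, std)
    like_maj = gaussian_pdf(x, mean_maj, std)
    numerator = prior_minority * like_min
    return numerator / (numerator + (1.0 - prior_minority) * like_maj)

x = 0.5  # a point closer to the minority mean than to the majority mean

# Same likelihoods in both calls; only the prior changes.
balanced = posterior_minority(x, prior_minority=0.5)
skewed = posterior_minority(x, prior_minority=0.05)

print(balanced > 0.5)  # True: assigned to the minority class
print(skewed > 0.5)    # False: the skewed prior flips the decision
```

With identical likelihoods, changing only the prior from 0.5 to 0.05 moves this point from the minority class to the majority class.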

An Empirical Study of Boosting Methods on Severely Imbalanced Data

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.513-517.2510 ◽

2014 ◽

Vol 513-517 ◽

pp. 2510-2513 ◽

Cited By ~ 1

Author(s):

Xu Ying Liu

Keyword(s):

Empirical Study ◽

Real World ◽

Class Imbalance ◽

Imbalanced Data ◽

Real World Applications ◽

Under Sampling ◽

The Difference ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

F Measure

Nowadays, real-world applications involve large volumes of data, which pose great challenges to class-imbalance learning: a large number of majority class examples and severe class-imbalance. Previous studies on class-imbalance learning mainly focused on relatively small or moderate class-imbalance. In this paper, we conduct an empirical study to explore the difference between learning with small or moderate class-imbalance and learning with severe class-imbalance. The experimental results show that: (1) Traditional methods cannot handle severe class-imbalance effectively. (2) AUC, G-mean, and F-measure can be very inconsistent under severe class-imbalance, which seldom happens when class-imbalance is moderate; G-mean is not appropriate for severe class-imbalance learning because it is not sensitive to changes in the imbalance ratio. (3) When AUC and G-mean are the evaluation metrics, EasyEnsemble is the best method, followed by BalanceCascade and under-sampling. (4) For under-sampling, stopping slightly short of full balance handles severe class-imbalance better, and it is important to handle false positives when designing methods for severe class-imbalance.
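The inconsistency between G-mean and F-measure under severe imbalance can be seen from their standard definitions. The sketch below uses a hypothetical confusion matrix (not data from the study) with the minority class treated as positive.

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

def f_measure(tp, fn, fp):
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical counts: 10 minority and 10,000 majority examples.
tp, fn = 9, 1          # 90% of minority examples are found
tn, fp = 9500, 500     # 5% of majority examples are false positives

gm = g_mean(tp, fn, tn, fp)
f1 = f_measure(tp, fn, fp)
print(round(gm, 3), round(f1, 3))
```

Here G-mean stays high because both per-class rates are high, while F-measure collapses because the 500 false positives swamp the 9 true positives. That is one way the metrics can disagree when the imbalance ratio is large.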

A Multi-Objective Ensemble Method for Class Imbalance Learning

International Journal of Big Data and Analytics in Healthcare ◽

10.4018/ijbdah.2017010102 ◽

2017 ◽

Vol 2 (1) ◽

pp. 16-34

Author(s):

Sajad Emamipour ◽

Rasoul Sali ◽

Zahra Yousefi

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Classification Performance ◽

Ensemble Classifiers ◽

Feature Selection Technique ◽

Multi Objective ◽

Proposed Model ◽

Training Examples ◽

Imbalance Learning ◽

Class Imbalance Learning

This article describes how class imbalance learning has attracted great attention in recent years, as many real-world applications suffer from this problem. An imbalanced class distribution occurs when the number of training examples for one class far surpasses that of the other class, which is often the one of more interest. This problem can substantially degrade classifier performance, particularly on patterns belonging to the less represented classes. To address it, the authors developed a hybrid model for class imbalance learning, focusing on binary-class problems. The model combines the benefits of ensemble classifiers with a multi-objective feature selection technique to achieve higher classification performance, and it also proposes non-dominated sets of features. The authors evaluate the proposed model by comparing its results with those of notable algorithms for the imbalanced data problem. Finally, they apply the model in the medical domain to predict the post-operative life expectancy of thoracic surgery patients.
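The idea of a non-dominated set of feature subsets can be sketched as a Pareto filter. The two objectives assumed here (classification error and number of features, both minimized) and the candidate values are hypothetical; the abstract does not state which objectives the authors use.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def non_dominated(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, objectives in candidates.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]

# Hypothetical feature subsets: (error rate, number of features).
candidates = {
    "A": (0.20, 3),
    "B": (0.15, 5),
    "C": (0.25, 4),  # worse than A on both objectives
    "D": (0.15, 7),  # same error as B but more features
}

front = sorted(non_dominated(candidates))
print(front)  # ['A', 'B']
```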

VFC-SMOTE: very fast continuous synthetic minority oversampling for evolving data streams

Data Mining and Knowledge Discovery ◽

10.1007/s10618-021-00786-0 ◽

2021 ◽

Author(s):

Alessio Bernardo ◽

Emanuele Della Valle

Keyword(s):

Data Streams ◽

Concept Drift ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Minority Class ◽

Machine Learning Classification ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

Better Than

The world is constantly changing, and so is the massive amount of data it produces. However, only a few studies deal with online class imbalance learning, which combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy that can be prepended to any streaming machine learning classification algorithm and that oversamples the minority class using new versions of SMOTE and Borderline-SMOTE inspired by data sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We provide statistical evidence that VFC-SMOTE pipelines learn models whose minority class performance is better than that of state-of-the-art methods. Moreover, we analyze the time and memory consumption and the concept drift recovery speed.
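For context, the interpolation step at the heart of classic (batch) SMOTE can be sketched as follows. This is not VFC-SMOTE: it has no streaming, sketching, or drift handling, and the function name, the tiny data set, and the choice of k are illustrative assumptions.

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic minority point by interpolating between a
    random minority point and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    base_index = rng.randrange(len(minority))
    base = minority[base_index]

    def squared_distance(p):
        return sum((a - b) ** 2 for a, b in zip(base, p))

    # Nearest minority neighbors of the base point, excluding itself.
    others = [p for i, p in enumerate(minority) if i != base_index]
    neighbors = sorted(others, key=squared_distance)[:k]

    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment base -> neighbor
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbor))

# Hypothetical two-dimensional minority examples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority points, so it stays inside the region the minority class already occupies.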

Early Warning of Gas Concentration in Coal Mines Production Based on Probability Density Machine

Sensors ◽

10.3390/s21175730 ◽

2021 ◽

Vol 21 (17) ◽

pp. 5730

Author(s):

Yadong Cai ◽

Shiqi Wu ◽

Ming Zhou ◽

Shang Gao ◽

Hualong Yu

Keyword(s):

Probability Density ◽

Coal Mine ◽

Early Warning ◽

Learning Algorithms ◽

Class Imbalance ◽

Gas Explosion ◽

Concentration Data ◽

Gas Concentration ◽

Imbalance Learning ◽

Class Imbalance Learning

Gas explosions have always been an important factor restricting coal mine production safety. Applying machine learning techniques to gas concentration prediction and early warning in coal mines can effectively prevent gas explosion accidents. Nearly all traditional prediction models use a regression technique to predict gas concentration. Because there are very few instances of high gas concentration, the instance distribution is extremely imbalanced, so such regression models generally perform poorly in predicting high gas concentration instances. In this study, we treat early warning of gas concentration as a binary-class problem and divide gas concentration data into a warning class and a non-warning class according to a concentration threshold. We propose the probability density machine (PDM) algorithm, which adapts well to imbalanced data distributions. We use the original gas concentration data collected from several monitoring points in a coal mine in Datong city, Shanxi Province, China, to train the PDM model and to compare it with several class imbalance learning algorithms. The results show that the PDM algorithm is superior to both traditional and state-of-the-art class imbalance learning algorithms and can produce more accurate early warning results for gas explosions.
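Recasting the regression task as a binary-class problem amounts to thresholding each reading. The sketch below uses hypothetical readings and a hypothetical threshold; whether the boundary value itself counts as a warning is also an assumption, since the abstract does not say.

```python
def to_warning_labels(readings, threshold):
    """Label each reading 1 (warning) if it reaches the threshold, else 0.

    Treating a reading equal to the threshold as a warning is an assumption.
    """
    return [1 if r >= threshold else 0 for r in readings]

# Hypothetical gas concentration readings and threshold (illustrative units).
readings = [0.21, 0.35, 0.18, 0.92, 0.27, 0.30, 0.25, 1.10, 0.22, 0.19]
threshold = 0.8

labels = to_warning_labels(readings, threshold)
n_warning = sum(labels)
n_normal = len(labels) - n_warning
print(labels)
print(n_normal / n_warning)  # imbalance ratio, non-warning : warning
```

Even in this small made-up sample, warning instances are the minority, which is the imbalance that motivates using a class imbalance learning algorithm.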

Selecting the Optimal Combination Model of FSSVM for the Imbalance Datasets

Mathematical Problems in Engineering ◽

10.1155/2014/539430 ◽

2014 ◽

Vol 2014 ◽

pp. 1-6

Author(s):

Chuandong Qin ◽

Huixia Zhao

Keyword(s):

Support Vector Machines ◽

Class Imbalance ◽

Imbalanced Data ◽

Support Vector ◽

Smooth Functions ◽

Separating Hyperplane ◽

Vector Machines ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

Imbalanced Data Learning

Imbalanced data learning is one of the most active and important fields in machine learning research. Although existing class imbalance learning methods can make Support Vector Machines (SVMs) less sensitive to class imbalance, they still suffer from the disturbance of outliers and noise present in the datasets. A family of Fuzzy Smooth Support Vector Machines (FSSVMs) is proposed based on the Smooth Support Vector Machine (SSVM) of O. L. Mangasarian. The SSVM can be solved easily with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm or the Newton-Armijo algorithm. Two kinds of fuzzy memberships and three smooth functions can be chosen in the algorithms. The fuzzy memberships account for the contribution of each sample to the optimal separating hyperplane. The polynomial smooth functions make the optimization problem more accurate at the inflection point. These changes have a positive effect in the trials. The experimental results show that the FSSVMs achieve better accuracy in less time than the SSVMs and some of the other methods.

Feature Ranking and Screening for Class-Imbalanced Metabolomics Data Based on Rank Aggregation Coupled with Re-Balance

Metabolites ◽

10.3390/metabo11060389 ◽

2021 ◽

Vol 11 (6) ◽

pp. 389

Author(s):

Guang-Hui Fu ◽

Jia-Bao Wang ◽

Min-Jie Zong ◽

Lun-Zhao Yi

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Rank Aggregation ◽

Screening Methods ◽

Metabolomics Data ◽

Feature Screening ◽

Sampling Procedures ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

Filtering Techniques

Feature screening is an important and challenging topic in current class-imbalance learning. Most of the existing feature screening algorithms in class-imbalance learning are based on filtering techniques. However, the variable rankings obtained by various filtering techniques are generally different, and this inconsistency among different variable ranking methods is usually ignored in practice. To address this problem, we propose a simple strategy called rank aggregation with re-balance (RAR) for finding key variables from class-imbalanced data. RAR fuses the individual rankings into a synthetic ranking that takes every ranking into account. The class-imbalanced data are modified via different re-sampling procedures, and RAR is performed in this balanced situation. Five class-imbalanced real datasets and their re-balanced versions are employed to test RAR's performance, and RAR is compared with several popular feature screening methods. The results show that RAR is highly competitive and almost always better than screening with a single filter in terms of several assessment metrics. Re-balancing pretreatment is highly effective in rank aggregation when the data are class-imbalanced.
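Fusing several rankings into one can be sketched with a simple mean-rank (Borda-style) aggregation. The aggregation rule, the function name, and the feature names are assumptions for illustration; the abstract does not specify how RAR fuses the rankings, and the re-balancing step is omitted here.

```python
def aggregate_ranks(rankings):
    """Fuse several best-first rankings of the same features by mean position.

    Ties in mean position are broken alphabetically so the result is stable.
    """
    features = rankings[0]
    mean_position = {
        f: sum(r.index(f) for r in rankings) / len(rankings) for f in features
    }
    return sorted(features, key=lambda f: (mean_position[f], f))

# Hypothetical rankings of four metabolite features from three filters.
rankings = [
    ["m1", "m2", "m3", "m4"],
    ["m2", "m1", "m4", "m3"],
    ["m1", "m3", "m2", "m4"],
]

fused = aggregate_ranks(rankings)
print(fused)  # ['m1', 'm2', 'm3', 'm4']
```

The three filters disagree on the order, but the fused ranking reflects each feature's average standing across all of them.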

Class Imbalance Learning in Data Mining – A Survey

International Journal of Communication Technology for Social Networking Services ◽

10.21742/ijctsns.2015.3.2.02 ◽

2015 ◽

Vol 3 (2) ◽

pp. 17-36 ◽

Cited By ~ 1

Author(s):

Ali Mirza Mahmood ◽

Keyword(s):

Data Mining ◽

Class Imbalance ◽

Imbalance Learning ◽

Class Imbalance Learning

Categorizing the feature space for two-class imbalance learning

2020 25th International Conference on Pattern Recognition (ICPR) ◽

10.1109/icpr48806.2021.9413015 ◽

2021 ◽

Author(s):

Rosa Sicilia ◽

Ermanno Cordelli ◽

Paolo Soda

Keyword(s):

Class Imbalance ◽

Feature Space ◽

Imbalance Learning ◽

Class Imbalance Learning

Scalable kernel-based SVM classification algorithm on imbalance air quality data for proficient healthcare

Complex & Intelligent Systems ◽

10.1007/s40747-021-00435-5 ◽

2021 ◽

Author(s):

Shwet Ketu ◽

Pramod Kumar Mishra

Keyword(s):

Air Pollution ◽

Air Quality ◽

Class Imbalance ◽

Imbalanced Data ◽

Classification Algorithm ◽

Quality Data ◽

Pollution Level ◽

Classification Problems ◽

Chi Square ◽

The Impact

In the last decade, we have seen drastic changes in the air pollution level, which has become a critical environmental issue that must be handled carefully to enable proficient healthcare. Reducing the impact of air pollution on human health is possible only if the data are correctly classified. Many classification problems suffer from class imbalance. Learning from imbalanced data is a challenging task, and researchers have developed various solutions over time. In this paper, we focus on dealing with the imbalanced class distribution in such a way that the classification algorithm does not compromise its performance. The proposed algorithm is based on the adjusting kernel scaling (AKS) method and handles multi-class imbalanced datasets. The selection of the kernel function has been evaluated with the help of weighting criteria and the chi-square test. All experimental evaluation has been performed on the sensor-based Indian Central Pollution Control Board (CPCB) dataset. The proposed algorithm achieves the highest accuracy, 99.66%, among the compared classification algorithms: AdaBoost (59.72%), Multi-Layer Perceptron (95.71%), GaussianNB (80.87%), and SVM (96.92%). Its results are also better than those of existing methods in the literature. These results indicate that the proposed algorithm deals efficiently with class imbalance problems while improving performance. Thus, accurate classification of air quality with the proposed algorithm will be useful for improving existing preventive policies and for enhancing effective emergency response in the worst pollution situations.

A New Performance Measure for Class Imbalance Learning. Application to Bioinformatics Problems

2009 International Conference on Machine Learning and Applications ◽

10.1109/icmla.2009.126 ◽

2009 ◽

Cited By ~ 16

Author(s):

Rukshan Batuwita ◽

Vasile Palade

Keyword(s):

Class Imbalance ◽

Performance Measure ◽

Imbalance Learning ◽

Class Imbalance Learning
