Semi-supervised Classification Based Mixed Sampling for Imbalanced Data

Jianhua Zhao; Ning Liu

doi:10.1515/phys-2019-0103

Semi-supervised Classification Based Mixed Sampling for Imbalanced Data

Open Physics ◽

10.1515/phys-2019-0103 ◽

2019 ◽

Vol 17 (1) ◽

pp. 975-983

Author(s):

Jianhua Zhao ◽

Ning Liu

Keyword(s):

Supervised Learning ◽

Learning Algorithm ◽

Imbalanced Data ◽

Classification Performance ◽

Experimental Result ◽

Data Set ◽

Sampling Algorithm ◽

Unlabeled Sample ◽

Imbalanced Data Classification ◽

Under Sampling

Abstract: In practical applications, many imbalanced data sets contain only a small number of labeled samples. To improve classification performance on this kind of problem, this paper proposes a semi-supervised learning algorithm based on mixed sampling for imbalanced data classification (S2MAID), which combines semi-supervised learning, over-sampling, under-sampling and ensemble learning. First, an under-sampling algorithm, UD-density, selects samples with high information content from the majority class set for semi-supervised learning. Second, a safe supervised-learning method labels the unlabeled samples and expands the labeled set. Third, an over-sampling algorithm, SMOTE-density, balances the imbalanced data set. Fourth, an ensemble technique generates a strong classifier. Finally, experiments are carried out on imbalanced data containing only a few labeled samples, simulating the semi-supervised learning process. The experimental results show that the proposed S2MAID achieves better classification performance.
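
The over-sampling step can be illustrated with a minimal sketch of plain SMOTE interpolation. This is not the paper's SMOTE-density variant, whose details are not given in the abstract; the function name `smote`, the parameter `k` and the toy data are assumptions made for illustration only.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical toy data: 4 minority points and 12 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i), float(i % 5)) for i in range(3, 15)]

# Add enough synthetic minority points to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Every synthetic point lies on a segment between two existing minority samples, so the minority region is filled in rather than extended.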

An Under-Sampling Method with Support Vectors in Multi-class Imbalanced Data Classification

2019 13th International Conference on Software, Knowledge, Information Management and Applications (SKIMA) ◽

10.1109/skima47702.2019.8982391 ◽

2019 ◽

Cited By ~ 1

Author(s):

Md. Yasir Arafat ◽

Sabera Hoque ◽

Shuxiang Xu ◽

Dewan Md. Farid

Keyword(s):

Sampling Method ◽

Imbalanced Data ◽

Data Classification ◽

Support Vectors ◽

Imbalanced Data Classification ◽

Under Sampling

A Cluster-Based Under-Sampling Algorithm for Class-Imbalanced Data

Lecture Notes in Computer Science - Hybrid Artificial Intelligent Systems ◽

10.1007/978-3-030-61705-9_25 ◽

2020 ◽

pp. 299-311

Author(s):

A. Guzmán-Ponce ◽

R. M. Valdovinos ◽

J. S. Sánchez

Keyword(s):

Imbalanced Data ◽

Sampling Algorithm ◽

Under Sampling

An under-sampling technique for imbalanced data classification based on DBSCAN algorithm

2020 8th Iranian Joint Congress on Fuzzy and intelligent Systems (CFIS) ◽

10.1109/cfis49607.2020.9238718 ◽

2020 ◽

Author(s):

Behzad Mirzaei ◽

Bahareh Nikpour ◽

Hossein Nezamabadi-Pour

Keyword(s):

Imbalanced Data ◽

Data Classification ◽

Sampling Technique ◽

Dbscan Algorithm ◽

Imbalanced Data Classification ◽

Under Sampling

Early prediction of chronic disease using an efficient machine learning algorithm through adaptive probabilistic divergence based feature selection approach

International Journal of Pervasive Computing and Communications ◽

10.1108/ijpcc-04-2020-0018 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Cited By ~ 2

Author(s):

Sandeepkumar Hegde ◽

Monica R. Mundada

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Chronic Disease ◽

Learning Algorithm ◽

Predictive Ability ◽

Experimental Result ◽

Data Set ◽

Content Type ◽

Learning Classifier ◽

Feature Selection Approach

Purpose: According to the World Health Organization, by 2025 the contribution of chronic disease to all deaths is expected to rise by 73%, and chronic disease is considered a global burden of disease with a rate of 60%. These diseases persist for a long time, are almost incurable and can only be controlled. Cardiovascular disease, chronic kidney disease (CKD) and diabetes mellitus are considered the three major chronic diseases whose risk increases as adults get older, and CKD is considered the major one among them. Overall, 10% of the world's population is affected by CKD, and this figure is likely to double by 2030. The paper aims to propose a novel feature selection approach that, in combination with a machine-learning algorithm, can predict chronic disease early with the utmost accuracy. Hence, a novel adaptive probabilistic divergence-based feature selection (APDFS) algorithm is proposed in combination with the hyper-parameterized logistic regression model (HLRM) for the early prediction of chronic disease.

Design/methodology/approach: A novel feature selection algorithm, APDFS, is proposed that explicitly handles the features associated with the class label through relevance and redundancy analysis. The algorithm applies statistical divergence-based information theory to identify the relationship between distant features of the chronic disease data set. The data set used in the experiments was obtained from several medical labs and hospitals in India. The HLRM is used as the machine-learning classifier. The predictive ability of the framework is compared with various algorithms and across various chronic disease data sets. The experimental results illustrate that the proposed framework is efficient and achieves competitive results compared with existing work in most cases.

Findings: The performance of the proposed framework is validated using metrics such as recall, precision, F1 measure and ROC. Its predictive performance is analyzed on data sets belonging to various chronic diseases such as CKD, diabetes and heart disease. The diagnostic ability of the proposed approach is demonstrated by comparing its results with those of existing algorithms. The experimental figures illustrate that the proposed framework performed exceptionally well in the early prediction of CKD, with an accuracy of 91.6.

Originality/value: The capability of machine learning algorithms depends on feature selection (FS) algorithms identifying the relevant traits in the data set, which affect the predictive result. Feature selection is the process of choosing the relevant features from the data set by removing redundant and irrelevant ones. Although many approaches have already been proposed toward this objective, they are computationally complex because they follow a one-step scheme in selecting the features. In this paper, a novel feature selection algorithm, APDFS, is proposed that explicitly handles the features associated with the class label through relevance and redundancy analysis. The proposed algorithm handles the process of feature selection in two separate indices. Hence, the computational complexity of the algorithm is reduced to O(nk+1). The algorithm applies statistical divergence-based information theory to identify the relationship between distant features of the chronic disease data set. The data set used in the experiments was obtained from several medical labs and hospitals of Karkala taluk, India. The HLRM is used as the machine learning classifier. The predictive ability of the framework is compared with various algorithms and across various chronic disease data sets. The experimental results illustrate that the proposed framework is efficient and achieves competitive results compared with existing work in most cases.
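
As a rough illustration of divergence-based relevance scoring, the sketch below ranks discrete features by the Jensen-Shannon divergence between their class-conditional distributions. It is not the APDFS algorithm, which the abstract does not specify; the choice of Jensen-Shannon divergence, the Laplace smoothing, the function names and the toy data are all assumptions.

```python
import math
from collections import Counter

def distribution(values, support):
    """Smoothed empirical distribution of values over a fixed support."""
    counts = Counter(values)
    total = len(values)
    # Laplace smoothing keeps every probability strictly positive
    return [(counts[v] + 1) / (total + len(support)) for v in support]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Symmetric divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def relevance(feature, labels):
    """Divergence between the feature's distribution in each class."""
    support = sorted(set(feature))
    pos = [f for f, y in zip(feature, labels) if y == 1]
    neg = [f for f, y in zip(feature, labels) if y == 0]
    return js_divergence(distribution(pos, support), distribution(neg, support))

# Hypothetical toy data: one feature tracks the label, the other does not.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 0, 0, 0, 0, 1]
noise = [0, 1, 0, 1, 0, 1, 0, 1]

scores = {
    "informative": relevance(informative, labels),
    "noise": relevance(noise, labels),
}
```

A feature whose distribution is identical in both classes scores zero, so a ranking by this score would place it last. A redundancy analysis of the kind the abstract mentions would additionally compare features with each other, which this sketch does not attempt.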

A Novel Model for Imbalanced Data Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6145 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6680-6687

Author(s):

Jian Yin ◽

Chunjing Gan ◽

Kaiqi Zhao ◽

Xuan Lin ◽

Zhe Quan ◽

...

Keyword(s):

Imbalanced Data ◽

Data Classification ◽

Classification Performance ◽

Classification Model ◽

Proposed Model ◽

Imbalanced Data Classification ◽

Public Datasets ◽

Distribution Cost ◽

Novel Model ◽

Learning Data

Recently, imbalanced data classification has received much attention due to its wide applications. In the literature, existing research has attempted to improve classification performance by considering various factors such as the imbalanced distribution, cost-sensitive learning, data space improvement, and ensemble learning. Nevertheless, most existing methods focus on only some of these aspects. In this work, we propose a novel imbalanced data classification model that considers all of these main aspects. To evaluate the performance of our proposed model, we have conducted experiments on 14 public datasets. The results show that our model outperforms the state-of-the-art methods in terms of recall, G-mean, F-measure and AUC.
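
The metrics named above can be computed from confusion-matrix counts. The following minimal sketch assumes a binary problem, takes G-mean as the square root of recall times specificity, and takes F-measure as F1 (equal weight on precision and recall); the function name and toy labels are hypothetical.

```python
import math

def imbalance_metrics(y_true, y_pred, positive=1):
    """Accuracy, recall, G-mean and F-measure for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard each ratio against an empty denominator
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "g_mean": math.sqrt(recall * specificity),
        "f_measure": f_measure,
    }

# Hypothetical 5:95 data set and a classifier that always predicts the majority.
y_true = [1] * 5 + [0] * 95
always_majority = [0] * 100
metrics = imbalance_metrics(y_true, always_majority)
```

On this example accuracy is high while recall, G-mean and F-measure are all zero, which is why imbalanced-data studies report these metrics rather than accuracy alone.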

A New Under-Sampling Method Using Genetic Algorithm for Imbalanced Data Classification

Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication - IMCOM '16 ◽

10.1145/2857546.2857643 ◽

2016 ◽

Cited By ~ 9

Author(s):

Jihyun Ha ◽

Jong-Seok Lee

Keyword(s):

Genetic Algorithm ◽

Sampling Method ◽

Imbalanced Data ◽

Data Classification ◽

Imbalanced Data Classification ◽

Under Sampling

An Active Under-Sampling Approach for Imbalanced Data Classification

2012 Fifth International Symposium on Computational Intelligence and Design ◽

10.1109/iscid.2012.219 ◽

2012 ◽

Cited By ~ 7

Author(s):

Zeping Yang ◽

Daqi Gao

Keyword(s):

Imbalanced Data ◽

Data Classification ◽

Imbalanced Data Classification ◽

Under Sampling ◽

Sampling Approach

CoGBUS- Center of Gravity based under Sampling Method for Imbalanced Data Classification

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2077.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 2463-2468

Keyword(s):

Learning Community ◽

Sampling Method ◽

Class Imbalance ◽

Imbalanced Data ◽

Center Of Gravity ◽

Classification Algorithms ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Imbalanced Data Classification ◽

Under Sampling

Learning from class-imbalanced data has become a challenging issue in the machine learning community, as all classification algorithms are designed to work for balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques, undersampling and oversampling, are more flexible and versatile. This paper introduces a new concept for undersampling based on the Center of Gravity principle, which helps to reduce the excess instances of the majority class. This work is suited for binary class problems. The proposed technique, CoGBUS, overcomes the class imbalance problem and gives the best results in the study. F-Score, G-Mean and ROC are used to evaluate the performance of the method.
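
One simple way to use a centre of gravity for undersampling is sketched below: compute the centroid of the majority class and keep only the majority samples closest to it. This is one plausible reading, not the CoGBUS procedure, which the abstract does not describe; the selection rule, function names and toy data are assumptions.

```python
import math

def centroid(points):
    """Centre of gravity (coordinate-wise mean) of a set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def undersample_by_centroid(majority, n_keep):
    """Keep the n_keep majority points nearest to the majority centroid."""
    centre = centroid(majority)
    return sorted(majority, key=lambda p: math.dist(p, centre))[:n_keep]

# Hypothetical binary toy data: 6 majority points, 2 minority points.
majority = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0), (1.0, 0.9)]
minority = [(5.0, 5.0), (5.5, 4.5)]

# Reduce the majority class to the size of the minority class.
kept = undersample_by_centroid(majority, n_keep=len(minority))
```

Keeping the points nearest the centroid favours representative samples; a method could equally keep the points farthest from it to preserve the class boundary, so the rule here is a design choice rather than a given.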

Over-sampling algorithm for imbalanced data classification

Journal of Systems Engineering and Electronics ◽

10.21629/jsee.2019.06.12 ◽

2019 ◽

Vol 30 (6) ◽

pp. 1182-1191 ◽

Cited By ~ 7

Author(s):

Xiaolong XU ◽

Wen CHEN ◽

Yanfei SUN

Keyword(s):

Imbalanced Data ◽

Data Classification ◽

Sampling Algorithm ◽

Imbalanced Data Classification

Supervised Learning in Multilayer Spiking Neural Networks

Neural Computation ◽

10.1162/neco_a_00396 ◽

2013 ◽

Vol 25 (2) ◽

pp. 473-509 ◽

Cited By ~ 60

Author(s):

Ioana Sporea ◽

André Grüning

Keyword(s):

Neural Networks ◽

Supervised Learning ◽

Neuron Model ◽

Learning Algorithm ◽

Spiking Neural Networks ◽

Reservoir Computing ◽

Data Set ◽

Coding Schemes ◽

Xor Problem ◽

Iris Data

We introduce a supervised learning algorithm for multilayer spiking neural networks. The algorithm overcomes a limitation of existing learning algorithms: it can be applied to neurons firing multiple spikes in artificial neural networks with hidden layers. It can also, in principle, be used with any linearizable neuron model and allows different coding schemes of spike train patterns. The algorithm is applied successfully to classic linearly nonseparable benchmarks such as the XOR problem and the Iris data set, as well as to more complex classification and mapping problems. The algorithm has been successfully tested in the presence of noise, requires smaller networks than reservoir computing, and results in faster convergence than existing algorithms for similar tasks such as SpikeProp.
