Semi-supervised Classification Based Mixed Sampling for Imbalanced Data

Open Physics ◽  
2019 ◽  
Vol 17 (1) ◽  
pp. 975-983
Author(s):  
Jianhua Zhao ◽  
Ning Liu

Abstract In practical applications, there is a large amount of imbalanced data containing only a small number of labeled samples. To improve classification performance on this kind of problem, this paper proposes a semi-supervised learning algorithm based on mixed sampling for imbalanced data classification (S2MAID), which combines semi-supervised learning, oversampling, undersampling and ensemble learning. First, an undersampling algorithm, UD-density, is provided to select samples with high information content from the majority class set for semi-supervised learning. Second, a safe supervised-learning method is used to label unlabeled samples and expand the labeled set. Third, an oversampling algorithm, SMOTE-density, is provided to balance the imbalanced data set. Fourth, an ensemble technique is used to generate a strong classifier. Finally, experiments are carried out on imbalanced data containing only a few labeled samples, and the semi-supervised learning process is simulated. The experimental results show that the proposed S2MAID achieves better classification performance.
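The oversampling step can be illustrated with a minimal sketch of the plain SMOTE idea: a synthetic minority sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is not the paper's SMOTE-density variant, whose density weighting is not described in the abstract; the function name `smote_like`, the parameter `k` and the toy data are assumptions made for illustration only.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Hypothetical helper: create n_new synthetic minority points by
    interpolating between a sample and one of its k nearest minority
    neighbours (plain SMOTE idea, not the paper's SMOTE-density)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy minority class (assumed data, two features per sample).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class.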

2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Sandeepkumar Hegde ◽  
Monica R. Mundada

Purpose According to the World Health Organization, by 2025 chronic disease is expected to account for 73% of all deaths and 60% of the global burden of disease. These diseases persist for a long time, are almost incurable and can only be controlled. Cardiovascular disease, chronic kidney disease (CKD) and diabetes mellitus are considered the three major chronic diseases whose risk increases as adults get older, and CKD is considered a major disease among them. Overall, 10% of the world's population is affected by CKD, and this is likely to double by 2030. The paper aims to propose a novel feature selection approach, in combination with a machine-learning algorithm, that can predict chronic disease early and with high accuracy. Hence, a novel adaptive probabilistic divergence-based feature selection (APDFS) algorithm is proposed in combination with the hyper-parameterized logistic regression model (HLRM) for the early prediction of chronic disease.
Design/methodology/approach A novel feature selection APDFS algorithm is proposed which explicitly handles the features associated with the class label by relevance and redundancy analysis. The algorithm applies statistical divergence-based information theory to identify the relationship between the distant features of the chronic disease data set. The data set used in the experiments was obtained from several medical labs and hospitals in India. The HLRM is used as the machine-learning classifier. The predictive ability of the framework is compared with various algorithms and across various chronic disease data sets. The experimental results illustrate that the proposed framework is efficient and achieves competitive results compared to existing work in most cases.
Findings The performance of the proposed framework is validated using metrics such as recall, precision, F1 measure and ROC. The predictive performance of the proposed framework is analyzed on data sets belonging to various chronic diseases such as CKD, diabetes and heart disease. The diagnostic ability of the proposed approach is demonstrated by comparing its results with existing algorithms. The experimental figures illustrate that the proposed framework performed exceptionally well in early prediction of CKD, with an accuracy of 91.6.
Originality/value The capability of machine learning algorithms depends on feature selection (FS) algorithms to identify the relevant traits in the data set, which affect the predictive result. Feature selection is the process of choosing the relevant features from the data set by removing redundant and irrelevant features. Although many approaches have already been proposed toward this objective, they are computationally complex because they select the features in a one-step scheme. In this paper, a novel feature selection APDFS algorithm is proposed which explicitly handles the features associated with the class label by relevance and redundancy analysis. The proposed algorithm handles the process of feature selection in two separate indices. Hence, the computational complexity of the algorithm is reduced to O(nk+1). The algorithm applies statistical divergence-based information theory to identify the relationship between the distant features of the chronic disease data set. The data set used in the experiments was obtained from several medical labs and hospitals of Karkala taluk, India. The HLRM is used as the machine learning classifier. The predictive ability of the framework is compared with various algorithms and across various chronic disease data sets. The experimental results illustrate that the proposed framework is efficient and achieves competitive results compared to existing work in most cases.
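The validation metrics named in the findings can be computed from the counts of true positives, false positives and false negatives. The sketch below uses the standard definitions of precision, recall and F1; the function name `precision_recall_f1` and the toy labels are assumptions for illustration and are not taken from the paper's data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Hypothetical helper: standard precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels (assumed): 1 = disease present, 0 = disease absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this toy case there are three true positives, one false positive and one false negative, so precision, recall and F1 all equal 0.75.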


2020 ◽  
Vol 34 (04) ◽  
pp. 6680-6687
Author(s):  
Jian Yin ◽  
Chunjing Gan ◽  
Kaiqi Zhao ◽  
Xuan Lin ◽  
Zhe Quan ◽  
...  

Recently, imbalanced data classification has received much attention due to its wide range of applications. In the literature, existing research has attempted to improve classification performance by considering various factors such as the imbalanced distribution, cost-sensitive learning, data space improvement, and ensemble learning. Nevertheless, most existing methods focus on only some of these aspects. In this work, we propose a novel imbalanced data classification model that considers all of them. To evaluate the performance of our proposed model, we have conducted experiments on 14 public datasets. The results show that our model outperforms the state-of-the-art methods in terms of recall, G-mean, F-measure and AUC.
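G-mean, one of the metrics listed above, is the geometric mean of sensitivity (recall on the positive class) and specificity (recall on the negative class). The sketch below uses that standard definition to show why it suits imbalanced data better than accuracy; the function name `g_mean` and the toy labels are assumptions for illustration.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Hypothetical helper: sqrt(sensitivity * specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# Toy imbalanced labels (assumed): 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred_all_negative = [0] * 10                        # 80% accurate, finds no positives
y_pred_mixed = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]         # finds one of two positives

score_all_negative = g_mean(y_true, y_pred_all_negative)
score_mixed = g_mean(y_true, y_pred_mixed)
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy data but a G-mean of zero, because its sensitivity is zero.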


2019 ◽  
Vol 8 (2) ◽  
pp. 2463-2468

Learning from class-imbalanced data has become a challenging issue in the machine learning community, as all classification algorithms are designed to work on balanced datasets. Several methods are available to tackle this issue, among which the resampling techniques, undersampling and oversampling, are the more flexible and versatile. This paper introduces a new undersampling concept based on the Center of Gravity principle, which helps to reduce the excess instances of the majority class. This work is suited to binary class problems. The proposed technique, CoGBUS, overcomes the class imbalance problem and gives the best results in the study. We use F-Score, GMean and ROC to evaluate the performance of the method.
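One way to picture undersampling around a center of gravity is sketched below: compute the centroid of the majority class and keep only a fixed number of majority samples. The abstract does not state which samples CoGBUS keeps, so the selection rule used here (keep the samples closest to the centroid), the function names and the toy data are all assumptions, not the paper's method.

```python
def centroid(points):
    """Center of gravity of equally weighted points: the per-feature mean."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))

def undersample_by_centroid(majority, n_keep):
    """Hypothetical rule: keep the n_keep majority samples nearest the centroid."""
    center = centroid(majority)
    ranked = sorted(
        majority,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, center)),
    )
    return ranked[:n_keep]

# Toy majority class (assumed data) with two outlying samples.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (2.0, 2.0), (0.0, 0.3), (3.0, 0.0)]
kept = undersample_by_centroid(majority, n_keep=3)
```

With this rule the two outlying samples are dropped first; a different rule, such as keeping samples far from the centroid to preserve class boundaries, would fit the same skeleton by changing the sort order.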


2013 ◽  
Vol 25 (2) ◽  
pp. 473-509 ◽  
Author(s):  
Ioana Sporea ◽  
André Grüning

We introduce a supervised learning algorithm for multilayer spiking neural networks. The algorithm overcomes a limitation of existing learning algorithms: it can be applied to neurons firing multiple spikes in artificial neural networks with hidden layers. It can also, in principle, be used with any linearizable neuron model and allows different coding schemes of spike train patterns. The algorithm is applied successfully to classic linearly nonseparable benchmarks such as the XOR problem and the Iris data set, as well as to more complex classification and mapping problems. The algorithm has been successfully tested in the presence of noise, requires smaller networks than reservoir computing, and results in faster convergence than existing algorithms for similar tasks such as SpikeProp.

