New Fuzzy Support Vector Machine for the Class Imbalance Problem in Medical Datasets Classification

The Scientific World Journal ◽

10.1155/2014/536434 ◽

2014 ◽

Vol 2014 ◽

pp. 1-12 ◽

Cited By ~ 7

Author(s):

Xiaoqing Gu ◽

Tongguang Ni ◽

Hongyuan Wang

Keyword(s):

Support Vector Machine ◽

Real World ◽

Class Imbalance ◽

Support Vector ◽

Manifold Regularization ◽

Medical Database ◽

Class Imbalance Problem ◽

Fuzzy Support Vector Machine ◽

Imbalance Problem ◽

Misclassification Costs

In medical dataset classification, the support vector machine (SVM) is considered one of the most successful methods. However, most real-world medical datasets contain outliers or noise, and the data often suffer from class imbalance. In this paper, a fuzzy support vector machine (FSVM) for the class imbalance problem (called FSVM-CIP) is presented. It can be seen as a modified class of FSVM that extends manifold regularization and assigns two misclassification costs, one for each of the two classes. The proposed FSVM-CIP can handle the class imbalance problem in the presence of outliers or noise and enhances the locality maximum margin. Five real-world medical datasets from the UCI medical database (breast, heart, hepatitis, BUPA liver, and pima diabetes) are used to illustrate the method. Experimental results on these datasets show that FSVM-CIP achieves better or comparable effectiveness.
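The two ideas named in this abstract, fuzzy memberships that down-weight outliers and a separate misclassification cost per class, can be illustrated with a minimal sketch. It assumes a centroid-distance membership heuristic and a cost ratio set from the class counts. Both are illustrative choices, not necessarily those of FSVM-CIP, and the manifold regularization term is omitted. The names `fuzzy_memberships`, `sample_penalties`, and `delta` are hypothetical.

```python
import math

def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def fuzzy_memberships(points, delta=0.1):
    """Membership in (0, 1]: points far from their class centroid get lower weight.

    delta is an assumed smoothing constant that keeps the farthest point above zero.
    """
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    radius = max(dists)
    return [1.0 - d / (radius + delta) for d in dists]

def sample_penalties(X, y, c_pos=1.0, c_neg=None):
    """Per-sample penalty = class cost * fuzzy membership (labels are +1 / -1).

    If c_neg is not given, it is scaled by n_pos / n_neg so that the majority
    (negative) class is penalised less per sample. This ratio is an assumption.
    """
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    if c_neg is None:
        c_neg = c_pos * len(pos) / len(neg)
    m_pos = iter(fuzzy_memberships(pos))
    m_neg = iter(fuzzy_memberships(neg))
    return [c_pos * next(m_pos) if label == 1 else c_neg * next(m_neg)
            for label in y]

# Toy data: six negatives (one obvious outlier at [5, 5]) and three positives.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [0.1, 0.1], [0.05, 0.05],
     [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
y = [-1, -1, -1, -1, -1, -1, 1, 1, 1]
penalties = sample_penalties(X, y)
```

Penalties of this kind would typically be passed to an SVM solver as per-sample weights on the slack variables, so that a suspected outlier contributes little to the margin violation cost.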


Biased support vector machine and weighted-smote in handling class imbalance problem

International Journal of Advances in Intelligent Informatics ◽

10.26555/ijain.v4i1.146 ◽

2018 ◽

Vol 4 (1) ◽

pp. 21 ◽

Cited By ~ 21

Author(s):

Hartono Hartono ◽

Opim Salim Sitompul ◽

Tulus Tulus ◽

Erna Budhiarti Nababan

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Data Distribution ◽

Class Imbalance ◽

Support Vector ◽

Class Imbalance Problem ◽

Precise Method ◽

Imbalance Problem

Class imbalance occurs when one class contains far more instances than the others. This major machine learning problem can reduce prediction accuracy. The Support Vector Machine (SVM) is a robust and precise method for handling the class imbalance problem, but it is weak when the data distribution is biased, so the Biased Support Vector Machine (BSVM) has become a popular choice for solving the problem. BSVM provides better control of sensitivity, yet lacks accuracy compared with the general SVM. This study proposes integrating BSVM and SMOTEBoost to handle the class imbalance problem. Non-Support Vector (NSV) sets from negative samples and Support Vector (SV) sets from positive samples undergo a Weighted-SMOTE process. The results indicate that combining the Biased Support Vector Machine with Weighted-SMOTE achieves better accuracy and sensitivity.
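The SMOTE step mentioned above creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that interpolation with a per-sample weight controlling how often each point is used as a seed. The weighting scheme, the function names, and the toy data are assumptions for illustration and are not the Weighted-SMOTE procedure of the paper.

```python
import math
import random

def nearest_neighbors(point, others, k):
    """The k points in `others` closest to `point`, excluding `point` itself."""
    candidates = [o for o in others if o is not point]
    return sorted(candidates, key=lambda o: math.dist(point, o))[:k]

def weighted_smote(minority, weights, n_new, k=2, seed=0):
    """Generate n_new synthetic samples; seeds are drawn in proportion to `weights`."""
    rng = random.Random(seed)
    synthetic = []
    for base in rng.choices(minority, weights=weights, k=n_new):
        neighbor = rng.choice(nearest_neighbors(base, minority, k))
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority class; the last point is given five times the weight of the others.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = weighted_smote(minority, weights=[1, 1, 1, 5], n_new=10)
```

Because every synthetic point lies on a segment between two existing minority points, it stays inside the region already occupied by the minority class.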


Oversample Based Large Scale Support Vector Machine for Online Class Imbalance Problem

Big Data Analytics - Lecture Notes in Computer Science ◽

10.1007/978-3-030-04780-1_24 ◽

2018 ◽

pp. 348-362

Author(s):

D. Himaja ◽

T. Maruthi Padmaja ◽

P. Radha Krishna

Keyword(s):

Support Vector Machine ◽

Large Scale ◽

Class Imbalance ◽

Support Vector ◽

Class Imbalance Problem ◽

Online Class ◽

Imbalance Problem


A New Method of Support Vector Machine for Class Imbalance Problem

2009 International Joint Conference on Computational Sciences and Optimization ◽

10.1109/cso.2009.169 ◽

2009 ◽

Cited By ~ 1

Author(s):

Li Yan ◽

Danrui Xei ◽

Zhe Du

Keyword(s):

Support Vector Machine ◽

Class Imbalance ◽

New Method ◽

Support Vector ◽

Class Imbalance Problem ◽

Imbalance Problem


Identify Lysine Neddylation Sites Using Bi-profile Bayes Feature Extraction via the Chou’s 5-steps Rule and General Pseudo Components

Current Genomics ◽

10.2174/1389202921666191223154629 ◽

2020 ◽

Vol 20 (8) ◽

pp. 592-601

Author(s):

Zhe Ju ◽

Shi-Yun Wang

Keyword(s):

Support Vector Machine ◽

Feature Extraction ◽

Operating Characteristic ◽

Characteristic Curve ◽

Class Imbalance ◽

Support Vector ◽

Fuzzy Support Vector Machine ◽

Operating Characteristic Curve ◽

Matthews Correlation Coefficient ◽

User Friendly

Introduction: Neddylation is a highly dynamic and reversible post-translational modification that is involved in various biological processes. Its abnormality has previously been shown to be closely related to some human diseases. The detection of neddylation sites is essential for elucidating the regulation mechanisms of protein neddylation. Objective: As the detection of lysine neddylation sites by traditional experimental methods is often expensive and time-consuming, it is imperative to design computational methods to identify neddylation sites. Methods: In this study, a bioinformatics tool named NeddPred is developed to identify protein neddylation sites. Bi-profile Bayes feature extraction is used to encode neddylation sites, and a fuzzy support vector machine model is used to overcome the problems of noise and class imbalance in the prediction. Results: In 10-fold cross-validation, NeddPred achieves a Matthews correlation coefficient of 0.7082 and an area under the receiver operating characteristic curve of 0.9769. Independent tests show that NeddPred significantly outperforms the existing lysine neddylation site predictor NeddyPreddy. Conclusion: NeddPred can therefore complement the existing tools for the prediction of neddylation sites. A user-friendly web server for NeddPred is accessible at 123.206.31.171/NeddPred/.
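The Matthews correlation coefficient reported here is computed from the four cells of a binary confusion matrix. A minimal sketch follows; the function name and the example counts are illustrative and are not taken from the paper.

```python
import math

def matthews_corrcoef(tp, fp, tn, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# A classifier that labels every sample negative on a 95:5 dataset
# reaches 95% accuracy, yet its MCC is 0.
mcc_all_negative = matthews_corrcoef(tp=0, fp=0, tn=95, fn=5)
mcc_perfect = matthews_corrcoef(tp=5, fp=0, tn=95, fn=0)
```

This is why MCC is often preferred to plain accuracy on imbalanced data: it only rewards a model that does well on both classes.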


Pulsar candidate classification using generative adversary networks

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stz2975 ◽

2019 ◽

Vol 490 (4) ◽

pp. 5424-5439 ◽

Cited By ~ 3

Author(s):

Ping Guo ◽

Fuqing Duan ◽

Pei Wang ◽

Yao Yao ◽

Qian Yin ◽

...

Keyword(s):

Class Imbalance ◽

Feature Learning ◽

Computer Experiments ◽

Support Vector ◽

Data Sets ◽

Class Imbalance Problem ◽

Generative Adversarial Network ◽

Feature Representations ◽

Adversarial Network ◽

Imbalance Problem

Discovering pulsars is a significant and meaningful research topic in the field of radio astronomy. With the advent of astronomical instruments, the volume and rate of data acquisition have grown exponentially. This development necessitates a focus on artificial intelligence (AI) technologies that can mine large astronomical data sets. Automatic pulsar candidate identification (APCI) can be considered a task of determining potential candidates for further investigation and eliminating the noise of radio-frequency interference and other non-pulsar signals. As reported in the existing literature, AI techniques, especially convolutional neural network (CNN)-based techniques, have been adopted for APCI. However, it is challenging to enhance the performance of CNN-based pulsar identification because only an extremely limited number of real pulsar samples exist, which results in a crucial class imbalance problem. To address these problems, we propose a framework that combines a deep convolutional generative adversarial network (DCGAN) with a support vector machine (SVM). The DCGAN is used as a sample generation and feature learning model, and the SVM is adopted as the classifier for predicting the label of a candidate at the inference stage. The proposed framework is a novel technique that not only can solve the class imbalance problem but also can learn discriminative feature representations of pulsar candidates instead of computing hand-crafted features in pre-processing steps. The proposed method can enhance the accuracy of APCI, and computer experiments performed on two pulsar data sets verified its effectiveness and efficiency.


HCBST: An Efficient Hybrid Sampling Technique for Class Imbalance Problems

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3488280 ◽

2022 ◽

Vol 16 (3) ◽

pp. 1-37

Author(s):

Robert A. Sowah ◽

Bernard Kuditchar ◽

Godfrey A. Mills ◽

Amevi Acakpovi ◽

Raphael A. Twum ◽

...

Keyword(s):

Geometric Mean ◽

Class Imbalance ◽

Sampling Technique ◽

Data Repository ◽

Support Vector ◽

Classification Algorithms ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

High Degree ◽

Hybrid Sampling

The class imbalance problem is prevalent in many real-world domains and has become an active area of research. In binary classification problems, imbalance learning refers to learning from a dataset that is highly skewed toward the negative class. This skew causes classification algorithms to perform poorly when predicting the positive class for new examples. Data resampling, which involves manipulating the training data before applying standard classification techniques, is among the most commonly used techniques for dealing with the class imbalance problem. This article presents a new hybrid sampling technique that significantly improves the overall performance of classification algorithms on the class imbalance problem. The proposed method, called the Hybrid Cluster-Based Undersampling Technique (HCBST), combines a cluster undersampling technique, which under-samples the majority instances, with an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination, which oversamples the minority instances, to solve the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested using 11 datasets with varying degrees of imbalance from the National Aeronautics and Space Administration Metric Data Program data repository and the University of California Irvine Machine Learning data repository. Results were compared using classification algorithms such as K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Test results revealed that, for the same datasets, the HCBST performed better, with average performances of 0.73, 0.67, and 0.35 in terms of area under curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used in this study. The HCBST has the potential to improve performance on the class imbalance problem, which, by extension, will benefit the various applications that rely on it for a solution.
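The geometric mean reported above is commonly defined as the square root of sensitivity times specificity. A minimal sketch under that assumed definition follows; the counts are made up for illustration.

```python
import math

def geometric_mean(tp, fp, tn, fn):
    """G-mean = sqrt(sensitivity * specificity) for a binary confusion matrix."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    return math.sqrt(sensitivity * specificity)

# Sensitivity 0.8 and specificity 0.9 give a G-mean of sqrt(0.72).
g_balanced = geometric_mean(tp=8, fp=10, tn=90, fn=2)
# Ignoring the positive class entirely drives the G-mean to zero.
g_all_negative = geometric_mean(tp=0, fp=0, tn=100, fn=10)
```

Because the two recalls are multiplied, a classifier cannot score well by favouring the majority class alone.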


CLASSIFICATION OF IMBALANCED DATA: A REVIEW

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001409007326 ◽

2009 ◽

Vol 23 (04) ◽

pp. 687-719 ◽

Cited By ~ 534

Author(s):

YANMIN SUN ◽

ANDREW K. C. WONG ◽

MOHAMED S. KAMEL

Keyword(s):

Learning Algorithms ◽

Class Imbalance ◽

Imbalanced Data ◽

Class Imbalance Problem ◽

Class Distribution ◽

Imbalance Problem ◽

Misclassification Costs ◽

Imbalanced Class Distribution ◽

Classifier Learning

Imbalanced class distributions significantly limit the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This paper reviews the classification of imbalanced data with regard to the application domains; the nature of the problem; the learning difficulties with standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.


Exploiting Correlation Subspace to Predict Heterogeneous Cross-Project Defects

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194016710017 ◽

2016 ◽

Vol 26 (09n10) ◽

pp. 1571-1580 ◽

Cited By ~ 6

Author(s):

Ming Cheng ◽

Guoqing Wu ◽

Hongyan Wan ◽

Guoan You ◽

Mengting Yuan ◽

...

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Feature Space ◽

Support Vector ◽

Class Imbalance Problem ◽

Classifier Design ◽

Imbalance Problem ◽

Project Data ◽

The Impact ◽

Cross Project

Cross-project defect prediction trains a prediction model using historical data from source projects and applies the model to target projects. Most previous efforts assumed that the cross-project data share the same metric set, meaning that both the metrics used and the size of the metric set are the same. However, this assumption may not hold in practical scenarios. In addition, software defect datasets suffer from the class imbalance problem, which makes it harder for the learner to predict defects. In this paper, we advance canonical correlation analysis by deriving a joint feature space for associating cross-project data. We also propose a novel support vector machine algorithm that incorporates the correlation transfer information into classifier design for cross-project prediction. Moreover, we take different misclassification costs into consideration so that the classifier is inclined to label a module as defective, which alleviates the impact of imbalanced data. The experimental results show that our method is more effective than state-of-the-art methods.
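One generic way to see how unequal misclassification costs tilt a classifier toward the defective class is the cost-sensitive decision threshold for a probabilistic classifier. The sketch below uses that textbook rule, not the paper's SVM formulation. The function names and cost values are hypothetical.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Predict 'defective' when P(defective) >= cost_fp / (cost_fp + cost_fn).

    This minimises expected cost when correct predictions cost nothing.
    """
    return cost_fp / (cost_fp + cost_fn)

def classify(prob_defective, cost_fp=1.0, cost_fn=1.0):
    """Return 1 (defective) or 0 (clean) for an estimated defect probability."""
    threshold = cost_sensitive_threshold(cost_fp, cost_fn)
    return 1 if prob_defective >= threshold else 0

# With equal costs the threshold is 0.5, so a module at 0.3 is labelled clean.
label_equal_costs = classify(0.3)
# If missing a defect is assumed four times as costly as a false alarm,
# the threshold drops to 0.2 and the same module is labelled defective.
label_unequal_costs = classify(0.3, cost_fp=1.0, cost_fn=4.0)
```

Raising the cost of a missed defect lowers the threshold, which trades more false alarms for fewer missed defective modules.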


Evaluation of imbalanced datasets using fuzzy support vector machine-class imbalance learning (FSVM-CIL)

2011 International Conference on Recent Trends in Information Technology (ICRTIT) ◽

10.1109/icrtit.2011.5972431 ◽

2011 ◽

Cited By ~ 2

Author(s):

B. Lakshmanan ◽

A. Jeril Priscilla ◽

S. Ponni ◽

V. Sankari

Keyword(s):

Support Vector Machine ◽

Class Imbalance ◽

Support Vector ◽

Fuzzy Support Vector Machine ◽

Imbalanced Datasets ◽

Imbalance Learning ◽

Class Imbalance Learning


A Novel Hybrid Sampling Algorithm for Solving Class Imbalance Problem in Big Data

Advances in Data Science and Adaptive Analysis ◽

10.1142/s2424922x21500054 ◽

2021 ◽

pp. 2150005

Author(s):

Khyati Ahlawat ◽

Anuradha Chug ◽

Amit Prakash Singh

Keyword(s):

Big Data ◽

Class Imbalance ◽

Support Vector ◽

Efficiency Gain ◽

Learning Approaches ◽

K Nearest Neighbor ◽

Class Imbalance Problem ◽

Sampling Algorithm ◽

Imbalance Problem ◽

Hybrid Sampling

An uneven distribution of classes in a dataset biases standard classifiers toward the majority class. Instances of the class of interest, being few in number, are generally ignored, and their correct classification, which is of paramount interest, is often overlooked when only overall accuracy is calculated. Conventional machine learning approaches have therefore been refined to address this class imbalance problem. The challenge is more prevalent in big data scenarios because of the high data volume. This study presents a sampling solution based on cluster computing for handling class imbalance problems in big data. The proposed hybrid sampling algorithm (HSA) is assessed with three popular classification algorithms, namely support vector machine, decision tree, and k-nearest neighbor, in terms of balanced accuracy and elapsed time. The experimental results are promising, with an efficiency gain of 42% over the traditional sampling solution, the synthetic minority oversampling technique (SMOTE). This work demonstrates the effectiveness of the distribution and clustering principle in imbalanced big data scenarios.
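The contrast between overall accuracy and balanced accuracy drawn in this abstract can be shown with a short sketch. Balanced accuracy is taken here as the mean of per-class recall, and the 95:5 toy labels are invented for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(1 for i in idx if y_pred[i] == c) / len(idx))
    return sum(recalls) / len(recalls)

# A classifier that always predicts the majority class on a 95:5 dataset.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

Overall accuracy looks strong for this majority-only classifier, while balanced accuracy drops to chance level because the minority class is never detected.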
