A New Under-Sampling Method to Face Class Overlap and Imbalance

Angélica Guzmán-Ponce; Rosa María Valdovinos; José Salvador Sánchez; José Raymundo Marcial-Romero

doi:10.3390/app10155164

A New Under-Sampling Method to Face Class Overlap and Imbalance

Applied Sciences ◽

10.3390/app10155164 ◽

2020 ◽

Vol 10 (15) ◽

pp. 5164

Author(s):

Angélica Guzmán-Ponce ◽

Rosa María Valdovinos ◽

José Salvador Sánchez ◽

José Raymundo Marcial-Romero

Keyword(s):

Clustering Algorithm ◽

Nearest Neighbor ◽

Minimum Spanning Tree ◽

Real Life ◽

Class Imbalance ◽

Sampling Technique ◽

Significant Loss ◽

Support Vector ◽

Dbscan Clustering ◽

Under Sampling

Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining as they may cause a significant loss in performance. Several solutions have been proposed to face both data difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm to remove noisy samples and clean the decision boundary with a minimum spanning tree algorithm to face the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving the performance of classifiers. An extensive experimental study shows a significantly better behavior of the new algorithm as compared to 12 state-of-the-art under-sampling methods using three standard classification models (nearest neighbor rule, J48 decision tree, and support vector machine with a linear kernel) on both real-life and synthetic databases.
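
The noise-removal stage described above can be illustrated with a minimal pure-Python DBSCAN. This is only a sketch of that first stage, not the authors' implementation: the function names, the `eps` and `min_pts` values, and the choice to filter only the majority class are assumptions made for illustration, and the minimum-spanning-tree stage is not shown.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan_labels(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

def remove_majority_noise(majority, eps, min_pts):
    """Hypothetical helper: keep only majority samples DBSCAN does not flag as noise."""
    labels = dbscan_labels(majority, eps, min_pts)
    return [p for p, lab in zip(majority, labels) if lab != -1]

# Four tightly grouped majority samples and one isolated sample.
majority = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
kept = remove_majority_noise(majority, eps=1.5, min_pts=3)
```

With these illustrative settings the isolated sample `(5, 5)` is labeled as noise and dropped, while the four grouped samples are kept.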

A Deep Learning Based Method for the Non-Destructive Measuring of Rock Strength through Hammering Sound

Applied Sciences ◽

10.3390/app9173484 ◽

2019 ◽

Vol 9 (17) ◽

pp. 3484

Author(s):

Shuai Han ◽

Heng Li ◽

Mingchao Li ◽

Timothy Rose

Keyword(s):

Clustering Algorithm ◽

Nearest Neighbor ◽

Rock Strength ◽

Support Vector ◽

K Nearest Neighbor ◽

Strength Measurement ◽

Regression Algorithms ◽

Almost All ◽

The Relationship ◽

Non Destructive

Hammering rocks of different strengths produces different sounds. Geological engineers often use this method to approximate the strengths of rocks in geological surveys. The method is quick and convenient but subjective. Inspired by this problem, we present a new, non-destructive method for measuring the surface strengths of rocks based on a deep neural network (DNN) and spectrogram analysis. All the hammering sounds are first transformed into spectrograms, and a clustering algorithm is presented to filter out the outliers of the spectrograms automatically. One of the most advanced image classification DNNs, Inception-ResNet-v2, is then re-trained with the spectrograms. The results show that the training accuracy is up to 94.5%. Following this, three regression algorithms, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF), are adopted to fit the relationship between the outputs of the DNN and the strength values. The tests show that KNN has the highest fitting accuracy, and SVM has the strongest generalization ability. The strengths (represented by rebound values) of almost all the samples can be predicted within an error of [−5, 5]. Overall, the proposed method has great potential in supporting the implementation of efficient rock strength measurement methods in the field.

Pear Defect Detection Method Based on ResNet and DCGAN

Information ◽

10.3390/info12100397 ◽

2021 ◽

Vol 12 (10) ◽

pp. 397

Author(s):

Yan Zhang ◽

Shiyun Wa ◽

Pengshuo Sun ◽

Yaojun Wang

Keyword(s):

Defect Detection ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Detection System ◽

Implementation Process ◽

Support Vector ◽

Detection Accuracy ◽

K Nearest Neighbor ◽

Generation Network ◽

Low Efficiency

To address the current situation, in which pear defect detection still relies on inefficient manual labor, we propose the use of a CNN model to detect pear defects. Since it is challenging to obtain defect images in the implementation process, a deep convolutional adversarial generation network was used to augment the defect images. As the experimental results indicated, the detection accuracy of the proposed method on the 3000-image validation set was as high as 97.35%. Various mainstream CNNs were compared to evaluate the model’s performance thoroughly, and the top performer was selected to conduct further comparative experiments with traditional machine learning methods, such as the support vector machine algorithm, random forest algorithm, and k-nearest neighbor clustering algorithm. Moreover, two other pear varieties that were not used in training were chosen to validate the robustness and generalization capability of the model. The validation results illustrated that the proposed method is more accurate than the commonly used algorithms for pear defect detection, and that it is robust enough to generalize well to other datasets. In order to allow the method proposed in this paper to be applied in agriculture, an intelligent pear defect detection system was built based on an iOS device.

Received Signal Strength-Based Indoor Localization Using Hierarchical Classification

Sensors ◽

10.3390/s20041067 ◽

2020 ◽

Vol 20 (4) ◽

pp. 1067 ◽

Cited By ~ 6

Author(s):

Chenbin Zhang ◽

Ningning Qin ◽

Yanbo Xue ◽

Le Yang

Keyword(s):

Clustering Algorithm ◽

Nearest Neighbor ◽

Indoor Localization ◽

Hierarchical Classification ◽

Signal Strength ◽

Received Signal Strength ◽

Support Vector ◽

K Nearest Neighbor ◽

Position Information ◽

Area Of Interest

Commercial interest in indoor localization has been increasing in the past decade. The success of many applications relies at least partially on indoor localization that is expected to provide reliable indoor position information. Wi-Fi received signal strength (RSS)-based indoor localization techniques have attracted extensive attention because Wi-Fi access points (APs) are widely deployed and Wi-Fi RSS measurements can be obtained without extra hardware cost. In this paper, we propose a hierarchical classification-based method as a new solution to the indoor localization problem. Within the developed approach, we first adopt an improved K-Means clustering algorithm to divide the area of interest into several zones, which are allowed to overlap with one another to improve the generalization capability of the subsequent indoor positioning process. To find the localization result, the K-Nearest Neighbor (KNN) algorithm and support vector machine (SVM) with the one-versus-one strategy are employed. The proposed method is implemented on a tablet, and its performance is evaluated in real-world environments. Experimental results reveal that the proposed method offers an improvement of 1.4% to 3.2% in terms of position classification accuracy and a reduction of 10% to 22% in terms of average positioning error compared with several benchmark methods.
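
The two-level idea in this abstract (pick a zone first, then classify the position inside it) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's method: zone centroids are taken as given rather than produced by the improved K-Means step, the zone is chosen by nearest centroid, only the KNN classifier is shown, and all names and RSS values are hypothetical.

```python
from collections import Counter
from math import dist

def nearest_zone(rss, centroids):
    """Coarse level: pick the zone whose centroid is closest to the RSS vector."""
    return min(centroids, key=lambda zone: dist(rss, centroids[zone]))

def knn_predict(rss, fingerprints, k):
    """Fine level: majority vote among the k closest (rss_vector, label) fingerprints."""
    nearest = sorted(fingerprints, key=lambda fp: dist(rss, fp[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def locate(rss, centroids, zone_fingerprints, k=3):
    """Return (zone, position label) for one RSS measurement."""
    zone = nearest_zone(rss, centroids)
    return zone, knn_predict(rss, zone_fingerprints[zone], k)

# Hypothetical RSS values (dBm) from two access points.
centroids = {"A": (-40, -70), "B": (-70, -40)}
zone_fingerprints = {
    "A": [((-38, -72), "room1"), ((-42, -69), "room1"), ((-45, -66), "room2")],
    # "room2" also appears in zone B: zones are allowed to overlap.
    "B": [((-72, -41), "room3"), ((-68, -39), "room3"), ((-66, -45), "room2")],
}
result = locate((-41, -70), centroids, zone_fingerprints)
```

Restricting the fine-level search to one zone keeps the candidate set small, and letting zones share positions near their borders reduces the cost of a wrong zone decision.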

HCBST: An Efficient Hybrid Sampling Technique for Class Imbalance Problems

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3488280 ◽

2022 ◽

Vol 16 (3) ◽

pp. 1-37

Author(s):

Robert A. Sowah ◽

Bernard Kuditchar ◽

Godfrey A. Mills ◽

Amevi Acakpovi ◽

Raphael A. Twum ◽

...

Keyword(s):

Geometric Mean ◽

Class Imbalance ◽

Sampling Technique ◽

Data Repository ◽

Support Vector ◽

Classification Algorithms ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

High Degree ◽

Hybrid Sampling

The class imbalance problem is prevalent in many real-world domains and has become an active area of research. In binary classification problems, imbalance learning refers to learning from a dataset with a high degree of skewness to the negative class. This phenomenon causes classification algorithms to perform poorly when predicting positive classes with new examples. Data resampling, which involves manipulating the training data before applying standard classification techniques, is among the most commonly used techniques to deal with the class imbalance problem. This article presents a new hybrid sampling technique that significantly improves the overall performance of classification algorithms on the class imbalance problem. The proposed method, called the Hybrid Cluster-Based Undersampling Technique (HCBST), combines a cluster undersampling technique to under-sample the majority instances with an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination to oversample the minority instances, solving the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested using 11 datasets from the National Aeronautics and Space Administration Metric Data Program data repository and the University of California Irvine Machine Learning data repository with varying degrees of imbalance. Results were compared with classification algorithms such as K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Test results revealed that, for the same datasets, the HCBST performed better, with average performances of 0.73, 0.67, and 0.35 in terms of area under curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used for this study. The HCBST has the potential to improve performance on the class imbalance problem, which, by extension, will benefit the various applications that rely on the concept for a solution.
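
Two of the measures reported above, the geometric mean and the Matthews Correlation Coefficient, can be computed from a binary confusion matrix as in the sketch below. The formulas are the standard definitions; the function names and the example counts are illustrative and do not come from the paper.

```python
from math import sqrt

def geometric_mean(tp, fn, tn, fp):
    """Square root of sensitivity times specificity."""
    sensitivity = tp / (tp + fn)   # recall on the positive (minority) class
    specificity = tn / (tn + fp)   # recall on the negative (majority) class
    return sqrt(sensitivity * specificity)

def matthews_corrcoef(tp, fn, tn, fp):
    """MCC in [-1, 1]; returns 0.0 when the denominator is zero (a common convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A classifier that labels everything negative on 10 positives and 90 negatives:
# accuracy is 0.9, yet both measures expose that the minority class is never found.
always_negative = dict(tp=0, fn=10, tn=90, fp=0)
accuracy = (always_negative["tp"] + always_negative["tn"]) / 100
gmean = geometric_mean(**always_negative)
mcc = matthews_corrcoef(**always_negative)
```

Both measures are 0 for this degenerate classifier even though its accuracy is 0.9, which is why they are preferred over plain accuracy on imbalanced data.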

Dynamic financial distress prediction based on class-imbalanced data batches

International Journal of Financial Engineering ◽

10.1142/s2424786321500262 ◽

2021 ◽

pp. 2150026

Author(s):

Jie Sun ◽

Xin Liu ◽

Wenguo Ai ◽

Qianyuan Tian

Keyword(s):

Financial Distress ◽

Time Window ◽

Concept Drift ◽

Class Imbalance ◽

Imbalanced Data ◽

Sampling Technique ◽

Support Vector ◽

Multiple Discriminant Analysis ◽

Financial Distress Prediction ◽

Distress Prediction

This study proposes two approaches for dynamic financial distress prediction (FDP) based on class-imbalanced data batches by considering both concept drift and class imbalance. One is based on sliding time window and synthetic minority over-sampling technique (SMOTE) and the other is based on sliding time window and majority class partition. Support vector machine, multiple discriminant analysis (MDA) and logistic regression are used as base classifiers in the experiments on a real-world dataset. The results indicate that the two approaches perform better than the pure dynamic FDP (DFDP) models without class imbalance processing and the static FDP models either with or without class imbalance processing.
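
The first approach (sliding time window plus SMOTE) can be sketched roughly as follows. This is an illustration only: the window size, the number of neighbours `k`, the label convention (1 for the minority class) and all function names are assumptions, and the interpolation is a simplified form of SMOTE rather than the authors' exact procedure.

```python
import random
from math import dist

def sliding_window(batches, size):
    """Keep only the most recent `size` batches, so older concepts are forgotten."""
    return [sample for batch in batches[-size:] for sample in batch]

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2 or n_new <= 0:
        return []
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

def balanced_window(batches, size, seed=0):
    """Window the batches, then oversample label 1 until both classes match in size."""
    window = sliding_window(batches, size)
    minority = [x for x, y in window if y == 1]
    majority = [x for x, y in window if y == 0]
    extra = smote(minority, len(majority) - len(minority), seed=seed)
    return window + [(x, 1) for x in extra]

# Three hypothetical yearly batches of (features, label) pairs.
batches = [
    [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((1.0, 1.0), 1)],  # oldest, dropped by the window
    [((0.2, 0.1), 0), ((0.3, 0.3), 0), ((0.9, 1.1), 1)],
    [((0.1, 0.1), 0), ((0.2, 0.4), 0), ((1.2, 0.8), 1)],
]
train = balanced_window(batches, size=2)
```

Dropping the oldest batch is how the window responds to concept drift, while the interpolation step addresses the imbalance inside the window that remains.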

Efficient Semi-Supervised Learning and Sparse Structural Learning for Feature Selection of Leukemia Dataset

Journal of Medical Imaging and Health Informatics ◽

10.1166/jmihi.2020.3110 ◽

2020 ◽

Vol 10 (8) ◽

pp. 1815-1824

Author(s):

S. Nithya Roopa ◽

N. Nagarajan

Keyword(s):

Feature Selection ◽

Supervised Learning ◽

Health Informatics ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Research Work ◽

Real Life ◽

Support Vector ◽

Structural Learning ◽

Huge Amount

The amount of data produced in health informatics is growing rapidly, and analyzing this huge amount of data requires considerable expertise. The basic aim of health informatics is to take in real-world medical data from all levels of human existence to help improve our understanding of medicine and medical practices. Huge amounts of unlabeled data are available in many real-life data-mining tasks, e.g., uncategorized messages in an automatic email categorization system, genes of unknown function in gene function prediction, and so on. Labelled data is frequently limited and expensive to produce, since labelling typically requires human expertise. Consequently, semi-supervised learning has become a topic of significant recent interest. This research work proposes a new semi-supervised clustering approach, in which the performance of unsupervised clustering algorithms is enhanced with a limited amount of supervision in the form of labels or constraints on the data. A previous system used Clustering Guided Hybrid support vector machine based Sparse Structural Learning (CGHSSL) for feature selection. However, it does not produce satisfactory accuracy. This research proposes a clustering-guided, Convolution Neural Network (CNN) based sparse structural learning clustering algorithm. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm is used to learn the cluster labels of the input samples more accurately while guiding feature selection at the same time. Concurrently, the CNN also predicts cluster labels using the hidden structure shared by the various features. The parameters of the CNN are then optimized with the Multi-objective Bee Colony (MBO) algorithm, which can unravel feature correlations to render more consistent outcomes. Row-wise sparse models are then balanced to yield a model suited to feature selection. This semi-supervised algorithm is used to select important features from the Leukemia1 dataset more efficiently, so the dataset size is reduced significantly. In addition, the proposed Semi Supervised Clustering-Guided Sparse Structural Learning (SSCGSSL) technique is used to increase clustering performance. The experimental results show that the proposed system achieves better performance compared with the existing system in terms of Accuracy, Entropy, Purity, Normalized Mutual Information (NMI) and F-measure.

Classification of Sentinel-2 Images Utilizing Abundance Representation

Proceedings ◽

10.3390/ecrs-2-05141 ◽

2018 ◽

Vol 2 (7) ◽

pp. 328 ◽

Cited By ~ 6

Author(s):

Eleftheria Mylona ◽

Vassiliki Daskalopoulou ◽

Olga Sykioti ◽

Konstantinos Koutroumbas ◽

Athanasios Rontogiannis

Keyword(s):

Clustering Algorithm ◽

Nearest Neighbor ◽

Unsupervised Classification ◽

Bare Soil ◽

Spectral Unmixing ◽

Support Vector ◽

Endmember Extraction ◽

Bayes Algorithm ◽

Sentinel 2

This paper deals with (both supervised and unsupervised) classification of multispectral Sentinel-2 images, utilizing the abundance representation of the pixels of interest. The latter pixel representation uncovers the hidden structured regions that are not often available in the reference maps. Additionally, it encourages class distinctions and bolsters accuracy. The adopted methodology, which has been successfully applied to hyperspectral data, involves two main stages: (I) the determination of the pixel’s abundance representation; and (II) the employment of a classification algorithm applied to the abundance representations. More specifically, stage (I) incorporates two key processes, namely (a) endmember extraction, utilizing spectrally homogeneous regions of interest (ROIs); and (b) spectral unmixing, which hinges upon the endmember selection. The adopted spectral unmixing process assumes the linear mixing model (LMM), where each pixel is expressed as a linear combination of the endmembers. The pixel’s abundance vector is estimated via a variational Bayes algorithm that is based on a suitably defined hierarchical Bayesian model. The resulting abundance vectors are then fed to stage (II), where two off-the-shelf supervised classification approaches (namely nearest neighbor (NN) classification and support vector machines (SVM)), as well as an unsupervised classification process (namely the online adaptive possibilistic c-means (OAPCM) clustering algorithm), are adopted. Experiments are performed on a Sentinel-2 image acquired for a specific region of the Northern Pindos National Park in north-western Greece containing water, vegetation and bare soil areas. The experimental results demonstrate that the ad-hoc classification approaches utilizing abundance representations of the pixels outperform those utilizing the spectral signatures of the pixels in terms of accuracy.
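
The linear mixing model mentioned in this abstract is commonly written as below. The notation is an assumption introduced here for illustration: $\mathbf{y}$ is the observed pixel spectrum, $\mathbf{e}_i$ are the $p$ endmember spectra (the columns of $\mathbf{E}$), $\mathbf{a}$ is the abundance vector and $\mathbf{n}$ is additive noise. The non-negativity and sum-to-one constraints are the usual ones in the unmixing literature and may differ from those used in the paper.

```latex
\[
\mathbf{y} \;=\; \sum_{i=1}^{p} a_i\,\mathbf{e}_i + \mathbf{n}
\;=\; \mathbf{E}\,\mathbf{a} + \mathbf{n},
\qquad a_i \ge 0, \quad \sum_{i=1}^{p} a_i = 1 .
\]
```

Under this model, the abundance vector $\mathbf{a}$ estimated for each pixel is the representation passed to the classifiers in stage (II), in place of the raw spectrum $\mathbf{y}$.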

Predicting Stock Exchange using Supervised Learning Algorithms

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a4144.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 4081-4090

Keyword(s):

Machine Learning ◽

Random Forest ◽

Stock Market ◽

Supervised Learning ◽

Nearest Neighbor ◽

Market Price ◽

Real Life ◽

Support Vector ◽

K Nearest Neighbor ◽

Future Value

The stock market price trend is one of the brightest areas in the field of computer science, economics, finance, administration, etc. The stock market forecast is an attempt to determine the future value of the equity traded on a financial transaction with another financial system. The current work describes the prediction of a stock using Machine Learning. The adoption of machine learning and artificial intelligence techniques to predict the prices of the stock is a growing trend. More and more researchers invest their time every day in coming up with techniques that can further improve the accuracy of the stock prediction model. This paper is mainly concerned with the best model to predict the stock market value. While considering the various techniques and variables that can be taken into account, we discovered five models which are based on supervised learning techniques, i.e., Support Vector Machine (SVM), Random Forest, K-Nearest Neighbor (KNN), Bernoulli Naïve Bayes. The empirical results show that SVC performs the best for large datasets and that Random Forest and Naïve Bayes are the best for small datasets. The successful prediction for the stock will be a great asset for the stock market institutions and will provide real-life solutions to the problems that stock investors face.

Automatic road sign detection and recognition based on neural network

10.21203/rs.3.rs-408446/v1 ◽

2021 ◽

Author(s):

Redouan Lahmyed ◽

Mohamed El Ansari ◽

Zakaria Kerkaou

Keyword(s):

Clustering Algorithm ◽

Color Space ◽

Local Binary Patterns ◽

Support Vector ◽

Traffic Sign ◽

Initial Image ◽

Dbscan Clustering ◽

Road Sign ◽

Sign Detection ◽

Detection And Recognition

Road sign detection and recognition is an integral part of intelligent transportation systems (ITS). It increases safety by reminding the driver of the current condition of the route, such as notices, bans, limitations and other valuable driving information. This paper describes a novel system for automatic detection and recognition of road signs, which is achieved in two main steps. First, the initial image is pre-processed using the DBSCAN clustering algorithm. The clustering is performed based on color information, and the generated clusters are segmented using an artificial neural network (ANN) classifier. The resulting ROIs are then filtered based on their aspect ratio and size to retain only significant ones. Then, a shape-based classification is performed using ANN as classifier and HDSO as feature to detect the circular, rectangular and triangular shapes. Second, a hybrid feature is defined to recognize the ROIs detected in the first step. It involves a combination of the so-called GLBP-Color, which is an extension of the classical gradient local binary patterns (GLBP) feature to the RGB color space, and the local self-similarity (LSS) feature. ANN, AdaBoost and support vector machine (SVM) have been tested with the introduced hybrid feature, and the first is selected as it outperforms the other two. The proposed method has been tested in outdoor scenes, using a collection of common datasets well known in the traffic sign community (GTSRB, GTSDB and STS). The results demonstrate the effectiveness of our method when compared to recent state-of-the-art methods.

Support Vector Machines for Class Imbalance Rail Data Classification with Bootstrapping-based Over-Sampling and Under-Sampling

IFAC Proceedings Volumes ◽

10.3182/20140824-6-za-1003.00794 ◽

2014 ◽

Vol 47 (3) ◽

pp. 8756-8761 ◽

Cited By ~ 8

Author(s):

Ali Zughrat ◽

M. Mahfouf ◽

Y.Y. Yang ◽

S. Thornton

Keyword(s):

Support Vector Machines ◽

Class Imbalance ◽

Data Classification ◽

Support Vector ◽

Vector Machines ◽

Under Sampling
