A Heterogeneous Ensemble Learning Framework for Spam Detection in Social Networks with Imbalanced Data

Chensu Zhao; Yang Xin; Xuefeng Li; Yixian Yang; Yuling Chen

doi:10.3390/app10030936

A Heterogeneous Ensemble Learning Framework for Spam Detection in Social Networks with Imbalanced Data

Applied Sciences ◽

10.3390/app10030936 ◽

2020 ◽

Vol 10 (3) ◽

pp. 936 ◽

Cited By ~ 3

Author(s):

Chensu Zhao ◽

Yang Xin ◽

Xuefeng Li ◽

Yixian Yang ◽

Yuling Chen

Keyword(s):

Social Networks ◽

Ensemble Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Spam Detection ◽

Real World Data ◽

Learning Framework ◽

The Impact ◽

Base Module

The popularity of social networks provides people with many conveniences, but their rapid growth has also attracted many attackers. In recent years, the malicious behavior of social network spammers has seriously threatened the information security of ordinary users. To reduce this threat, many researchers have mined the behavior characteristics of spammers and have obtained good results by applying machine learning algorithms to identify spammers in social networks. However, most of these studies overlook the class imbalance present in real-world data. In this paper, we propose a heterogeneous stacking-based ensemble learning framework to ameliorate the impact of class imbalance on spam detection in social networks. The proposed framework consists of two main components, a base module and a combining module. In the base module, we adopt six different base classifiers and utilize this classifier diversity to construct new ensemble input members. In the combining module, we introduce cost-sensitive learning into deep neural network training. By setting different costs for misclassification and dynamically adjusting the weights of the prediction results of the base classifiers, we can integrate the input members and aggregate the classification results. The experimental results show that our framework effectively improves the spam detection rate on imbalanced datasets.
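The combining step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' network: the logistic combiner `meta_predict`, the cost values `cost_fn=5.0` and `cost_fp=1.0`, and the six base-classifier probabilities are all assumptions made for the example. It shows how a cost-sensitive loss penalizes a missed spammer more heavily than a false alarm.

```python
import math

def meta_predict(base_probs, weights, bias=0.0):
    """Combine base-classifier spam probabilities with a logistic combiner."""
    z = bias + sum(w * p for w, p in zip(weights, base_probs))
    return 1.0 / (1.0 + math.exp(-z))

def cost_sensitive_log_loss(y_true, p_spam, cost_fn=5.0, cost_fp=1.0, eps=1e-12):
    """Binary cross-entropy where errors on spam (y=1) are scaled by cost_fn
    and errors on legitimate accounts (y=0) by cost_fp (assumed values)."""
    p = min(max(p_spam, eps), 1.0 - eps)
    if y_true == 1:
        return -cost_fn * math.log(p)
    return -cost_fp * math.log(1.0 - p)

# Hypothetical outputs of six base classifiers for one account.
base_probs = [0.2, 0.4, 0.1, 0.3, 0.6, 0.2]
weights = [1.0] * 6  # illustrative, untrained combiner weights
p = meta_predict(base_probs, weights, bias=-3.0)

missed_spam = cost_sensitive_log_loss(1, 0.1)  # spam scored as legitimate
false_alarm = cost_sensitive_log_loss(0, 0.9)  # legitimate scored as spam
print(f"combined p(spam)={p:.3f}, missed spam loss={missed_spam:.3f}, "
      f"false alarm loss={false_alarm:.3f}")
```

In a trained combiner, the weights would be learned by minimizing this loss, so the higher cost on the minority (spam) class pulls the decision boundary toward detecting more spammers.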


The impact of data difficulty factors on classification of imbalanced and concept drifting data streams

Knowledge and Information Systems ◽

10.1007/s10115-021-01560-w ◽

2021 ◽

Author(s):

Dariusz Brzezinski ◽

Leandro L. Minku ◽

Tomasz Pewinski ◽

Jerzy Stefanowski ◽

Artur Szumaczuk

Keyword(s):

Data Streams ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Local Data ◽

Real World Data ◽

Minority Class ◽

The Impact ◽

Concept Drifts

Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and the presence of unsafe types of examples. Despite often being present in real-world data, the interactions between concept drifts and local data difficulty factors have not yet been investigated in concept drifting data streams. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations on the predictions of representative online classifiers. Experimental results reveal the high influence of the newly considered factors and their local drifts, as well as differences in existing classifiers’ reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address challenges posed by imbalanced data streams.
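To make the idea of "unsafe types of examples" concrete, the sketch below labels each minority example by the class of its nearest neighbours. The abstract does not say how types are assigned; the k=5 neighbourhood and the safe/borderline/rare/outlier thresholds follow one labelling used in the static imbalanced-data literature and are an assumption here, as are the function name and the toy points.

```python
import math

def minority_example_types(points, labels, minority=1, k=5):
    """Label each minority example by how many of its k nearest neighbours
    share its class (the thresholds below assume k=5)."""
    types = {}
    for i, (x, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # All other examples, ordered by Euclidean distance to x.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(x, points[j]),
        )
        same = sum(labels[j] == minority for j in others[:k])
        if same >= 4:
            types[i] = "safe"
        elif same >= 2:
            types[i] = "borderline"
        elif same == 1:
            types[i] = "rare"
        else:
            types[i] = "outlier"
    return types

# Hypothetical data: a compact minority cluster (indices 0-5), one isolated
# minority point (index 6), and majority points surrounding it (indices 7-11).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.2, 0.0),
          (10.0, 10.0),
          (10.1, 10.0), (9.9, 10.0), (10.0, 10.1), (10.0, 9.9), (10.1, 10.1)]
labels = [1] * 6 + [1] + [0] * 5
types = minority_example_types(points, labels)
print(types)
```

In a streaming setting, the proportion of each type could be tracked over a sliding window, so that a shift from mostly safe to mostly borderline or rare examples would indicate a local drift even when the global imbalance ratio stays constant.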


Scalable kernel-based SVM classification algorithm on imbalance air quality data for proficient healthcare

Complex & Intelligent Systems ◽

10.1007/s40747-021-00435-5 ◽

2021 ◽

Author(s):

Shwet Ketu ◽

Pramod Kumar Mishra

Keyword(s):

Air Pollution ◽

Air Quality ◽

Class Imbalance ◽

Imbalanced Data ◽

Classification Algorithm ◽

Quality Data ◽

Pollution Level ◽

Classification Problems ◽

Chi Square ◽

The Impact

In the last decade, air pollution levels have changed drastically, making air pollution a critical environmental issue that must be handled carefully to support proficient healthcare. Reducing the impact of air pollution on human health is possible only if the data are correctly classified. Many classification problems suffer from class imbalance, and learning from imbalanced data remains a challenging task for which researchers have proposed various solutions over time. In this paper, we focus on handling the imbalanced class distribution in such a way that the classification algorithm does not compromise its performance. The proposed algorithm is based on the adjusting kernel scaling (AKS) method for dealing with multi-class imbalanced datasets. The selection of the kernel function has been evaluated with the help of weighting criteria and the chi-square test. All experimental evaluation has been performed on the sensor-based Indian Central Pollution Control Board (CPCB) dataset. The proposed algorithm achieves the highest accuracy, 99.66%, among the classification algorithms compared, i.e., AdaBoost (59.72%), Multi-Layer Perceptron (95.71%), GaussianNB (80.87%), and SVM (96.92%). The results of the proposed algorithm are also better than those of existing methods in the literature. These results indicate that the proposed algorithm handles class imbalance problems efficiently while improving performance. Thus, accurate classification of air quality through the proposed algorithm will be useful for improving existing preventive policies and will also help enhance emergency response capabilities in the worst pollution situations.
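Because the comparison above is reported as overall accuracy, it can help to look at per-class recall as well when the classes are imbalanced. The sketch below uses made-up labels and predictions (not the CPCB data) to show how a high accuracy can coexist with a class that is never detected; the class names and counts are purely illustrative.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted samples / samples of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Hypothetical air-quality labels: 90 "good", 8 "moderate", 2 "severe".
y_true = ["good"] * 90 + ["moderate"] * 8 + ["severe"] * 2
# A classifier that is right on every "good" sample, on half of the
# "moderate" samples, and never predicts "severe".
y_pred = ["good"] * 90 + ["moderate"] * 4 + ["good"] * 4 + ["good"] * 2

acc = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
macro = macro_recall(y_true, y_pred)
print(f"accuracy={acc:.2f}, per-class recall={recalls}, macro recall={macro:.2f}")
```

Here the accuracy is 0.94 while the macro-averaged recall is only 0.50, because the rarest class contributes almost nothing to accuracy.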


Classification of Imbalanced Malaria Disease Using Naïve Bayesian Algorithm

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.7.10978 ◽

2018 ◽

Vol 7 (2.7) ◽

pp. 786 ◽

Cited By ~ 1

Author(s):

T Sajana ◽

M R.Narasingarao

Keyword(s):

Comparative Study ◽

Class Imbalance ◽

Machine Learning Algorithms ◽

Bayesian Algorithm ◽

Naive Bayesian ◽

Class Distribution ◽

Naïve Bayesian ◽

R Programming ◽

The Impact

Malaria is rampant in semi-urban and non-urban areas, especially in resource-poor developing countries. In datasets for diseases such as malaria and dengue, there are usually more negative patients (non-occurrence of the disease) than patients suffering from the disease (positive cases). Developing a model-based decision support system with such unbalanced datasets is a cause for concern, and a model that predicts the disease accurately is necessary. Classifying imbalanced malaria data has become a crucial task in the medical application domain because most conventional machine learning algorithms perform very poorly when classifying whether a patient is affected by malaria. In imbalanced data, the majority (unaffected) class samples dominate the minority (affected) class samples, leading to class imbalance. To overcome the class imbalance problem, balancing the data samples is the best solution, which produces better accuracy in classifying minority samples. The aim of this research is to present a comparative study of classifying imbalanced malaria data using the Naïve Bayesian classifier in different environments, namely Weka and the R language. We present a clinical descriptive study of 165 patients of different age groups, collected at medical wards of Narasaraopet from 2014-17. The Synthetic Minority Oversampling Technique (SMOTE) was used to balance the class distribution, and we then performed a comparative study on the dataset using the Naïve Bayesian algorithm on both platforms. Of the balanced data, 70% was used to train the Naïve Bayesian algorithm and the rest was used to test the model in both the Weka and R programming environments. Experimental results indicate that classification of the malaria data in the Weka environment achieves the highest accuracy, 88.5%, compared with 87.5% for the Naïve Bayesian algorithm in the R programming language. The impact of vector-borne disease is very high in medical applications. Predicting diseases like malaria is the need of the hour, and this is possible only with a suitable model for a given dataset. Hence, we have developed a model using the Naïve Bayesian algorithm for the current research.
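The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration rather than the Weka or R implementation used in the study; the function name `smote`, the value `k=3`, the seed, and the two-feature points are assumptions.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x, excluding x itself.
        neighbours = sorted(
            (m for j, m in enumerate(minority) if j != i),
            key=lambda m: math.dist(x, m),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Hypothetical two-feature minority (positive) cases.
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote(minority, n_new=6, k=3)
print(new_points)
```

One design point worth noting: when oversampling is applied before the train/test split, synthetic points derived from test cases can end up in the training set, so a common alternative is to split first and oversample only the training portion.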


A Selective Ensemble Learning Framework for ECG-Based Heartbeat Classification with Imbalanced Data

2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2018.8621523 ◽

2018 ◽

Cited By ~ 1

Author(s):

Hongwei Ge ◽

Keyi Sun ◽

Liang Sun ◽

Mingde Zhao ◽

Chunguo Wu

Keyword(s):

Ensemble Learning ◽

Imbalanced Data ◽

Heartbeat Classification ◽

Learning Framework ◽

Selective Ensemble


An Ensemble Learning Framework for Online Web Spam Detection

2013 12th International Conference on Machine Learning and Applications ◽

10.1109/icmla.2013.15 ◽

2013 ◽

Cited By ~ 1

Author(s):

Cailing Dong ◽

Bin Zhou

Keyword(s):

Ensemble Learning ◽

Spam Detection ◽

Learning Framework ◽

Web Spam ◽

Online Web


Exploiting Correlation Subspace to Predict Heterogeneous Cross-Project Defects

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194016710017 ◽

2016 ◽

Vol 26 (09n10) ◽

pp. 1571-1580 ◽

Cited By ~ 6

Author(s):

Ming Cheng ◽

Guoqing Wu ◽

Hongyan Wan ◽

Guoan You ◽

Mengting Yuan ◽

...

Keyword(s):

Class Imbalance ◽

Imbalanced Data ◽

Feature Space ◽

Support Vector ◽

Class Imbalance Problem ◽

Classifier Design ◽

Imbalance Problem ◽

Project Data ◽

The Impact ◽

Cross Project

Cross-project defect prediction trains a prediction model using historical data from source projects and applies the model to target projects. Most previous efforts assumed that the cross-project data share the same metric set, meaning that both the metrics used and the size of the metric set are the same. However, this assumption may not hold in practical scenarios. In addition, software defect datasets suffer from the class-imbalance problem, which makes it harder for the learner to predict defects. In this paper, we advance canonical correlation analysis by deriving a joint feature space for associating cross-project data. We also propose a novel support vector machine algorithm that incorporates the correlation transfer information into classifier design for cross-project prediction. Moreover, we take different misclassification costs into consideration to bias the classifier toward labeling a module as defective, alleviating the impact of imbalanced data. The experimental results show that our method is more effective than state-of-the-art methods.
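The effect of unequal misclassification costs can be illustrated with a cost-based decision threshold. This is a generic cost-sensitive technique swapped in for illustration, not the paper's SVM formulation; the cost values and function names are assumptions. With zero cost for correct decisions, labeling a module defective minimizes expected cost whenever its estimated defect probability is at least cost_fp / (cost_fp + cost_fn).

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'defective' has lower expected cost.

    Predicting 'defective' costs (1 - p) * cost_fp in expectation, predicting
    'clean' costs p * cost_fn, so 'defective' wins when p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

def predict_defective(p_defective, cost_fp=1.0, cost_fn=4.0):
    """Label a module defective if that minimizes expected misclassification cost."""
    return p_defective >= cost_threshold(cost_fp, cost_fn)

# Assumed costs: missing a defect is four times as costly as a false alarm.
threshold = cost_threshold(1.0, 4.0)
print(threshold, predict_defective(0.3), predict_defective(0.3, 1.0, 1.0))
```

With these assumed costs the threshold drops from 0.5 to 0.2, so a module with an estimated defect probability of 0.3 is flagged as defective even though it would be labeled clean under equal costs.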


Cost-based heterogeneous learning framework for real-time spam detection in social networks with expert decisions

IEEE Access ◽

10.1109/access.2021.3098799 ◽

2021 ◽

pp. 1-1

Author(s):

Jaeun Choi ◽

Chunmi Jeon

Keyword(s):

Social Networks ◽

Real Time ◽

Spam Detection ◽

Learning Framework ◽

Heterogeneous Learning


ANALYZING THE IMPACT OF RESAMPLING METHOD FOR IMBALANCED DATA TEXT IN INDONESIAN SCIENTIFIC ARTICLES CATEGORIZATION

BACA JURNAL DOKUMENTASI DAN INFORMASI ◽

10.14203/j.baca.v41i2.702 ◽

2020 ◽

Vol 41 (2) ◽

pp. 133

Author(s):

Ariani Indrawati ◽

Hendro Subagyo ◽

Andre Sihombing ◽

Wagiyah Wagiyah ◽

Sjaeful Afandi

Keyword(s):

Machine Learning ◽

Comparative Research ◽

Scientific Journal ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Unstructured Data ◽

Resampling Methods ◽

Classifier Performance ◽

Resampling Method ◽

The Impact

Extremely skewed data in artificial intelligence, machine learning, and data mining often yield misleading results, because machine learning algorithms are designed to work best with balanced data. However, imbalanced data are common in real situations. The most popular technique for handling imbalanced data is resampling the dataset to adjust the number of instances in the majority and minority classes toward a balanced distribution. Many resampling techniques (oversampling, undersampling, or a combination of both) have been proposed and continue to be proposed. Resampling techniques may increase or decrease classifier performance. Comparative research on resampling methods for structured data has been widely carried out, but studies that compare resampling methods on unstructured data are very rare. This raises many questions, one of which is whether these methods can be applied to unstructured data such as text, which has large dimensions and very diverse characteristics. To understand how different resampling techniques affect the learning of classifiers for imbalanced text data, we perform an experimental analysis using various resampling methods with several classification algorithms to classify articles in the Indonesian Scientific Journal Database (ISJD). The experiment shows that resampling techniques on imbalanced text data generally improve classifier performance, but the improvement is not significant because text data are very diverse and high-dimensional.
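The two simplest resampling strategies mentioned above, random undersampling and random oversampling, can be sketched as follows. The abstract does not list which resampling methods were compared, so this is only an illustration of the general idea; the function names, category labels, and document identifiers are assumptions.

```python
import random
from collections import Counter

def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    groups = {}
    for s, y in zip(samples, labels):
        groups.setdefault(y, []).append(s)
    return groups

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class matches the size of the rarest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(items) for items in groups.values())
    pairs = [(s, y) for y, items in groups.items() for s in rng.sample(items, target)]
    return [s for s, _ in pairs], [y for _, y in pairs]

def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples so every class matches the size of the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(items) for items in groups.values())
    pairs = []
    for y, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((s, y) for s in items + extra)
    return [s for s, _ in pairs], [y for _, y in pairs]

# Hypothetical article identifiers with an 8:2 category imbalance.
docs = [f"article_{i}" for i in range(10)]
labels = ["physics"] * 8 + ["law"] * 2
_, under_labels = random_undersample(docs, labels)
_, over_labels = random_oversample(docs, labels)
under_counts = Counter(under_labels)
over_counts = Counter(over_labels)
print(dict(under_counts), dict(over_counts))
```

Undersampling discards majority documents and oversampling repeats minority ones; neither adds new vocabulary, which is one plausible reason the gains on high-dimensional text can be modest.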


Probability Density Machine: A New Solution of Class Imbalance Learning

Scientific Programming ◽

10.1155/2021/7555587 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Ruihan Cheng ◽

Longfei Zhang ◽

Shiqi Wu ◽

Sen Xu ◽

Shang Gao ◽

...

Keyword(s):

Probability Density ◽

Predictive Model ◽

Data Distribution ◽

Class Imbalance ◽

Imbalanced Data ◽

Training Data ◽

Probability Density Estimation ◽

Imbalance Learning ◽

Class Imbalance Learning ◽

The Impact

Class imbalance learning (CIL) is an important branch of machine learning because classification models generally have difficulty learning from imbalanced data, while skewed data distributions frequently occur in real-world applications. In this paper, we introduce a novel CIL solution called the Probability Density Machine (PDM). First, in the context of the Gaussian Naive Bayes (GNB) predictive model, we analyze in theory why an imbalanced data distribution degrades the performance of the predictive model and conclude that the impact of class imbalance is associated only with the prior probability and not with the conditional probability of the training data. Then, in this context, we show the rationality of several traditional CIL techniques. Furthermore, we point out the drawback of combining GNB with these traditional CIL techniques. Next, drawing on the idea of K-nearest neighbors probability density estimation (KNN-PDE), we propose the PDM, an improved GNB-based CIL algorithm. Finally, we conduct experiments on many class-imbalanced data sets, and the proposed PDM algorithm shows promising results.
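The claim that imbalance acts through the prior rather than the class-conditional density can be illustrated with a one-feature Gaussian Naive Bayes decision. This sketch is not the PDM algorithm; the means, standard deviations, priors, and the query point are assumed values chosen so that the prior alone flips the prediction.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gnb_predict(x, params, priors):
    """Pick the class maximizing prior * class-conditional density.

    params maps class -> (mean, std); priors maps class -> P(class).
    """
    scores = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in params}
    return max(scores, key=scores.get)

# Assumed class-conditional densities; identical in both runs below.
params = {"majority": (0.0, 1.0), "minority": (2.0, 1.0)}
x = 1.2  # closer to the minority mean than to the majority mean

skewed = gnb_predict(x, params, {"majority": 0.9, "minority": 0.1})
balanced = gnb_predict(x, params, {"majority": 0.5, "minority": 0.5})
print(skewed, balanced)
```

The likelihoods favour the minority class at x = 1.2, yet the 9:1 prior outweighs them, so only the run with equal priors predicts the minority class.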


Effects of Class Imbalance Using Machine Learning Algorithms

International Journal of Applied Evolutionary Computation ◽

10.4018/ijaec.2021010101 ◽

2021 ◽

Vol 12 (1) ◽

pp. 1-17

Author(s):

Swati V. Narwane ◽

Sudhir D. Sawarkar

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Class Imbalance ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Training Data ◽

Model Accuracy ◽

Data Set ◽

Class Distribution ◽

Imbalance Problem

Class imbalance is a major hurdle for machine learning-based systems. The data set is the backbone of machine learning and must be studied in order to handle class imbalance. The purpose of this paper is to investigate the effect of class imbalance on data sets. The proposed methodology determines the model accuracy for a given class distribution. To find possible solutions, the behaviour of an imbalanced data set was investigated. The study considers two case studies in which the data set is divided into class distributions ranging from balanced to unbalanced. The data sets were tested with training and test data using standard machine learning algorithms. Model accuracy for each class distribution was measured with the training data set. Further, the built model was tested on each binary class individually. Results show that, to improve system performance, it is essential to work on class imbalance problems. The study concludes that the system produces biased results due to the majority class. In the future, the multiclass imbalance problem can be studied using advanced algorithms.
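The bias toward the majority class can be seen with a trivial baseline that always predicts the majority label. The sketch below uses assumed class counts, not the paper's case studies, and reports how the baseline's accuracy rises with the imbalance while its recall on the minority class stays at zero.

```python
def majority_baseline(n_majority, n_minority):
    """Accuracy and minority recall of a model that always predicts the majority class."""
    total = n_majority + n_minority
    accuracy = n_majority / total  # every majority sample is classified correctly
    minority_recall = 0.0          # no minority sample is ever predicted
    return accuracy, minority_recall

# Assumed class distributions ranging from balanced to heavily imbalanced.
for n_maj, n_min in [(50, 50), (80, 20), (95, 5), (99, 1)]:
    acc, rec = majority_baseline(n_maj, n_min)
    print(f"{n_maj}:{n_min} -> accuracy {acc:.2f}, minority recall {rec:.2f}")
```

Any model evaluated by accuracy alone should therefore be compared against this baseline for the class distribution at hand.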
