Multi-Class Imbalance in Text Classification: A Feature Engineering Approach to Detect Cyberbullying in Twitter

Bandeh Ali Talpur; Declan O’Sullivan

doi:10.3390/informatics7040052

Informatics ◽

10.3390/informatics7040052 ◽

2020 ◽

Vol 7 (4) ◽

pp. 52

Author(s):

Bandeh Ali Talpur ◽

Declan O’Sullivan

Keyword(s):

Binary Classification ◽

Class Imbalance ◽

Age Group ◽

Learning Classifier ◽

Semantic Orientation ◽

Medium Level ◽

Twitter Account ◽

Feature Based ◽

Multi Class Classification ◽

High Level

Twitter enables millions of active users to send and read concise messages on the internet every day. Yet some people use Twitter to propagate violent and threatening messages, resulting in cyberbullying. Previous research has focused on whether cyberbullying behavior exists or not in a tweet (binary classification). In this research, we developed a model for detecting the severity of cyberbullying in a tweet. The developed model is feature-based: it uses features from the content of a tweet to train a machine learning classifier that classifies tweets as non-cyberbullied or as low-, medium-, or high-level cyberbullied. In this study, we introduced pointwise semantic orientation as a new input feature, along with predicted features (gender, age, and personality type) and Twitter API features. Results from experiments with our proposed framework in a multi-class setting are promising with respect to Kappa (84%), classifier accuracy (93%), and F-measure (92%). Overall, 40% of the classifiers increased performance in comparison with baseline approaches. Our analysis shows that the features with the highest odds ratios are: for low-level severity, the age group of 19–22 years and users with <1 year of Twitter account activation; for medium-level severity, neuroticism, the age group of 23–29 years, and having been a Twitter user for one to two years; and for high-level severity, neuroticism, extraversion, and the number of times a tweet has been favorited by other users. We believe that this research, using a multi-class classification approach, provides a step forward in identifying severity at different levels (low, medium, high) when the content of a tweet is classified as cyberbullied. Lastly, the current study focused only on the Twitter platform; other social network platforms can be investigated using the same approach to detect cyberbullying severity patterns.
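The abstract reports a Kappa statistic alongside accuracy, which matters under class imbalance because Kappa discounts agreement expected by chance. The following is a minimal sketch of Cohen's kappa for a multi-class labeling; the function name and the severity labels in the usage note are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Fraction of items on which the two labelings agree.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected if both labelings were independent draws
    # from their own class frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Both labelings use one and the same class; kappa is undefined,
        # and this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with true labels `["none", "none", "low", "low", "medium", "high"]` and predictions `["none", "none", "low", "medium", "medium", "high"]`, accuracy is 5/6 while kappa is 7/9, since a quarter of the agreement would be expected by chance.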

A Novel Deep Neural Network-Based Approach to Measure Scholarly Research Dissemination Using Citations Network

Applied Sciences ◽

10.3390/app112210970 ◽

2021 ◽

Vol 11 (22) ◽

pp. 10970

Author(s):

Naif Radi Aljohani ◽

Ayman Fayoumi ◽

Saeed-Ul Hassan

Keyword(s):

Neural Network ◽

State Of The Art ◽

Binary Classification ◽

Class Imbalance ◽

Citation Network ◽

Graph Visualization ◽

Scholarly Research ◽

Research Dissemination ◽

Reference Corpus ◽

Feature Based

We investigated scientific research dissemination by analyzing publication and citation data, on the premise that not all citations are equally important. In contrast to existing state-of-the-art models that employ feature-based techniques to measure scholarly research dissemination between multiple entities, our model implements a convolutional neural network (CNN) with fastText-based pre-trained embedding vectors and uses only the citation context as its input to distinguish between important and non-important citations. Moreover, we investigate focal loss and class-weight methods to address the inherent class imbalance problems in citation classification datasets. Using a dataset of 10 K annotated citation contexts, we achieved an accuracy of 90.7% along with a 90.6% F1-score in the case of binary classification. Finally, we present a case study to measure the comprehensiveness of our deployed model on a dataset of 3100 K citations taken from the ACL Anthology Reference Corpus. We employed the open-source graph visualization tool Gephi to analyze various aspects of the citation network graphs for each respective citation behavior.
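The abstract names focal loss and class weights as remedies for class imbalance without giving details. The sketch below shows the standard binary focal loss and an inverse-frequency class-weight heuristic in plain Python; the parameter defaults, function names and the weighting formula are assumptions for illustration and are not confirmed to match the authors' setup.

```python
import math
from collections import Counter


def focal_loss(prob_positive, label, gamma=2.0, alpha=None):
    """Binary focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    prob_positive is the predicted probability of class 1; label is 0 or 1.
    alpha=None disables class balancing (weight 1 for both classes).
    """
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    if alpha is None:
        alpha_t = 1.0
    else:
        alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)**gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

With `gamma=0` and `alpha=None` the focal loss reduces to ordinary cross-entropy; raising `gamma` down-weights easy examples so that the rarer, harder class contributes more to training.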

A Study on Multi Class Classification from Breast Cancer Images using Ensemble Network and Transfer Learning

Recent Patents on Engineering ◽

10.2174/1872212114999201109205421 ◽

2020 ◽

Vol 14 ◽

Author(s):

Lahari Tipirneni ◽

Rizwan Patan

Keyword(s):

Breast Cancer ◽

Neural Network ◽

Convolutional Neural Network ◽

Binary Classification ◽

Disease Diagnosis ◽

Feature Descriptors ◽

Histopathological Images ◽

Viable Approach ◽

Multi Class Classification

Millions of deaths all over the world are caused by breast cancer every year. It has become the most common type of cancer in women. Early detection improves prognosis and increases the chance of survival. Automating the classification using Computer-Aided Diagnosis (CAD) systems can make the diagnosis less prone to errors. Multi-class and binary classification of breast cancer are challenging problems. Convolutional neural network architectures extract specific feature descriptors from images, which cannot represent different types of breast cancer. This leads to false positives in classification, which is undesirable in disease diagnosis. This paper presents an ensemble convolutional neural network for multi-class and binary classification of breast cancer. The feature descriptors from each network are combined to produce the final classification. Histopathological images are taken from the publicly available BreakHis dataset and classified into 8 classes. The proposed ensemble model can perform better than the methods proposed in the literature. The results show that the proposed model could be a viable approach for breast cancer classification.

A multilayer multimodal detection and prediction model based on explainable artificial intelligence for Alzheimer’s disease

Scientific Reports ◽

10.1038/s41598-021-82098-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Shaker El-Sappagh ◽

Jose M. Alonso ◽

S. M. Riazul Islam ◽

Ahmad M. Sultan ◽

Kyung Sup Kwak

Keyword(s):

Alzheimer’s Disease ◽

Cross Validation ◽

Disease Risk ◽

Binary Classification ◽

Fuzzy Rule ◽

Large Set ◽

Detection Model ◽

Multi Class Classification ◽

Clinical Measures

Alzheimer’s disease (AD) is the most common type of dementia. Its diagnosis and progression detection have been intensively studied. Nevertheless, research studies often have little effect on clinical practice, mainly for the following reasons: (1) most studies depend mainly on a single modality, especially neuroimaging; (2) diagnosis and progression detection are usually studied separately as two independent problems; and (3) current studies concentrate mainly on optimizing the performance of complex machine learning models, while disregarding their explainability. As a result, physicians struggle to interpret these models and find them hard to trust. In this paper, we carefully develop an accurate and interpretable AD diagnosis and progression detection model. This model provides physicians with accurate decisions along with a set of explanations for every decision. Specifically, the model integrates 11 modalities of 1048 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) real-world dataset: 294 cognitively normal, 254 stable mild cognitive impairment (MCI), 232 progressive MCI, and 268 AD. It is a two-layer model with random forest (RF) as the classifier algorithm. In the first layer, the model carries out a multi-class classification for the early diagnosis of AD patients. In the second layer, the model applies binary classification to detect possible MCI-to-AD progression within three years from a baseline diagnosis. The performance of the model is optimized with key markers selected from a large set of biological and clinical measures. Regarding explainability, we provide, for each layer, global and instance-based explanations of the RF classifier by using the SHapley Additive exPlanations (SHAP) feature attribution framework. In addition, we implement 22 explainers based on decision trees and fuzzy rule-based systems to provide complementary justifications for every RF decision in each layer. Furthermore, these explanations are represented in natural language form to help physicians understand the predictions. The designed model achieves a cross-validation accuracy of 93.95% and an F1-score of 93.94% in the first layer, while it achieves a cross-validation accuracy of 87.08% and an F1-score of 87.09% in the second layer. The resulting system is not only accurate, but also trustworthy, accountable, and medically applicable, thanks to the provided explanations, which are broadly consistent with each other and with the AD medical literature. The proposed system can help to enhance the clinical understanding of AD diagnosis and progression processes by providing detailed insights into the effect of different modalities on the disease risk.

Urban network of China from the perspective of population mobility: Three-dimensional co-occurrence of nodes and links

Environment and Planning A Economy and Space ◽

10.1177/0308518x21997818 ◽

2021 ◽

pp. 0308518X2199781

Author(s):

Xinyue Luo ◽

Mingxing Chen

Keyword(s):

Three Dimensional ◽

Geographical Distance ◽

Two Dimensional ◽

Population Mobility ◽

Urban Network ◽

Urban Networks ◽

Low Level ◽

Medium Level ◽

High Level ◽

External Connections

The nodes and links in urban networks are usually presented in a two-dimensional (2D) view. The co-occurrence of nodes and links can also be rendered from a three-dimensional (3D) perspective to reveal the characteristics of the urban network more intuitively. Our results show that the external connections of high-level cities are mainly affected by the level of cities (nodes) and less affected by geographical distance, while those of medium-level cities are affected by the interaction between the level of cities (nodes) and geographical distance. The external connections of low-level cities are greatly restricted by geographical distance.

Confidence interval for micro-averaged F1 and macro-averaged F1 scores

Applied Intelligence ◽

10.1007/s10489-021-02635-5 ◽

2021 ◽

Author(s):

Kanae Takahashi ◽

Kouji Yamamoto ◽

Aya Kuchiba ◽

Tatsuki Koyama

Keyword(s):

Binary Classification ◽

Classification Problem ◽

Classification Problems ◽

Summary Measure ◽

Medical Field ◽

Predictive Values ◽

Binary Classification Problem ◽

Multi Class Classification ◽

Sensitivity Specificity ◽

Measures Of Performance

Binary classification problems are common in the medical field, and we often use sensitivity, specificity, accuracy, and negative and positive predictive values as measures of performance of a binary predictor. In computer science, a classifier is usually evaluated with precision (positive predictive value) and recall (sensitivity). As a single summary measure of a classifier’s performance, the F1 score, defined as the harmonic mean of precision and recall, is widely used in the context of information retrieval and information extraction evaluation since it possesses favorable characteristics, especially when the prevalence is low. Some statistical methods for inference have been developed for the F1 score in binary classification problems; however, they have not been extended to the problem of multi-class classification. There are three types of F1 scores, and the statistical properties of these F1 scores have hardly ever been discussed. We propose methods based on the large-sample multivariate central limit theorem for estimating F1 scores with confidence intervals.
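To make the two quantities in the title concrete, here is a minimal sketch of the point estimates only: micro-averaged F1 pools true positives, false positives and false negatives over all classes, while macro-averaged F1 (in the per-class-average definition used here) is the unweighted mean of per-class F1 scores. The confidence-interval method proposed in the paper is not reproduced; the function name is an illustrative assumption.

```python
def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1, per_class_f1) for single-label predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    macro = sum(per_class.values()) / len(classes)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro, per_class
```

For true labels `[0, 0, 0, 1, 1, 2]` and predictions `[0, 0, 1, 1, 1, 2]`, the per-class F1 scores are 0.8, 0.8 and 1.0, so macro F1 is about 0.867, while micro F1 is 5/6 and coincides with accuracy because each item carries exactly one label.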

Serological evidence for Babesia canis infection of horses and an endemic focus of B. caballi in Hungary

Acta Veterinaria Hungarica ◽

10.1556/avet.55.2007.4.8 ◽

2007 ◽

Vol 55 (4) ◽

pp. 491-500 ◽

Cited By ~ 12

Author(s):

S. Hornok ◽

Renate Edelhofer ◽

G. Földvári ◽

Anja Joachim ◽

R. Farkas

Keyword(s):

Competitive Elisa ◽

Babesia Canis ◽

Age Group ◽

Serological Evidence ◽

Babesia Caballi ◽

Seropositivity Rate ◽

Blood Samples ◽

Endemic Focus ◽

Medium Level ◽

Level 1

In order to evaluate the seroconversion of horses to Babesia caballi and B. canis in Hungary, blood samples were collected from 371 animals at 23 different locations in the country. The presence of antibodies to B. caballi was screened with a competitive ELISA. All 29 positive samples came from one region (the Hortobágy). The prevalence of infection did not correlate with sex, and reached 100% in the age group of 2–5 years. Babesia canis-specific antibodies were demonstrated by IFAT in 6.74% of animals kept in 7 regions. The titres were low or medium level (1:40 to 1:160), indicating that the horses had previously been exposed to this piroplasm, but their infection must have been limited. The highest seropositivity rate was observed in the age group of 3–4 years, and males (stallions and geldings) were significantly more frequently infected than females. However, neither B. caballi nor B. canis could be identified in the peripheral blood samples of infected horses by PCR. Since most of the B. caballi-positive horses remained negative in the B. canis IFAT, whereas seroconversion solely to B. canis was detected in several regions of the country, serological cross-reaction between the two species can be discounted. This is the first serological evidence of horses being naturally infected with B. canis, supporting the view that piroplasms are less host specific than previously thought.

Quality of colostral passive immunity and pattern of serum protein fluctuation in newborn calves

Scientia Agricola ◽

10.1590/s0103-90162003000300006 ◽

2003 ◽

Vol 60 (3) ◽

pp. 453-456 ◽

Cited By ~ 6

Author(s):

Patricia Pauletti ◽

Raul Machado Neto ◽

Irineu Umberto Packer ◽

Raul Dantas D'Arce ◽

Rosana Bessi

Keyword(s):

Total Protein ◽

Statistical Design ◽

Extended Period ◽

Experimental Period ◽

Passive Immunity ◽

Serum Total Protein ◽

Non Linear ◽

Medium Level ◽

Group 2 ◽

High Level

Immunity acquired by newborn animals is known as passive immunity, and for ruminants, antibody acquisition depends on the ingestion and absorption of adequate amounts of immunoglobulins from colostrum. This study relates different initial levels of acquired passive protection to serum total protein (TP) and immunoglobulin G (IgG). Serum immunoglobulin concentration and total protein were evaluated for female Holstein calves in the first sixty days of life. Animals were separated into three groups according to their initial level of passive immunity: group 1, animals with a low level of passive immunity (below 20 mg mL-1); group 2, animals with a medium level (between 20 and 30 mg mL-1); and group 3, animals with a high level (above 30 mg mL-1). Serum total protein was determined through the biuret method and IgG was determined by radial immunodiffusion. Data were analyzed as a completely randomized, split-plot statistical design. Fluctuation of the variables along the experimental period was determined through non-linear regression by the DUD method (PROC NLIN - Non Linear SAS). Animals with low antibody acquisition started to produce antibodies earlier, reflecting a compensatory synthesis. On the other hand, animals having adequate levels exhibited an extended period of immunoglobulin catabolism, and the beginning of the endogenous phase was delayed. Regardless of initial levels, the fluctuations in IgG contents occurred around adequate physiological concentrations, ranging from 20 to 25 mg mL-1.

A Fast and Effective Method to Identify Relevant Sets of Variables in Complex Systems

Mathematics ◽

10.3390/math9091022 ◽

2021 ◽

Vol 9 (9) ◽

pp. 1022

Author(s):

Gianluca D’Addese ◽

Martina Casari ◽

Roberto Serra ◽

Marco Villani

Keyword(s):

Complex Systems ◽

Gene Regulatory Networks ◽

Regulatory Networks ◽

Computational Cost ◽

Graph Analysis ◽

The Past ◽

Medium Level ◽

Micro Level ◽

Gene Regulatory ◽

High Level

In many complex systems one observes the formation of medium-level structures, whose detection could allow a high-level description of the dynamical organization of the system itself, and thus a better understanding of it. We have developed in the past a powerful method to achieve this goal, which, however, incurs a heavy computational cost in several real-world cases. In this work we introduce a modified version of our approach, which reduces the computational burden. The design of the new algorithm allowed the realization of an original suite of methods able to work simultaneously at the micro level (that of the binary relationships of the single variables) and at the meso level (the identification of dynamically relevant groups). We apply this suite to a particularly relevant case, in which we look for the dynamic organization of a gene regulatory network when it is subject to knock-outs. The approach combines information theory, graph analysis, and an iterated sieving algorithm in order to describe rather complex situations. Its application allowed us to derive some general observations on the dynamical organization of gene regulatory networks, and to observe interesting characteristics in an experimental case.

Detection of COVID-19 cases through X-ray images using hybrid deep neural network

World Journal of Engineering ◽

10.1108/wje-10-2020-0529 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Rajit Nair ◽

Santosh Vishwakarma ◽

Mukesh Soni ◽

Tejas Patel ◽

Shubham Joshi

Keyword(s):

Real Time ◽

Binary Classification ◽

Data Set ◽

Content Type ◽

X Ray ◽

Internal Parameters ◽

Average Accuracy ◽

Chest X Ray ◽

Multi Class Classification

Purpose: The 2019 coronavirus disease (COVID-19), which first appeared in December 2019 in the city of Wuhan, China, rapidly spread around the world and became a pandemic. It has had a devastating impact on daily lives, public health and the global economy. Positive cases must be identified as soon as possible to avoid further spread of the disease and to allow swift care of affected patients. The need for supportive diagnostic instruments has increased, as no specific automated toolkits are available. Recent results from radiology imaging techniques indicate that such images provide valuable details on COVID-19. Using advanced artificial intelligence (AI) technologies together with radiological imagery can help diagnose this condition accurately and help address the lack of specialist doctors in isolated areas. In this research, a new paradigm for automatic detection of COVID-19 from raw chest X-ray images is presented. The proposed model, DarkCovidNet, is designed to provide accurate diagnostics for binary classification (COVID vs no detection) and multi-class classification (COVID vs no results vs pneumonia). The implemented model achieved an average precision of 98.46% and 91.352% for binary and multi-class classification, respectively, and an average accuracy of 98.97% and 87.868%. The DarkNet model was used in this research as the classifier for a you-only-look-once (YOLO) real-time object detection system. A total of 17 convolutional layers, with different filters on each layer, were implemented. This platform can be used by radiologists to verify their initial screening and can also be used for screening patients through the cloud.

Design/methodology/approach: This study uses the CNN-based Darknet-19 model, which serves as the platform for a real-time object detection system whose architecture is designed to detect objects in real time. This study has developed the DarkCovidNet model based on the Darknet architecture, with fewer layers and filters. Before discussing the DarkCovidNet model, we review the Darknet architecture and its functionality. Typically, the DarkNet architecture consists of 19 convolution layers and 5 max-pooling layers. Assume as a convolution layer, and as a pooling layer.

Findings: The work discussed in this paper is used to diagnose various radiology images and to develop a model that can accurately predict or classify the disease. The data set used in this work consists of COVID-19 and non-COVID-19 images taken from various sources. The deep learning model named DarkCovidNet is applied to the data set and shows significant performance in both binary and multi-class classification. In binary classification, the model has shown an average accuracy of 98.97% for the detection of COVID-19, whereas in multi-class classification it has achieved an average accuracy of 87.868% when classifying COVID-19, no detection and pneumonia.

Research limitations/implications: One of the significant limitations of this work is that a limited number of chest X-ray images were used. The number of COVID-19 patients is increasing rapidly. In the future, the model will be applied to a larger data set generated from local hospitals, and its performance on that data will be checked.

Originality/value: Deep learning technology has made significant changes in the field of AI by generating good results, especially in pattern recognition. A typical CNN structure has a convolution layer that extracts features from the input with the filters it applies, a pooling layer that reduces the size for computational performance, and a fully connected layer, which is a neural network. A CNN model is created by combining one or more such layers, and its internal parameters are adjusted to accomplish a particular task, such as classification or object recognition.

ACRIPPER: A New Associative Classification Based on RIPPER Algorithm

Journal of Information & Knowledge Management ◽

10.1142/s0219649221500131 ◽

2021 ◽

Vol 20 (01) ◽

pp. 2150013

Author(s):

Mohammed Abu-Arqoub ◽

Wael Hadi ◽

Abdelraouf Ishtaiwi

Keyword(s):

Decision Making ◽

Class Imbalance ◽

The Other ◽

Associative Classification ◽

Rule Based ◽

New Approach ◽

New Novel ◽

Accuracy Measure ◽

High Level ◽

Substantial Interest

Associative Classification (AC) classifiers are of substantial interest due to their ability to mine vast sets of rules. However, researchers over the decades have shown that a large number of these mined rules are trivial, irrelevant, redundant, and sometimes harmful, as they can cause decision-making bias. Accordingly, in our paper, we address these challenges and propose a novel AC approach based on the RIPPER algorithm, which we refer to as ACRIPPER. Our new approach combines the strength of the RIPPER algorithm with the classical AC method in order to achieve: (1) a reduction in the number of rules being mined, especially those rules that are largely insignificant; and (2) a high level of integration between the confidence and support of the rules on the one hand and the class imbalance level in the prediction phase on the other. Our experimental results, using 20 different well-known datasets, reveal that the proposed ACRIPPER significantly outperforms the well-known rule-based algorithms RIPPER and J48. Moreover, ACRIPPER significantly outperforms the current AC-based algorithms CBA, CMAR, ECBA, FACA, and ACPRISM. Finally, ACRIPPER is found to achieve the best average and ranking on the accuracy measure.
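The abstract refers to the support and confidence of mined rules. As a reminder of what these standard quantities measure for a class association rule of the form "antecedent items imply class label", here is a minimal sketch; the data layout and function name are assumptions for illustration and do not describe how ACRIPPER itself combines them with class imbalance.

```python
def rule_support_confidence(records, antecedent, label):
    """Support and confidence of the rule: antecedent items -> class label.

    records is a list of (items, class_label) pairs, where items is a set.
    support    = fraction of all records matching antecedent AND label
    confidence = fraction of antecedent-matching records that have label
    """
    if not records:
        raise ValueError("records must be non-empty")
    antecedent = set(antecedent)
    # Records whose item set contains every antecedent item.
    covered = [cls for items, cls in records if antecedent <= set(items)]
    hits = sum(1 for cls in covered if cls == label)
    support = hits / len(records)
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence
```

For four records `({"a", "b"}, "yes")`, `({"a"}, "yes")`, `({"a", "c"}, "no")` and `({"b"}, "no")`, the rule `{"a"} -> "yes"` covers three records and is right on two of them, giving a support of 0.5 and a confidence of 2/3.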
