Automated Classification of Evidence of Respect in the Communication through Twitter

Krzysztof Fiok; Waldemar Karwowski; Edgar Gutierrez; Tameika Liciaga; Alessandro Belmonte; Rocco Capobianco

doi:10.3390/app11031294


Applied Sciences ◽

10.3390/app11031294 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1294

Author(s):

Krzysztof Fiok ◽

Waldemar Karwowski ◽

Edgar Gutierrez ◽

Tameika Liciaga ◽

Alessandro Belmonte ◽

...

Keyword(s):

Social Media ◽

Text Analysis ◽

Hate Speech ◽

Automatic Detection ◽

Data Sets ◽

Automated Classification ◽

Data Set ◽

Analysis Methods ◽

Textual Data

Volcanoes of hate and disrespect erupt in societies, often with fatal consequences. To address this negative phenomenon, scientists have struggled to understand and analyze its roots and the language expressions described as hate speech. As a result, it is now possible to automatically detect and counter hate speech in textual data that spreads rapidly, for example, in social media. Recently, however, another approach to tackling the roots of disrespect was proposed; it is based on the concept of promoting positive behavior instead of only penalizing hate and disrespect. In our study, we followed this approach and discovered that it is hard to find any textual data sets or studies discussing automatic detection of respectful behaviors and their textual expressions. Therefore, we decided to contribute what is probably one of the first human-annotated data sets that allows for supervised training of text analysis methods for automatic detection of respectful messages. By choosing a data set of tweets that already possessed sentiment annotations, we were also able to discuss the correlation of sentiment and respect. Finally, we provide a comparison of recent machine and deep learning text analysis methods and their performance, which allowed us to demonstrate that automatic detection of respectful messages in social media is feasible.


Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Journal Of Big Data ◽

10.1186/s40537-021-00488-w ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Yahya Albalawi ◽

Jim Buckley ◽

Nikola S. Nikolov

Keyword(s):

Social Media ◽

Deep Learning ◽

Comprehensive Evaluation ◽

Classification Problem ◽

Data Sets ◽

Word Embeddings ◽

Data Set ◽

Lower Accuracy ◽

Health Related ◽

The Impact

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier with F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model with F1 score of 75.2% and accuracy of 90.7% compared to F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.
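The abstract reports F1 score and accuracy side by side, and the two can diverge noticeably when one class is much rarer than the other. The following is a minimal sketch of how both metrics follow from confusion-matrix counts; the function name `binary_metrics` and the counts used are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 100 positive tweets among 1,000.
metrics = binary_metrics(tp=60, fp=20, fn=40, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the many true negatives keep accuracy high, while F1, which ignores true negatives, is considerably lower.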


Classification of jujube defects in small data sets based on transfer learning

Neural Computing and Applications ◽

10.1007/s00521-021-05715-2 ◽

2021 ◽

Author(s):

Jianping Ju ◽

Hong Zheng ◽

Xiaohang Xu ◽

Zhongyuan Guo ◽

Zhaohui Zheng ◽

...

Keyword(s):

Transfer Learning ◽

Loss Function ◽

Training Model ◽

Parameter Distribution ◽

Test Accuracy ◽

Small Data ◽

Data Sets ◽

Data Set ◽

Small Data Sets

Although convolutional neural networks have achieved success in the field of image classification, there are still challenges in the field of agricultural product quality sorting such as machine vision-based jujube defects detection. The performance of jujube defect detection mainly depends on the feature extraction and the classifier used. Due to the diversity of the jujube materials and the variability of the testing environment, the traditional method of manually extracting the features often fails to meet the requirements of practical application. In this paper, a jujube sorting model in small data sets based on convolutional neural network and transfer learning is proposed to meet the actual demand of jujube defects detection. Firstly, the original images collected from the actual jujube sorting production line were pre-processed, and the data were augmented to establish a data set of five categories of jujube defects. The original CNN model is then improved by embedding the SE module and using the triplet loss function and the center loss function to replace the softmax loss function. Finally, the depth pre-training model on the ImageNet image data set was used to conduct training on the jujube defects data set, so that the parameters of the pre-training model could fit the parameter distribution of the jujube defects image, and the parameter distribution was transferred to the jujube defects data set to complete the transfer of the model and realize the detection and classification of the jujube defects. The classification results are visualized by heatmap through the analysis of classification accuracy and confusion matrix compared with the comparison models. The experimental results show that the SE-ResNet50-CL model optimizes the fine-grained classification problem of jujube defect recognition, and the test accuracy reaches 94.15%. The model has good stability and high recognition accuracy in complex environments.
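The abstract names the triplet loss and the center loss without defining them. As a rough illustration, the sketch below implements the commonly used textbook forms of both on plain Python lists. The use of unsquared Euclidean distance, the margin value, and the function names are assumptions; the paper's exact formulation may differ.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    Pulls same-class embeddings together and pushes different-class
    embeddings at least `margin` further away.
    """
    d_pos = math.sqrt(squared_distance(anchor, positive))
    d_neg = math.sqrt(squared_distance(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)


def center_loss(features, labels, centers):
    """0.5 * sum of squared distances between features and their class centers."""
    return 0.5 * sum(
        squared_distance(x, centers[y]) for x, y in zip(features, labels)
    )


# Hypothetical 2-D embeddings, for illustration only.
easy = triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0])   # negative already far away
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])   # negative closer than positive
compact = center_loss([[1.0, 0.0], [0.0, 2.0]], [0, 1],
                      {0: [0.0, 0.0], 1: [0.0, 0.0]})
print(easy, hard, compact)
```

Both terms act on the embedding space rather than on class probabilities, which is why they are often considered for fine-grained problems where classes look alike.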


Automated Classification of Fake News Spreaders to Break the Misinformation Chain

Information ◽

10.3390/info12060248 ◽

2021 ◽

Vol 12 (6) ◽

pp. 248

Author(s):

Simone Leonardi ◽

Giuseppe Rizzo ◽

Maurizio Morisio

Keyword(s):

Social Media ◽

Network Architecture ◽

Diffusion Mechanism ◽

Automated Classification ◽

Fake News ◽

Neural Network Architecture ◽

User Classification ◽

The Social ◽

Network Topologies

In social media, users spread misinformation easily and without fact checking. In principle, they do not have malicious intent, but their sharing leads to a socially dangerous diffusion mechanism. The motivations behind this behavior have been linked to a wide variety of social and personal outcomes, but these users are not easily identified. The existing solutions show that the analysis of linguistic signals in social media posts, combined with the exploration of network topologies, is effective in this field. These applications have some limitations, such as focusing solely on the fake news shared and not understanding the typology of the user spreading it. In this paper, we propose a computational approach to extract features from the social media posts of these users to recognize who is a fake news spreader for a given topic. Thanks to the CoAID dataset, we start the analysis with 300 K users engaged on an online micro-blogging platform; then, we enriched the dataset by extending it to a collection of more than 1 M share actions and their associated posts on the platform. The proposed approach processes a batch of Twitter posts authored by users of the CoAID dataset and turns them into a high-dimensional matrix of features, which are then exploited by a deep neural network architecture based on transformers to perform user classification. We prove the effectiveness of our work by comparing the precision, recall, and F1 score of our model with different configurations and with a baseline classifier. We obtained an F1 score of 0.8076, an improvement of 4% over the state of the art.


Sequential Sampling for Estimation and Classification of the Incidence of Hop Powdery Mildew II: Cone Sampling

Plant Disease ◽

10.1094/pdis-91-8-1013 ◽

2007 ◽

Vol 91 (8) ◽

pp. 1013-1020 ◽

Cited By ~ 8

Author(s):

David H. Gent ◽

William W. Turechek ◽

Walter F. Mahaffee

Keyword(s):

Powdery Mildew ◽

Binomial Distribution ◽

Disease Incidence ◽

Sequential Sampling ◽

Model Construction ◽

Data Sets ◽

Data Set ◽

Sampling Plans ◽

Simulated Sampling

Sequential sampling models for estimation and classification of the incidence of powdery mildew (caused by Podosphaera macularis) on hop (Humulus lupulus) cones were developed using parameter estimates of the binary power law derived from the analysis of 221 transect data sets (model construction data set) collected from 41 hop yards sampled in Oregon and Washington from 2000 to 2005. Stop lines, models that determine when sufficient information has been collected to estimate mean disease incidence and stop sampling, for sequential estimation were validated by bootstrap simulation using a subset of 21 model construction data sets and simulated sampling of an additional 13 model construction data sets. Achieved coefficient of variation (C) approached the prespecified C as the estimated disease incidence, [Formula: see text], increased, although achieving a C of 0.1 was not possible for data sets in which [Formula: see text] < 0.03 with the number of sampling units evaluated in this study. The 95% confidence interval of the median difference between [Formula: see text] of each yard (achieved by sequential sampling) and the true p of the original data set included 0 for all 21 data sets evaluated at levels of C of 0.1 and 0.2. For sequential classification, operating characteristic (OC) and average sample number (ASN) curves of the sequential sampling plans obtained by bootstrap analysis and simulated sampling were similar to the OC and ASN values determined by Monte Carlo simulation. Correct decisions of whether disease incidence was above or below prespecified thresholds (pt) were made for 84.6 or 100% of the data sets during simulated sampling when stop lines were determined assuming a binomial or beta-binomial distribution of disease incidence, respectively. However, the higher proportion of correct decisions obtained by assuming a beta-binomial distribution of disease incidence required, on average, sampling 3.9 more plants per sampling round to classify disease incidence compared with the binomial distribution. Use of these sequential sampling plans may aid growers in deciding the order in which to harvest hop yards to minimize the risk of a condition called “cone early maturity” caused by late-season infection of cones by P. macularis. Also, sequential sampling could aid in research efforts, such as efficacy trials, where many hop cones are assessed to determine disease incidence.
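To see why a coefficient of variation of 0.1 becomes hard to reach at low incidence, a simplified calculation helps. The sketch below assumes a plain binomial model, in which the variance of the estimated incidence is p(1 - p)/(nN), so reaching a target C requires N = (1 - p)/(n p C^2) sampling units. This is not the binary power law stop line used in the study, and the function name and the value of `n_per_unit` are hypothetical.

```python
import math


def units_needed_binomial(p, c, n_per_unit=1):
    """Sampling units N needed so that SE(p_hat) / p <= c under a binomial model.

    Var(p_hat) = p * (1 - p) / (n_per_unit * N), hence
    N = (1 - p) / (n_per_unit * p * c**2).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if c <= 0.0:
        raise ValueError("c must be positive")
    return math.ceil((1.0 - p) / (n_per_unit * p * c ** 2))


# Required effort grows sharply as incidence falls (illustrative values).
for incidence in (0.30, 0.10, 0.03):
    print(incidence, units_needed_binomial(incidence, c=0.1))
```

Because aggregated disease patterns inflate the variance relative to the binomial case, the effort implied by this simplified model is best read as an optimistic lower bound.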


ON A KNOWLEDGE-BASED APPROACH TO THE CLASSIFICATION OF MOBILE LASER SCANNING POINT CLOUDS

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xlii-4-343-2018 ◽

2018 ◽

Vol XLII-4 ◽

pp. 343-349 ◽

Cited By ~ 1

Author(s):

M. Lemmens

Keyword(s):

Point Cloud ◽

Laser Scanning ◽

Decision Rules ◽

Point Clouds ◽

Bench Mark ◽

Automated Classification ◽

Initial Experiment ◽

Data Set ◽

Knowledge Based

A knowledge-based system exploits the knowledge that a human expert uses to complete a complex task, by means of a database containing decision rules and an inference engine. As early as the early nineties, knowledge-based systems were proposed for automated image classification. Lack of success caused the initial interest and enthusiasm to fade, the same fate that befell neural networks at that time. Today the latter enjoy a steady revival. This paper aims to demonstrate that a knowledge-based approach to automated classification of mobile laser scanning point clouds has promising prospects. An initial experiment exploiting only two features, height and reflectance value, resulted in an overall accuracy of 79% for the Paris-rue-Madame point cloud benchmark data set.
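A rule base with an inference step can be very small when only two features are involved. The sketch below shows one possible shape for such a classifier: an ordered list of rules over height and reflectance, where the first matching rule decides the label. The class labels, thresholds, and names (`Rule`, `RULES`, `classify_point`) are entirely hypothetical and are not the rules used in the paper.

```python
from dataclasses import dataclass

INF = float("inf")


@dataclass(frozen=True)
class Rule:
    """One decision rule: a label plus half-open ranges for both features."""
    label: str
    min_height: float
    max_height: float
    min_reflectance: float
    max_reflectance: float

    def matches(self, height, reflectance):
        return (self.min_height <= height < self.max_height
                and self.min_reflectance <= reflectance < self.max_reflectance)


# Hypothetical rule base: heights in metres above ground, reflectance in [0, 1].
RULES = [
    Rule("road_marking", -INF, 0.2, 0.6, INF),   # low and bright
    Rule("ground", -INF, 0.2, -INF, 0.6),        # low and dull
    Rule("facade", 3.0, INF, -INF, INF),         # high, any reflectance
]


def classify_point(height, reflectance, rules=RULES):
    """Minimal inference engine: the first rule that matches wins."""
    for rule in rules:
        if rule.matches(height, reflectance):
            return rule.label
    return "unclassified"


print(classify_point(0.1, 0.8))
print(classify_point(5.0, 0.4))
print(classify_point(1.5, 0.5))
```

Keeping the rules as data rather than as nested conditionals makes it straightforward to inspect, reorder, or extend the knowledge base without touching the inference code.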


News Classification Using Machine Learning

International Journal on Recent and Innovation Trends in Computing and Communication ◽

10.17762/ijritcc.v9i5.5464 ◽

2021 ◽

Vol 9 (5) ◽

pp. 23-27

Author(s):

SHWETA MAHAJAN

Keyword(s):

Machine Learning ◽

Social Media ◽

Performance Improvement ◽

Vital Role ◽

Learning Approach ◽

Entertainment Education ◽

Meaningful Information ◽

Textual Data ◽

Machine Learning Approach

There are plenty of social media webpages and platforms producing textual data. These different kinds of data need to be analysed and processed to extract meaningful information from the raw data. Classification of text plays a vital role in the extraction of useful information, along with summarization and text retrieval. In our work we consider the problem of news classification using a machine learning approach. We currently have a news-related dataset containing various types of data such as entertainment, education, sports, and politics. On this data we apply classification algorithms with several word vectorizing techniques in order to get the best result. The results have been compared on different parameters such as precision, recall, F1 score, and accuracy for performance improvement.
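The abstract does not say which classifier or vectorizer was used. As one illustration of the general pipeline (vectorize words, then classify), the sketch below pairs bag-of-words counts with a multinomial Naive Bayes classifier using add-one smoothing, written with the standard library only. The choice of Naive Bayes, the class name `NaiveBayesText`, and the toy headlines are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; real pipelines would do more."""
    return text.lower().split()


class NaiveBayesText:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)          # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue                               # skip unseen words
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set.
train_texts = [
    "team wins the final match",
    "player scores late goal",
    "parliament passes new budget",
    "minister announces election date",
]
train_labels = ["sports", "sports", "politics", "politics"]

model = NaiveBayesText().fit(train_texts, train_labels)
print(model.predict("late goal wins match"))
print(model.predict("election budget"))
```

Swapping the raw counts for TF-IDF weights, or the classifier for another algorithm, changes only one stage of the pipeline, which is what makes comparisons across vectorizing techniques practical.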


What Your Tweets Tell Us About You: Identity, Ownership and Privacy of Twitter Data

International Journal of Digital Curation ◽

10.2218/ijdc.v7i1.224 ◽

2012 ◽

Vol 7 (1) ◽

pp. 174-197 ◽

Cited By ~ 9

Author(s):

Heather Small ◽

Kristine Kasianovitz ◽

Ronald Blanford ◽

Ina Celaya

Keyword(s):

Social Media ◽

Social Networking Sites ◽

Data Sets ◽

Data Set ◽

Social Media Data ◽

Twitter Data ◽

Other Information ◽

Rich Data ◽

Additional Value ◽

Media Data

Social networking sites and other social media have enabled new forms of collaborative communication and participation for users, and created additional value as rich data sets for research. Research based on accessing, mining, and analyzing social media data has risen steadily over the last several years and is increasingly multidisciplinary; researchers from the social sciences, humanities, computer science and other domains have used social media data as the basis of their studies. The broad use of this form of data has implications for how curators address preservation, access and reuse for an audience with divergent disciplinary norms related to privacy, ownership, authenticity and reliability. In this paper, we explore how the characteristics of the Twitter platform, coupled with an ambiguous and evolving understanding of privacy in networked communication, and divergent disciplinary understandings of the resulting data, combine to create complex issues for curators trying to ensure broad-based and ethical reuse of Twitter data. We provide a case study of a specific data set to illustrate how data curators can engage with the topics and questions raised in the paper. While some initial suggestions are offered to librarians and other information professionals who are beginning to receive social media data from researchers, our larger goal is to stimulate discussion and prompt additional research on the curation and preservation of social media data.


Identifying Victims of Human Sex Trafficking in Online Ads

Encyclopedia of Criminal Activities and the Deep Web ◽

10.4018/978-1-5225-9715-5.ch034 ◽

2020 ◽

pp. 497-517

Author(s):

Jessica Whitney ◽

Marisa Hultgren ◽

Murray Eugene Jennex ◽

Aaron Elkins ◽

Eric Frost

Keyword(s):

Social Media ◽

Knowledge Management ◽

Sex Trafficking ◽

Data Sets ◽

Data Set ◽

Viable Approach ◽

Unexpected Outcome ◽

Analyze Data

Social media and the interactive web have enabled human traffickers to lure victims and then sell them faster and in greater safety than ever before. However, these same tools have also aided investigators in their search for victims and criminals. A prototype was designed to identify victims of human sex trafficking by analyzing online ads. The prototype used a knowledge management approach to generate actionable intelligence by applying a set of strong filters based on an ontology to identify potential victims. The prototype was used to analyze data sets generated from online ads. An unexpected outcome of the second data set was the discovery of the use of emojis, which led to an expanded ontology. The final prototype used the expanded ontology to identify potential victims. The results of applying the prototypes suggest a viable approach to identifying victims of human sex trafficking in online ads.


Multilabel Classification of Hate Speech and Abusive Words on Indonesian Twitter Social Media

2020 International Conference on Data Science and Its Applications (ICoDSA) ◽

10.1109/icodsa50139.2020.9212962 ◽

2020 ◽

Author(s):

Rahmat Hendrawan ◽

Adiwijaya ◽

Said Al Faraby

Keyword(s):

Social Media ◽

Hate Speech ◽

Multilabel Classification


Massive Data Classification of Neural Responses

Advances in Medical Technologies and Clinical Practice - Biomedical Diagnostics and Clinical Technologies ◽

10.4018/978-1-60566-280-0.ch009 ◽

2010 ◽

pp. 278-298

Author(s):

Pedro Tomás ◽

IST TU Lisbon ◽

Aleksandar Ilic ◽

Leonel Sousa

Keyword(s):

Execution Time ◽

Data Parallelism ◽

Data Sets ◽

Neural Responses ◽

Neuronal Responses ◽

Data Set ◽

Web Interfaces ◽

Mass Classification ◽

Neuronal Code

When analyzing the neuronal code, neuroscientists usually perform extracellular recordings of neuronal responses (spikes). Since the microelectrodes used to perform these recordings are much larger than the cells, each microelectrode records responses from multiple neurons. Thus, the obtained response must be classified and evaluated in order to identify how many neurons were recorded and to assess which neuron generated each spike. A platform for the mass classification of neuronal responses is proposed in this chapter, employing data parallelism to speed up the classification. The platform is built in a modular way, supporting multiple web interfaces, different back-end environments for parallel computing, and different algorithms for spike classification. Experimental results on the proposed platform show that even for an unbalanced data set of neuronal responses the execution time was reduced by about 45%. For balanced data sets, the platform may achieve a reduction in execution time equal to the inverse of the number of back-end computational elements.
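The difference between balanced and unbalanced data sets can be illustrated with a simple cost model. The sketch below reads the final claim as execution time falling to roughly 1/N of the serial time when work divides evenly over N back-end elements, and assumes that the busiest element determines the overall time. The function names and task costs are hypothetical, and communication overhead is ignored.

```python
def parallel_time(task_costs, assignment, n_workers):
    """Time for a data-parallel run: the most heavily loaded worker finishes last."""
    loads = [0.0] * n_workers
    for cost, worker in zip(task_costs, assignment):
        loads[worker] += cost
    return max(loads)


def time_ratio(task_costs, assignment, n_workers):
    """Parallel time as a fraction of serial time (lower is better)."""
    serial = sum(task_costs)
    return parallel_time(task_costs, assignment, n_workers) / serial


def round_robin(n_tasks, n_workers):
    """Assign task i to worker i mod n_workers."""
    return [i % n_workers for i in range(n_tasks)]


# Balanced: eight equal tasks over four workers.
balanced_costs = [1.0] * 8
balanced = time_ratio(balanced_costs, round_robin(8, 4), 4)

# Unbalanced: one recording dominates the workload.
unbalanced_costs = [5.0, 1.0, 1.0, 1.0]
unbalanced = time_ratio(unbalanced_costs, round_robin(4, 4), 4)

print(f"balanced:   {balanced:.3f} of serial time")
print(f"unbalanced: {unbalanced:.3f} of serial time")
```

In this model the balanced case reaches exactly 1/4 of the serial time on four workers, while the unbalanced case is limited by its largest task no matter how many workers are added.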
