Identifying vulgarity in Bengali social media textual content

2021 ◽  
Vol 7 ◽  
pp. e665
Author(s):  
Salim Sazzed

The presence of abusive and vulgar language in social media has become an issue of increasing concern in recent years. However, the prevalence and identification of vulgar language have remained largely unexplored in low-resource languages such as Bengali. In this paper, we provide the first comprehensive analysis of the presence of vulgarity in Bengali social media content. We develop two benchmark corpora consisting of 7,245 reviews collected from YouTube and manually annotate them into vulgar and non-vulgar categories. The manual annotation reveals that vulgar and swear words are ubiquitous in Bengali social media content, with a share ranging from 20% to 34% across the two corpora. To automatically identify vulgarity, we employ various approaches, such as classical machine learning (CML) classifiers, a Stochastic Gradient Descent (SGD) optimizer, a deep learning (DL) based architecture, and lexicon-based methods. We find that the swear/vulgar lexicon, although small in size, is effective at identifying vulgar language due to the high presence of some swear terms in Bengali social media. We observe that the performance of machine learning (ML) classifiers is affected by the class distribution of the dataset. The DL-based BiLSTM (Bidirectional Long Short-Term Memory) model yields the highest recall scores for identifying vulgarity in both datasets (i.e., in both original and class-balanced settings). In addition, the analysis reveals that vulgarity is highly correlated with negative sentiment in social media comments.
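
A lexicon-based method of the kind mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the lexicon entries are hypothetical placeholders, and whitespace tokenization with punctuation stripping is an assumption made for this sketch.

```python
import string

# Hypothetical placeholder entries; a real lexicon would hold Bengali swear/vulgar terms.
VULGAR_LEXICON = {"vulgarterm1", "vulgarterm2"}

# ASCII punctuation plus the danda (U+0964), stripped from token edges only.
_PUNCT = string.punctuation + "\u0964"


def tokenize(comment):
    """Split on whitespace, strip edge punctuation, lowercase, drop empty tokens."""
    tokens = (tok.strip(_PUNCT).lower() for tok in comment.split())
    return [tok for tok in tokens if tok]


def is_vulgar(comment, lexicon=VULGAR_LEXICON):
    """Flag a comment as vulgar if any of its tokens appears in the lexicon."""
    return any(tok in lexicon for tok in tokenize(comment))
```

Splitting on whitespace rather than on a word-character pattern keeps Bengali vowel signs attached to their words. A small lexicon can work under this scheme when a few terms account for most vulgar comments, which is the effect the abstract reports.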

Author(s):  
T. V. Divya ◽  
Barnali Gupta Banik

Fake news detection on job advertisements has attracted the attention of many researchers over the past decade. Classifiers such as Support Vector Machine (SVM), XGBoost, and Random Forest (RF) are widely used to distinguish fake from real job advertisement posts on social media. A Bi-Directional Long Short-Term Memory (Bi-LSTM) classifier is used to learn word representations in a lower-dimensional vector space and to learn the significant words or terms revealed through a word embedding algorithm. The Bi-LSTM classifier detects fake and real job posts from online social media, and its performance is evaluated accordingly. Performance metrics such as Precision, Recall, F1-score, and Accuracy are used to assess how effectively fraudulent job posts are detected. The outcome indicates the effectiveness and prominence of the features for detecting false news.
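
The four metrics named above follow from the confusion counts of a binary classifier. The sketch below uses the standard definitions; the function name `binary_metrics` and the convention that a truthy label marks the positive (fake) class are assumptions made for illustration, not details from the paper.

```python
def binary_metrics(labels, predictions):
    """Compute precision, recall, F1, and accuracy for binary labels.

    A truthy value marks the positive class (assumed here to be "fake").
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # true positives
    fp = sum(1 for y, p in pairs if not y and p)      # false positives
    fn = sum(1 for y, p in pairs if y and not p)      # false negatives
    tn = sum(1 for y, p in pairs if not y and not p)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```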


2021 ◽  
Vol 4 (1) ◽  
pp. 121-128
Author(s):  
A Iorliam ◽  
S Agber ◽  
MP Dzungwe ◽  
DK Kwaghtyo ◽  
S Bum

Social media provides opportunities for individuals to communicate anonymously and to express hateful feelings and opinions from the comfort of their rooms. This anonymity has become a shield for many individuals or groups who use social media to express deep hatred for other individuals or groups, tribes, races, religions, genders, and belief systems. In this study, a comparative analysis is performed using Long Short-Term Memory and Convolutional Neural Network deep learning techniques for hate speech classification. The analysis demonstrates that the Long Short-Term Memory classifier achieved an accuracy of 92.47%, while the Convolutional Neural Network classifier achieved an accuracy of 92.74%. These results show that deep learning techniques can effectively distinguish hate speech from normal speech.


2019 ◽  
Vol 3 (3) ◽  
pp. 357-363
Author(s):  
Soffa Zahara ◽  
Sugianto ◽  
M. Bahril Ilmiddafiq

Long Short-Term Memory (LSTM) is known as an optimized Recurrent Neural Network (RNN) architecture that overcomes the RNN's inability to retain memory over long periods. Among machine learning networks, LSTM is also notable as a suitable choice for time-series prediction. Machine learning is currently a prominent topic in economics, and numerous studies on predicting macroeconomic and microeconomic indicators are emerging. The inflation rate is used for decision making by central banks as well as the private sector. In Indonesia, the CPI (Consumer Price Index) is one of the best-practice inflation indicators, alongside the Wholesale Price Index and the Gross Domestic Product (GDP). Since CPI data can indicate the direction of the next inflation move, we built a CPI prediction model using the LSTM method. The model input consists of 28 staple-price variables from Surabaya, and the output is the CPI value; the entire prediction model was developed on the Amazon Web Services (AWS) cloud. To improve accuracy, we used several optimization algorithms, namely Stochastic Gradient Descent (SGD), Root Mean Square Propagation (RMSProp), Adaptive Gradient (AdaGrad), Adaptive Moment (Adam), Adadelta, Nesterov Adam (Nadam), and Adamax. The results indicate that Nadam yields an RMSE of 4,008, lower than the other algorithms, which indicates that it is the most accurate optimization algorithm for predicting the CPI value.
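
For reference, the RMSE used to compare the optimizers is conventionally defined as below. The notation is an assumption made here rather than taken from the paper: $n$ is the number of test observations, $y_i$ the observed CPI value, and $\hat{y}_i$ the model's prediction.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

A lower RMSE means the predictions lie closer to the observed values on average, which is why the optimizer with the smallest RMSE is read as the most accurate.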


2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Buzhou Tang ◽  
Jianglu Hu ◽  
Xiaolong Wang ◽  
Qingcai Chen

Social media in medicine, where patients can express their personal treatment experiences via personal computers and mobile devices, usually contains plenty of useful medical information, such as adverse drug reactions (ADRs); mining this information from social media has attracted increasing attention from researchers. In this study, we propose a deep neural network (called LSTM-CRF) combining long short-term memory (LSTM) neural networks (a type of recurrent neural network) and conditional random fields (CRFs) to recognize ADR mentions from social media in medicine, and we investigate the effects of three factors on ADR mention recognition. The three factors are as follows: (1) representation for continuous and discontinuous ADR mentions: two novel representations, that is, “BIOHD” and “Multilabel,” are compared; (2) subject of posts: each post has a subject (i.e., drug here); and (3) external knowledge bases. Experiments conducted on a benchmark corpus, that is, CADEC, show that LSTM-CRF achieves a better F-score than CRF; “Multilabel” is better at representing continuous and discontinuous ADR mentions than “BIOHD”; and both subjects of comments and external knowledge bases are individually beneficial to ADR mention recognition. To the best of our knowledge, this is the first study to investigate deep neural networks for mining continuous and discontinuous ADRs from social media.


Information ◽  
2020 ◽  
Vol 11 (6) ◽  
pp. 312 ◽  
Author(s):  
Asma Baccouche ◽  
Sadaf Ahmed ◽  
Daniel Sierra-Sosa ◽  
Adel Elmaghraby

Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the Natural Language Processing (NLP) perspective, Deep Learning models are a good alternative for classifying text after it has been preprocessed. In particular, Long Short-Term Memory (LSTM) networks are among the models that perform well on binary and multi-label text classification problems. In this paper, an approach merging two different data sources, one intended for Spam in social media posts and the other for Fraud classification in emails, is presented. We designed a multi-label LSTM model and trained it on the joint datasets, including text with common bigrams extracted from each independent dataset. The experimental results show that our proposed model is capable of identifying malicious text regardless of the source. The LSTM model trained with the merged dataset outperforms the models trained independently on each dataset.

