An Effective BERT-Based Pipeline for Twitter Sentiment Analysis: A Case Study in Italian

Sensors ◽  
2021 ◽ 
Vol 21 (1) ◽  
pp. 133
Author(s):  
Marco Pota ◽  
Mirko Ventura ◽  
Rosario Catelli ◽  
Massimo Esposito

Over the last decade, industrial and academic communities have increased their focus on sentiment analysis techniques, especially as applied to tweets. State-of-the-art results have recently been achieved using language models trained from scratch on corpora made up exclusively of tweets, in order to better handle the Twitter jargon. This work introduces a different, two-step approach to Twitter sentiment analysis. First, the tweet jargon, including emojis and emoticons, is transformed into plain text, exploiting procedures that are language-independent or easily applicable to different languages. Second, the resulting tweets are classified using the language model BERT, pre-trained on plain text instead of tweets, for two reasons: (1) pre-trained models for plain text are readily available in many languages, avoiding resource- and time-consuming training of models on tweets from scratch; (2) available plain-text corpora are larger than tweet-only ones, therefore allowing better performance. A case study applying the approach to Italian is presented, together with a comparison against existing Italian solutions. The results show the effectiveness of the approach and indicate that, thanks to its methodologically general basis, it is also promising for other languages.
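The language-independent first step (mapping tweet jargon to plain text before feeding a standard pre-trained BERT) can be sketched roughly as below. The emoticon table and replacement tokens are illustrative assumptions, not the paper's actual resources, which also cover emojis.

```python
import re

# Illustrative emoticon-to-word table; the paper's actual mapping is
# larger and also covers Unicode emojis.
EMOTICON_MAP = {
    ":)": "happy",
    ":-)": "happy",
    ":(": "sad",
    ":-(": "sad",
    ";)": "wink",
}

def normalize_tweet(text: str) -> str:
    """Turn Twitter jargon into plain text a generic BERT can handle."""
    for emoticon, word in EMOTICON_MAP.items():
        text = text.replace(emoticon, word)
    text = re.sub(r"@\w+", "user", text)          # user mentions
    text = re.sub(r"https?://\S+", "url", text)   # links
    text = re.sub(r"#(\w+)", r"\1", text)         # hashtags -> plain words
    return re.sub(r"\s+", " ", text).strip()
```

The normalized text would then go to any off-the-shelf BERT sentiment classifier for the target language.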

Author(s):  
Krishna Kumar Mohbey ◽  
Brijesh Bakariya ◽  
Vishakha Kalal

Sentiment analysis is an analytical approach used for text analysis. Its aim is to determine the opinion and subjectivity expressed in a review, post, or tweet. This chapter studies and compares some of the techniques used to classify opinions with sentiment analysis, illustrated by a case study of the 2016 demonetization in India. Based on sentiment analysis, people's opinions can be classified into polarities such as positive, negative, or neutral. The techniques are categorized by data size, document type, and availability. In addition, the chapter discusses various applications of sentiment analysis techniques in different domains.
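The polarity classification the chapter describes can be illustrated with the simplest technique family, a lexicon-based classifier; the word lists below are invented placeholders, not the chapter's actual lexicon.

```python
# Illustrative sentiment lexicons; a real lexicon-based classifier
# would use curated word lists with weighted scores.
POSITIVE = {"good", "great", "support", "benefit"}
NEGATIVE = {"bad", "loss", "queue", "problem"}

def polarity(text: str) -> str:
    """Classify text into positive/negative/neutral by lexicon counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```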


2019 ◽  
Vol 9 (18) ◽  
pp. 3648
Author(s):  
Casper S. Shikali ◽  
Zhou Sijie ◽  
Liu Qihe ◽  
Refuoe Mokhosi

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, a low-resource language widely spoken in East and Central Africa. This study proposes novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative, syllabic languages. Inspired by how Swahili is taught in beginner classes, we encoded syllables instead of characters, character n-grams, or morphemes of words, and generated quality word embeddings using a convolutional neural network. The quality of WEFSE is demonstrated by state-of-the-art results of the syllable-aware language model on both a small dataset (perplexity 31.229) and a medium dataset (perplexity 45.859), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not previously been used to compose word representation vectors. The main contributions of the study are therefore a syllabic alphabet, WEFSE, a syllable-aware language model, and a word analogy dataset for Swahili.
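The syllable encoding that replaces characters or morphemes can be sketched by exploiting Swahili's largely open (consonant cluster + vowel) syllable structure. This is a rough assumption-laden sketch, not the paper's syllabic alphabet: real Swahili syllabification also handles digraphs such as "ch" and "ng'" and syllabic nasals, which are omitted here.

```python
import re

def syllabify(word: str) -> list[str]:
    """Split a Swahili word into (consonant cluster + vowel) chunks,
    a crude approximation of its open-syllable structure."""
    return re.findall(r"[^aeiou]*[aeiou]", word.lower())
```

Each syllable would then get its own embedding vector, and a convolutional network would compose them into a word embedding.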


Author(s):  
Bali Ranaivo-Malancon

Identifying the language of an unknown text is not a new problem, but identifying close languages is. Malay and Indonesian, like many other language pairs, are very similar, which makes it genuinely difficult to search, retrieve, classify, and above all translate texts written in either of the two languages. We have built a language identifier to determine whether a text is written in Malay or Indonesian, which could be used in any similar situation. It uses the frequency and rank of character trigrams, lists of exclusive words, and the format of numbers. The trigrams are derived from the most frequent words in each language. The current program contains the following language models: Malay/Indonesian (661 trigrams), Dutch (826 trigrams), English (652 trigrams), French (579 trigrams), and German (482 trigrams). The trigrams of an unknown text are looked up in each language model. The language of the input text is the one with the highest ratios of "number of shared trigrams / total number of trigrams" and "number of winner trigrams / number of shared trigrams". If the language found at the trigram-search level is 'Malay or Indonesian', the text is then scanned for the format of numbers and for some exclusive words.
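The trigram-matching stage can be sketched as follows. This is a simplified reading of the method: only the first ratio (shared trigrams over total trigrams) is implemented, and the "winner trigram" tie-breaker and the exclusive-word/number-format pass are omitted.

```python
from collections import Counter

def char_trigrams(text: str) -> Counter:
    """Character trigrams of a text, as used to profile each language."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def identify(text: str, models: dict) -> str:
    """Pick the language whose trigram model shares the largest
    fraction of the unknown text's trigrams."""
    trigrams = set(char_trigrams(text))

    def shared_ratio(lang: str) -> float:
        return len(trigrams & models[lang]) / len(trigrams)

    return max(models, key=shared_ratio)
```

In the real system the per-language models are fixed trigram lists derived from each language's most frequent words, not full-text profiles as in this toy usage.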


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0255615 ◽  
Author(s):  
Rohitash Chandra ◽  
Aswin Krishna

Social scientists and psychologists take an interest in how people express emotions and sentiments during catastrophic events such as natural disasters, political unrest, and terrorism. The COVID-19 pandemic is a catastrophic event that has raised a number of psychological issues, such as depression, given abrupt social changes and loss of employment. Advances in deep learning-based language models have been promising for sentiment analysis of data from social networks such as Twitter. During the COVID-19 pandemic, different countries experienced different peaks, where the rise and fall of new cases drove lock-downs that directly affected the economy and employment. During rises in COVID-19 cases under stricter lock-downs, people expressed their sentiments on social media, which can provide a deep understanding of human psychology during catastrophic events. In this paper, we present a framework that employs deep learning-based language models via long short-term memory (LSTM) recurrent neural networks for sentiment analysis during the rise of novel COVID-19 cases in India. The framework features an LSTM language model with global vector (GloVe) embeddings and a state-of-the-art BERT language model. We review the sentiments expressed in selected months of 2020 covering the major peak of novel cases in India. Our framework uses multi-label sentiment classification, where more than one sentiment can be expressed at once. Our results indicate that the majority of tweets were positive, with high levels of optimism during the rise of novel COVID-19 cases, and that the number of tweets dropped significantly towards the peak. We find that optimistic, annoyed, and joking tweets mostly dominate the monthly tweets, with a much lower proportion of negative sentiments.
The predictions generally indicate that, although the majority have been optimistic, a significant group of the population has been annoyed by the way the authorities handled the pandemic.
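The multi-label setup, where one tweet can carry several sentiments at once, amounts to thresholding independent per-label probabilities rather than picking a single argmax class. The sketch below assumes sigmoid-style per-label scores from the LSTM/BERT models; the label names and threshold are illustrative.

```python
def multilabel_sentiments(probabilities: dict, threshold: float = 0.5) -> list:
    """Return every sentiment whose predicted probability clears the
    threshold, so a tweet can be e.g. both 'optimistic' and 'joking'."""
    return sorted(label for label, p in probabilities.items() if p >= threshold)
```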


Information ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 331
Author(s):  
Georgios Alexandridis ◽  
Iraklis Varlamis ◽  
Konstantinos Korovesis ◽  
George Caridakis ◽  
Panagiotis Tsantilas

As the amount of content created on social media constantly increases, more and more opinions and sentiments are expressed by people on various subjects. In this respect, sentiment analysis and opinion mining techniques can be valuable for the automatic analysis of huge textual corpora (comments, reviews, tweets, etc.). Despite the advances in text mining algorithms, deep learning techniques, and text representation models, results in such tasks are very good only for a few high-density languages (e.g., English) that possess large training corpora and rich linguistic resources; there is still considerable room for improvement in the lower-density languages. In this direction, the current work employs various language models for representing social media texts and text classifiers in the Greek language, for detecting the polarity of opinions expressed on social media. The experimental results on a related dataset collected by the authors are promising, since various classifiers based on the language models (naive Bayes, random forests, support vector machines, logistic regression, deep feed-forward neural networks) outperform those based on word or sentence embeddings (word2vec, GloVe), achieving a classification accuracy of more than 80%. Additionally, a new language model for Greek social media has been trained on the aforementioned dataset, showing that language models based on domain-specific corpora can improve the performance of generic language models by a margin of 2%. Finally, the resulting models are made freely available to the research community.
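The word-embedding baseline the language models are compared against typically represents a text as the average of its tokens' vectors, which is then fed to a classical classifier. A minimal sketch with toy 2-d vectors (the real systems use pretrained word2vec/GloVe vectors of hundreds of dimensions):

```python
def document_embedding(tokens, word_vectors):
    """Average the word vectors of a text's tokens; out-of-vocabulary
    words are skipped. Returns None if no token is known."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```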


2022 ◽  
pp. 176-194
Author(s):  
Suania Acampa ◽  
Ciro Clemente De Falco ◽  
Domenico Trezza

The uncritical application of automatic analysis techniques can be insidious. For this reason, the scientific community is very interested in the supervised approach. But can this be enough? This chapter addresses these issues by comparing three machine learning approaches to measuring sentiment. The case study is the analysis of the sentiment expressed by Italians on Twitter during the first post-lockdown day. To bootstrap the supervised model, it was necessary to build a stratified daily sample of tweets and classify them manually. The model under test provides for further analysis at the end of the process, useful for comparing the three models: an index is built on the processed tweets with the aim of assessing the quality of the results produced. Comparing the three algorithms helps the authors understand not only which approach is best for the Italian language, but also which strategy best verifies the quality of the data obtained.
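The stratified sampling step for building the manually labelled training set can be sketched as below, assuming (this is an assumption, as the chapter does not detail the strata) that tweets are stratified by day and a fixed quota is drawn per stratum; field names are illustrative.

```python
from collections import defaultdict

def stratified_sample(tweets, per_day):
    """Take a fixed number of tweets per day so every stratum is
    represented in the set to be labelled manually. Each tweet is a
    (day, text) pair; a real sampler would draw randomly per stratum
    rather than taking the first items."""
    by_day = defaultdict(list)
    for day, text in tweets:
        by_day[day].append(text)
    return {day: texts[:per_day] for day, texts in sorted(by_day.items())}
```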


2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-25
Author(s):  
Gust Verbruggen ◽  
Vu Le ◽  
Sumit Gulwani

The ability to learn programs from a few examples is a powerful technology with disruptive applications in many domains, as it allows users to automate repetitive tasks in an intuitive way. Existing frameworks for inductive synthesis only perform syntactic manipulations, relying on the syntactic structure of the given examples and not their meaning. Any semantic manipulation, such as transforming dates, has to be manually encoded by the designer of the inductive programming framework. Recent advances in large language models have shown these models to be very adept at performing semantic transformations of their input when simply provided a few examples of the task at hand. When it comes to syntactic transformations, however, these models are limited in their expressive power. In this paper, we propose a novel framework for integrating inductive synthesis with few-shot language models to combine the strengths of these two popular technologies. In particular, the inductive synthesis is tasked with breaking the problem down into smaller subproblems, among which those that cannot be solved syntactically are passed to the language model. We formalize three semantic operators that can be integrated with inductive synthesizers. To minimize invoking expensive semantic operators during learning, we introduce a novel deferred query execution algorithm that treats the operators as oracles during learning. We evaluate our approach in the domain of string transformations: the combined methodology can automate tasks that cannot be handled by either technology on its own. Finally, we demonstrate the generality of our approach via a case study in the domain of string profiling.
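The division of labour can be illustrated with a toy synthesizer: try a syntactic hypothesis first (here, "append a constant suffix", a stand-in for a real synthesizer's program space), and defer to a semantic oracle only when no syntactic program fits the examples. The oracle is any callable, standing in for the language model; this sketch does not implement the paper's actual operators or deferred-execution algorithm.

```python
def synthesize(pairs, semantic_oracle):
    """Return a string transformation consistent with the example
    (input, output) pairs. Syntactic attempt: is every output the
    input plus one shared suffix? If not, defer to the oracle."""
    suffixes = {out[len(inp):] for inp, out in pairs if out.startswith(inp)}
    if len(suffixes) == 1 and all(out.startswith(inp) for inp, out in pairs):
        suffix = suffixes.pop()
        return lambda s: s + suffix
    # No syntactic program fits: invoke the (expensive) semantic oracle.
    return semantic_oracle
```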


Author(s):  
Atro Voutilainen

This article outlines recent methods for designing part-of-speech taggers: computer programs for assigning contextually appropriate grammatical descriptors to words in texts. It begins with a description of the general architecture and task setting, gives an overview of the history of tagging, and describes the central approaches to tagging. These approaches are: taggers based on handwritten local rules, taggers based on n-grams automatically derived from text corpora, taggers based on hidden Markov models, taggers using symbolic language models automatically generated with machine learning methods, taggers based on handwritten global rules, and hybrid taggers, which combine the advantages of handwritten and automatically generated taggers. The article focuses on handwritten tagging rules. Well-tagged training corpora are a valuable resource for testing and improving language models: the text corpus reminds the grammarian of any oversight while designing a rule.
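The data-driven family of approaches can be illustrated with its simplest member, a unigram (most-frequent-tag) tagger learned from a tagged corpus; the n-gram and HMM taggers the article surveys extend this with tag-sequence context. The toy corpus and default tag below are illustrative.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_corpus):
    """Learn the most frequent tag per word from (word, tag) pairs and
    return a tagging function; unknown words get a default tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(words, default="NOUN"):
        return [(w, lexicon.get(w, default)) for w in words]

    return tag
```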


2020 ◽  
Vol 19 (04) ◽  
pp. 1037-1063
Author(s):  
Gaurav Kumar ◽  
N. Parimala

Today, smartphones are used to manage almost all aspects of our lives, from personal to professional. Different users have different requirements and preferences when selecting a smartphone, and there is no 'one-size-fits-all' remedy. Additionally, the availability of a wide variety of smartphones on the market makes it difficult for users to select the best one. Using product ratings alone to choose the best smartphone is not sufficient, because the interpretation of such ratings can be quite vague and ambiguous. In this paper, product reviews are incorporated into the decision-making process in order to select the best product for recommendation. The top five smartphone brands are considered in a case study. The proposed system analyses customer reviews of these smartphones from two online platforms, Flipkart and Amazon, using sentiment analysis techniques. It then uses a hybrid MCDM approach, combining characteristics of the AHP and TOPSIS methods, to evaluate the smartphones from a list of five alternatives and recommend the best product. The result shows that the brand1 smartphone is the best of the five smartphones based on four important decision criteria. The result of the proposed system is also validated against customer reviews of the smartphones manually annotated by experts, and the system's recommendation of the best product matches the experts' ranking. Thus, the proposed system can be a useful decision support tool for smartphone recommendation.
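The TOPSIS half of the hybrid approach can be sketched as below: normalise the decision matrix, apply criterion weights, and score each alternative by its relative closeness to the ideal solution. This sketch assumes all criteria are benefit criteria and takes the weights as given; in the paper they would come from the AHP step, and cost criteria need the ideal/anti-ideal roles swapped.

```python
import math

def topsis(matrix, weights):
    """Score alternatives (rows) against criteria (columns): higher is
    closer to the ideal alternative and farther from the anti-ideal."""
    cols = list(zip(*matrix))
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    weighted = [[w * x / n for x, w, n in zip(row, weights, norms)]
                for row in matrix]
    best = [max(col) for col in zip(*weighted)]    # ideal solution
    worst = [min(col) for col in zip(*weighted)]   # anti-ideal solution

    def dist(row, ref):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(row, ref)))

    return [dist(r, worst) / (dist(r, best) + dist(r, worst))
            for r in weighted]
```

The alternative with the highest closeness score is the one recommended.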


2020 ◽  
pp. 1-34
Author(s):  
Yi Han ◽  
Mohsen Moghaddam

Eliciting user needs for individual components and features of a product or service on a large scale is a key requirement for innovative design. Gathering and analyzing data in the initial discovery phase of a design process is usually accomplished with a small number of participants, employing qualitative research methods such as observations, focus groups, and interviews. This leaves an entire swath of pertinent user behavior, preferences, and opinions uncaptured. Sentiment analysis is a key enabler for large-scale need finding from online user reviews generated on a regular basis. A major limitation of the sentiment analysis approaches currently used in the design sciences, however, is the laborious labeling and annotation of large review datasets for training, which hinders their scalability and transferability across domains. This article proposes an efficient and scalable methodology for automated, large-scale elicitation of attribute-level user needs. The methodology builds on the state-of-the-art pretrained deep language model BERT (Bidirectional Encoder Representations from Transformers), with new convolutional net and named-entity recognition (NER) layers for extracting attribute, description, and sentiment words from online user review corpora. The machine translation evaluation metric BLEU (BiLingual Evaluation Understudy) is utilized to extract need expressions in the form of predefined part-of-speech combinations (e.g., adjective-noun, verb-noun). Numerical experiments are conducted on a large dataset scraped from a major e-commerce retail store for apparel and footwear to demonstrate the performance, feasibility, and potential of the developed methodology.
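The final extraction of need expressions as predefined part-of-speech combinations can be sketched by sliding a window over POS-tagged review tokens and keeping the pairs that match the patterns. The tag names and example tokens are illustrative; in the methodology the tags would come from the upstream BERT/NER layers.

```python
def extract_need_expressions(tagged_tokens,
                             patterns=(("ADJ", "NOUN"), ("VERB", "NOUN"))):
    """Keep adjacent word pairs whose POS tags match a predefined
    pattern, e.g. adjective-noun ('comfortable sole')."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in patterns:
            pairs.append((w1, w2))
    return pairs
```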

