Sentiment Analysis of Short Informal Texts

Journal of Artificial Intelligence Research ◽

10.1613/jair.4272 ◽

2014 ◽

Vol 50 ◽

pp. 723-762 ◽

Cited By ~ 274

Author(s):

S. Kiritchenko ◽

X. Zhu ◽

S. M. Mohammad

Keyword(s):

Sentiment Analysis ◽

Text Classification ◽

State Of The Art ◽

Surface Form ◽

Shared Task ◽

High Coverage ◽

Percentage Points ◽

Sentiment Lexicon ◽

Analysis System ◽

Performance Gains

We describe a state-of-the-art sentiment analysis system that detects (a) the sentiment of short informal textual messages such as tweets and SMS (message-level task) and (b) the sentiment of a word or a phrase within a message (term-level task). The system is based on a supervised statistical text classification approach leveraging a variety of surface-form, semantic, and sentiment features. The sentiment features are primarily derived from novel high-coverage tweet-specific sentiment lexicons. These lexicons are automatically generated from tweets with sentiment-word hashtags and from tweets with emoticons. To adequately capture the sentiment of words in negated contexts, a separate sentiment lexicon is generated for negated words. The system ranked first in the SemEval-2013 shared task 'Sentiment Analysis in Twitter' (Task 2), obtaining an F-score of 69.02 in the message-level task and 88.93 in the term-level task. Post-competition improvements boost the performance to an F-score of 70.45 (message-level task) and 89.50 (term-level task). The system also obtains state-of-the-art performance on two additional datasets: the SemEval-2013 SMS test set and a corpus of movie review excerpts. The ablation experiments demonstrate that the use of the automatically generated lexicons results in performance gains of up to 6.5 absolute percentage points.
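The lexicon-generation idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how words are scored, so the smoothed PMI-style association score below is an assumption, as are the `_NEG` suffix, the negator list, and the function names `mark_negation` and `build_lexicon`. Tweets are assumed to arrive already tokenized and labeled by their hashtag or emoticon.

```python
import math
from collections import Counter

# Assumed, illustrative word lists; a real system would use larger ones.
NEGATORS = {"not", "no", "never", "cannot"}
CLAUSE_END = {".", ",", "!", "?", ";", ":"}


def mark_negation(tokens):
    """Append _NEG to tokens that follow a negator, up to the next punctuation mark."""
    marked, negated = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            negated = False
            marked.append(tok)
        elif tok.lower() in NEGATORS or tok.lower().endswith("n't"):
            negated = True
            marked.append(tok)
        else:
            marked.append(tok + "_NEG" if negated else tok)
    return marked


def build_lexicon(labeled_tweets):
    """Score each word as PMI(word, positive) - PMI(word, negative), with add-one smoothing.

    labeled_tweets: iterable of (tokens, label) pairs, label in {"positive", "negative"}.
    """
    pos_counts, neg_counts = Counter(), Counter()
    for tokens, label in labeled_tweets:
        (pos_counts if label == "positive" else neg_counts).update(tokens)
    vocab = set(pos_counts) | set(neg_counts)
    n_pos = sum(pos_counts.values()) + len(vocab)
    n_neg = sum(neg_counts.values()) + len(vocab)
    return {
        w: math.log2(((pos_counts[w] + 1) * n_neg) / ((neg_counts[w] + 1) * n_pos))
        for w in vocab
    }
```

Running `mark_negation` before `build_lexicon` gives negated words such as `good_NEG` their own entries, which is one simple way to obtain separate scores for negated contexts.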


Deep Learning for text in limited data settings

10.36227/techrxiv.12100692 ◽

2020 ◽

Author(s):

Pathikkumar Patel ◽

Bhargav Lad ◽

Jinan Fiaidhi

Keyword(s):

Machine Learning ◽

Time Series ◽

Deep Learning ◽

Sentiment Analysis ◽

Transfer Learning ◽

Text Classification ◽

State Of The Art ◽

Time Series Forecasting ◽

Text Data ◽

Performance Levels

During the last few years, RNN models have been used extensively and have proven well suited to sequence and text data. RNNs have achieved state-of-the-art performance levels in several applications such as text classification, sequence-to-sequence modelling, and time series forecasting. In this article, we review different Machine Learning and Deep Learning based approaches for text data and look at the results obtained from these methods. This work also explores the use of transfer learning in NLP and how it affects the performance of models on a specific application of sentiment analysis.


IASL valence-arousal analysis system at IALP 2016 shared task: Dimensional sentiment analysis for Chinese words

2016 International Conference on Asian Language Processing (IALP) ◽

10.1109/ialp.2016.7875990 ◽

2016 ◽

Cited By ~ 1

Author(s):

Yu-Lun Hsieh ◽

Chen-Ann Wang ◽

Ying-Wei Wu ◽

Yung-Chun Chang ◽

Wen-Lian Hsu

Keyword(s):

Sentiment Analysis ◽

Shared Task ◽

Analysis System


Automatic Generation of an Aspect and Domain Sensitive Sentiment Lexicon

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213014600197 ◽

2014 ◽

Vol 23 (04) ◽

pp. 1460019

Author(s):

Hsiang Hui Lek ◽

Danny Chiang Choon Poo

Keyword(s):

Sentiment Analysis ◽

State Of The Art ◽

Automatic Generation ◽

Research Community ◽

Automatic Method ◽

Long Tail ◽

Tail Distribution ◽

Sentiment Lexicon ◽

Fully Automatic ◽

Distribution Behavior

A sentiment lexicon plays an important role in determining the polarity of words and is an important component of sentiment analysis applications. Most sentiment lexicons assign a fixed polarity to each word. However, it has been noted that the polarity of a word depends on how it is used, so these lexicons are unable to accurately classify the polarity of sentiments. Considering the aspect and domain of a word allows us to classify sentiments more accurately. This paper presents a fully automatic method to build an aspect and domain sensitive sentiment lexicon, which assigns a polarity to a word depending on both the aspect and the domain. The experimental results show that our lexicon significantly outperforms other commonly used sentiment lexicons and state-of-the-art approaches. To the best of our knowledge, no such lexicon is publicly available. We therefore make this lexicon publicly available, as we believe it will benefit the research community. In addition, we observe the long tail distribution behavior of product aspects and propose the possibility of aspect ranking by comparing the number of domains and number of sentiment words present for an aspect.
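To make the difference from a fixed-polarity lexicon concrete, here is a minimal sketch of a lookup keyed on domain, aspect, and word. The data layout, the fallback order, and the function name `lookup_polarity` are assumptions for illustration, and the entries in the usage note are hypothetical toy values, not taken from the published lexicon.

```python
def lookup_polarity(word, aspect, domain, specific, general):
    """Return the polarity of `word` for a given aspect and domain.

    specific: dict mapping (domain, aspect, word) -> polarity (+1 or -1)
    general:  dict mapping word -> fixed polarity, used as a fallback
    Unknown words get 0 (no polarity).
    """
    key = (domain, aspect, word)
    if key in specific:
        return specific[key]
    return general.get(word, 0)
```

With hypothetical entries such as `{("phone", "battery life", "long"): 1, ("phone", "startup time", "long"): -1}`, the word "long" resolves to opposite polarities for the two aspects, which a single fixed entry per word cannot express.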


An Ensemble Learning Strategy for Eligibility Criteria Text Classification for Clinical Trial Recruitment: Algorithm Development and Validation

JMIR Medical Informatics ◽

10.2196/17832 ◽

2020 ◽

Vol 8 (7) ◽

pp. e17832

Author(s):

Kun Zeng ◽

Zhiwei Pan ◽

Yibin Xu ◽

Yingying Qu

Keyword(s):

Clinical Trial ◽

Clinical Trials ◽

Natural Language Processing ◽

Ensemble Learning ◽

Language Processing ◽

Text Classification ◽

State Of The Art ◽

Shared Task ◽

Eligibility Criteria ◽

Short Text

Background: Eligibility criteria are the main strategy for screening appropriate participants for clinical trials. Automatic analysis of clinical trial eligibility criteria by digital screening, leveraging natural language processing techniques, can improve recruitment efficiency and reduce the costs involved in promoting clinical research. Objective: We aimed to create a natural language processing model to automatically classify clinical trial eligibility criteria. Methods: We proposed a classifier for short text eligibility criteria based on ensemble learning, where a set of pretrained models was integrated. The pretrained models included state-of-the-art deep learning methods for training and classification, including Bidirectional Encoder Representations from Transformers (BERT), XLNet, and A Robustly Optimized BERT Pretraining Approach (RoBERTa). The classification results by the integrated models were combined as new features for training a Light Gradient Boosting Machine (LightGBM) model for eligibility criteria classification. Results: Our proposed method obtained an accuracy of 0.846, a precision of 0.803, and a recall of 0.817 on a standard data set from a shared task of an international conference. The macro F1 value was 0.807, outperforming the state-of-the-art baseline methods on the shared task. Conclusions: We designed a model for screening short text classification criteria for clinical trials based on multimodel ensemble learning. Through experiments, we concluded that performance was improved significantly with a model ensemble compared to a single model. The introduction of focal loss could reduce the impact of class imbalance to achieve better performance.
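Two ideas from this abstract can be sketched in a few lines of plain Python: combining the class probabilities of several base models into one feature row per sample (the input a meta-learner such as LightGBM would be trained on), and the focal loss for a single example. The function names, the default `gamma` and `alpha` values, and the data layout are assumptions. The sketch does not call BERT, XLNet, RoBERTa, or LightGBM, and it is not the authors' implementation.

```python
import math


def stack_features(per_model_probs):
    """Concatenate each model's class probabilities into one feature row per sample.

    per_model_probs: list over models; each item is a list over samples;
    each sample is a list of class probabilities.
    """
    n_samples = len(per_model_probs[0])
    stacked = []
    for i in range(n_samples):
        row = []
        for model_probs in per_model_probs:
            row.extend(model_probs[i])
        stacked.append(row)
    return stacked


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)**gamma * log(p_t).

    p_t is the predicted probability of the true class. With gamma = 0 this
    reduces to ordinary cross-entropy; larger gamma down-weights easy examples.
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

Because the `(1 - p_t)**gamma` factor shrinks the loss of confidently correct predictions, training concentrates on hard examples, which are often the ones from under-represented classes.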


An Empirical Study of Korean Sentence Representation with Various Tokenizations

Electronics ◽

10.3390/electronics10070845 ◽

2021 ◽

Vol 10 (7) ◽

pp. 845

Author(s):

Danbi Cho ◽

Hyunyoung Lee ◽

Seungshik Kang

Keyword(s):

Empirical Study ◽

Natural Language ◽

Sentiment Analysis ◽

Machine Translation ◽

Text Classification ◽

State Of The Art ◽

Language Models ◽

Vocabulary Size ◽

Analysis Task ◽

Natural Language Process

How the token unit is defined in a sentence matters in natural language processing tasks such as text classification, machine translation, and generation. Many recent studies have used subword tokenization in language models such as BERT, KoBERT, and ALBERT. Although these language models achieved state-of-the-art results in various NLP tasks, it is not clear whether the subword is the best token unit for Korean sentence embedding. Thus, we carried out sentence embedding based on word, morpheme, subword, and submorpheme units on Korean sentiment analysis. We explored two sentence representation methods for sentence embedding: one that considers the order of tokens in a sentence and one that does not. Feeding a sentence decomposed into token units to the two sentence representation methods, we construct the sentence embedding with various tokenizations to find the most effective token unit for Korean sentence embedding. In our work, we confirmed the robustness of the subword unit to out-of-vocabulary (OOV) problems compared to other token units, the disadvantage of replacing whitespace with a particular symbol in the sentiment analysis task, and that the optimal vocabulary size is 16K in subword and submorpheme tokenization. We found empirically that the subword unit, tokenized with a vocabulary size of 16K and without replacement of whitespace, was the most effective for sentence embedding on the Korean sentiment analysis task.
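The OOV robustness of subword units can be illustrated with a toy comparison between word-level and subword-level tokenization. The greedy longest-match segmenter below is a stand-in chosen for brevity. It is not the tokenizer used in the paper, the vocabulary is hypothetical, and English strings are used only to keep the example readable.

```python
UNK = "[UNK]"


def word_tokenize(sentence, vocab):
    """Word-level tokenization: any word outside the vocabulary becomes [UNK]."""
    return [w if w in vocab else UNK for w in sentence.split()]


def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation of one word into known subword pieces.

    A character that starts no known piece is emitted as [UNK] by itself.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            pieces.append(UNK)
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces
```

A word such as "unhappiness" that is missing from a word-level vocabulary collapses to a single `[UNK]`, whereas a subword vocabulary containing "un", "happi", and "ness" still yields three informative pieces.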


Near-Lossless Binarization of Word Embeddings

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33017104 ◽

2019 ◽

Vol 33 ◽

pp. 7104-7111 ◽

Cited By ~ 3

Author(s):

Julien Tissier ◽

Christophe Gravier ◽

Amaury Habrard

Keyword(s):

Sentiment Analysis ◽

Semantic Similarity ◽

Text Classification ◽

Semantic Information ◽

State Of The Art ◽

Floating Point ◽

Word Embeddings ◽

Binary Vectors ◽

Starting Point ◽

Memory Footprint

Word embeddings are commonly used as a starting point in many NLP models to achieve state-of-the-art performance. However, with a large vocabulary and many dimensions, these floating-point representations are expensive in terms of both memory and computation, which makes them unsuitable for use on low-resource devices. The method proposed in this paper transforms real-valued embeddings into binary embeddings while preserving semantic information, requiring only 128 or 256 bits for each vector. This leads to a small memory footprint and fast vector operations. The model is based on an autoencoder architecture, which also allows the original vectors to be reconstructed from the binary ones. Experimental results on semantic similarity, text classification, and sentiment analysis tasks show that the binarization of word embeddings leads to a loss of only ∼2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors.
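The memory and speed claims can be made concrete with a short sketch of how binary vectors are stored and compared. The binary codes here are supplied directly, since the paper's autoencoder is not reproduced. The function names are hypothetical, and the 300-dimensional 32-bit float baseline used in the usage note is an assumption, not a figure stated in the abstract.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into a single Python integer."""
    value = 0
    for b in bits:
        value = (value << 1) | (1 if b else 0)
    return value


def hamming_similarity(a, b, n_bits):
    """Similarity in [0, 1]: share of bit positions on which two codes agree.

    XOR marks differing bits; counting them gives the Hamming distance.
    """
    distance = bin(a ^ b).count("1")
    return 1.0 - distance / n_bits


def size_reduction(n_dims, bits_per_float, n_binary_bits):
    """Fraction of storage saved by replacing a float vector with a binary code."""
    return 1.0 - n_binary_bits / (n_dims * bits_per_float)
```

Under the assumed baseline, `size_reduction(300, 32, 256)` is about 0.973, which is consistent with the 97% reduction reported for 256-bit codes. Comparing two codes needs only an XOR and a bit count, which is why such vectors support fast similarity search.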


Fuzzy-Based Sentiment Analysis System for Analyzing Student Feedback and Satisfaction

10.20944/preprints201907.0006.v1 ◽

2019 ◽

Author(s):

Muhammad Zubair Asghar ◽

Ikram Ullah ◽

Shahab Shamshirband ◽

Fazal Masud Kundi ◽

Ammara Habib

Keyword(s):

Sentiment Analysis ◽

State Of The Art ◽

Student Feedback ◽

Feedback Analysis ◽

Fine Grained ◽

Online Social Media ◽

Logic Module ◽

Data Collection And Analysis ◽

Analysis System ◽

Sentiment Score

Feedback collection and analysis has long been an important subject. Traditional techniques for student feedback analysis are based on questionnaire-based data collection and analysis. However, students also express their feedback on online social media sites, and this feedback needs to be analyzed. This study aims to develop a fuzzy-based sentiment analysis system for analyzing student feedback and satisfaction by assigning a proper sentiment score to the opinion words and polarity shifters present in the input reviews. Our technique computes the sentiment score of student feedback reviews and then applies a fuzzy-logic module to analyze and quantify student satisfaction at a fine-grained level. The experimental results reveal that the proposed work outperforms the baseline studies as well as state-of-the-art machine learning classifiers.
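As a rough illustration of the fuzzy-logic step, the sketch below maps a sentiment score to degrees of membership in satisfaction levels using triangular membership functions. The score range of -1 to 1, the three level names, their breakpoints, and the function names are all assumptions. The abstract does not describe the paper's actual fuzzy sets or rules.

```python
def triangular(x, a, b, c):
    """Triangular membership function: 0 at a and c, rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets over a sentiment score in [-1, 1].
SATISFACTION_SETS = {
    "dissatisfied": (-2.0, -1.0, 0.0),
    "neutral": (-1.0, 0.0, 1.0),
    "satisfied": (0.0, 1.0, 2.0),
}


def fuzzify(score):
    """Return the degree of membership of a score in each satisfaction level."""
    return {name: triangular(score, *abc) for name, abc in SATISFACTION_SETS.items()}
```

A score of 0.5 belongs partly to "neutral" and partly to "satisfied" rather than being forced into one class, which is the kind of fine-grained quantification the abstract refers to.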


Deep Learning for text in limited data settings

10.36227/techrxiv.12100692.v1 ◽

2020 ◽

Author(s):

Pathikkumar Patel ◽

Bhargav Lad ◽

Jinan Fiaidhi

Keyword(s):

Machine Learning ◽

Time Series ◽

Deep Learning ◽

Sentiment Analysis ◽

Transfer Learning ◽

Text Classification ◽

State Of The Art ◽

Time Series Forecasting ◽

Text Data ◽

Performance Levels


An Integrated Deep Learning and Belief Rule-Based Expert System for Visual Sentiment Analysis under Uncertainty

Algorithms ◽

10.3390/a14070213 ◽

2021 ◽

Vol 14 (7) ◽

pp. 213

Author(s):

Sharif Noor Zisad ◽

Etu Chowdhury ◽

Mohammad Shahadat Hossain ◽

Raihan Ul Islam ◽

Karl Andersson

Keyword(s):

Deep Learning ◽

Expert System ◽

Sentiment Analysis ◽

State Of The Art ◽

Integrated System ◽

Data Driven ◽

Psychological State ◽

Rule Base ◽

Belief Rule Base ◽

Analysis System

Visual sentiment analysis has become more popular than textual sentiment analysis in various domains for decision-making purposes. We therefore develop a visual sentiment analysis system that can classify image expression. The system classifies images into six expressions: anger, joy, love, surprise, fear, and sadness. In our study, we propose an expert system that integrates a Deep Learning method with a Belief Rule Base (known as the BRB-DL approach) to assess an image’s overall sentiment under uncertainty. This BRB-DL approach includes both data-driven and knowledge-driven techniques to determine the overall sentiment. Our integrated expert system outperforms the state-of-the-art methods of visual sentiment analysis with promising results. The integrated system can classify images with 86% accuracy. The system can help in understanding the emotional tendency and psychological state of an individual.


An Ensemble Learning Strategy for Eligibility Criteria Text Classification for Clinical Trial Recruitment: Algorithm Development and Validation (Preprint)

10.2196/preprints.17832 ◽

2020 ◽

Author(s):

Kun Zeng ◽

Zhiwei Pan ◽

Yibin Xu ◽

Yingying Qu

Keyword(s):

Clinical Trial ◽

Clinical Trials ◽

Natural Language Processing ◽

Ensemble Learning ◽

Language Processing ◽

Text Classification ◽

State Of The Art ◽

Shared Task ◽

Eligibility Criteria ◽

Short Text

Background: Eligibility criteria are the main strategy for screening appropriate participants for clinical trials. Automatic analysis of clinical trial eligibility criteria by digital screening, leveraging natural language processing techniques, can improve recruitment efficiency and reduce the costs involved in promoting clinical research. Objective: We aimed to create a natural language processing model to automatically classify clinical trial eligibility criteria. Methods: We proposed a classifier for short text eligibility criteria based on ensemble learning, where a set of pretrained models was integrated. The pretrained models included state-of-the-art deep learning methods for training and classification, including Bidirectional Encoder Representations from Transformers (BERT), XLNet, and A Robustly Optimized BERT Pretraining Approach (RoBERTa). The classification results by the integrated models were combined as new features for training a Light Gradient Boosting Machine (LightGBM) model for eligibility criteria classification. Results: Our proposed method obtained an accuracy of 0.846, a precision of 0.803, and a recall of 0.817 on a standard data set from a shared task of an international conference. The macro F1 value was 0.807, outperforming the state-of-the-art baseline methods on the shared task. Conclusions: We designed a model for screening short text classification criteria for clinical trials based on multimodel ensemble learning. Through experiments, we concluded that performance was improved significantly with a model ensemble compared to a single model. The introduction of focal loss could reduce the impact of class imbalance to achieve better performance.
