Categorical Metadata Representation for Customized Text Classification

2019 ◽  
Vol 7 ◽  
pp. 201-215 ◽  
Author(s):  
Jihyeok Kim ◽  
Reinald Kim Amplayo ◽  
Kyungjae Lee ◽  
Sua Sung ◽  
Minji Seo ◽  
...  

The performance of text classification has improved tremendously with intelligently engineered neural models, especially those injecting categorical metadata as additional information, e.g., using user/product information for sentiment classification. This information has been used to modify parts of the model (e.g., word embeddings, attention mechanisms) so that results can be customized according to the metadata. We observe that current representation methods for categorical metadata, which are devised for human consumption, are not as effective as claimed in popular classification methods, and are outperformed even by simple concatenation of categorical features in the final layer of the sentence encoder. We conjecture that categorical features are harder to represent for machine use, as the available context only indirectly describes the category, and even such context is often scarce (for tail categories). To this end, we propose using basis vectors to effectively incorporate categorical metadata into various parts of a neural model. This additionally decreases the number of parameters dramatically, especially when the number of categorical features is large. Extensive experiments on various data sets with different properties show that our method represents categorical metadata more effectively, customizes parts of the model (including previously unexplored ones), and greatly increases model performance.
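The basis-vector idea can be sketched as follows: instead of a full |C| × d embedding table, each category stores only k mixture weights over k shared basis vectors. This is a minimal sketch under illustrative assumptions (the sizes, the softmax mixing, and all names below are not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 1000  # e.g. distinct users or products
d = 64               # embedding dimension
k = 8                # number of shared basis vectors, k << n_categories

# Shared basis vectors and per-category mixture logits.
basis = rng.normal(size=(k, d))               # k x d parameters
logits = rng.normal(size=(n_categories, k))   # |C| x k parameters

def category_embedding(c):
    # Softmax over the category's logits gives its mixture weights;
    # the embedding is a convex combination of the shared bases.
    w = np.exp(logits[c] - logits[c].max())
    w /= w.sum()
    return w @ basis  # d-dimensional vector

emb = category_embedding(42)

# Parameter count drops from |C|*d to |C|*k + k*d.
full_table = n_categories * d            # 64,000
basis_params = n_categories * k + k * d  # 8,512
```

In the paper's setting the mixture weights would be learned jointly with the classifier and used to customize word embeddings, attention, or other parts of the model rather than computed from random logits.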

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract. This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second.
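A toy harness for this kind of pre-processing ablation might look as follows. The step names and regular expressions below are illustrative assumptions, not the 26 techniques the paper actually evaluates:

```python
import re

# Hypothetical Arabic pre-processing steps (illustrative only).
def remove_diacritics(text):
    # Strip harakat (fatha, damma, kasra, tanween, shadda, sukun).
    return re.sub(r'[\u064B-\u0652]', '', text)

def normalize_alef(text):
    # Map alef variants (U+0622, U+0623, U+0625) to bare alef (U+0627).
    return re.sub(r'[\u0622\u0623\u0625]', '\u0627', text)

PREPROCESSINGS = {
    'none': lambda t: t,
    'diacritics': remove_diacritics,
    'alef': normalize_alef,
    'both': lambda t: normalize_alef(remove_diacritics(t)),
}

def evaluate_variant(name, tweets, labels, train_and_score):
    # Apply one pre-processing variant, then train/score any classifier
    # (KNN, SVM, NB, ...) supplied by the caller.
    cleaned = [PREPROCESSINGS[name](t) for t in tweets]
    return train_and_score(cleaned, labels)
```

Each variant is scored with the same classifier and data split, so accuracy differences can be attributed to the pre-processing step alone.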


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. The resulting document vectors are then input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
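The abstract does not specify the two linear weighting methods, so as a hedged sketch, here is how pre-trained word vectors (toy random vectors below stand in for word2vec/GloVe) can be linearly combined into document vectors by uniform versus weighted averaging:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for pre-trained word vectors.
vocab = ['good', 'bad', 'movie', 'plot']
vectors = {w: rng.normal(size=8) for w in vocab}

def doc_vector_uniform(tokens):
    # Plain average of the word vectors present in the vocabulary.
    vs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vs, axis=0)

def doc_vector_weighted(tokens, weights):
    # Linearly weighted average, e.g. with TF-IDF-style weights.
    kept = [t for t in tokens if t in vectors]
    vs = np.array([vectors[t] for t in kept])
    ws = np.array([weights[t] for t in kept])
    return (ws[:, None] * vs).sum(axis=0) / ws.sum()

doc = ['good', 'movie']
v_uniform = doc_vector_uniform(doc)
v_weighted = doc_vector_weighted(doc, {'good': 2.0, 'movie': 1.0})
```

In the EWE setting such document vectors would then be fed to a neural sentiment classifier, and the classification loss back-propagated so that emotional polarity flows into the word vectors themselves.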


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Le Wang ◽  
Meng Han ◽  
Xiaojuan Li ◽  
Ni Zhang ◽  
Haodong Cheng

2021 ◽  
Vol 16 (5) ◽  
pp. 1631-1647
Author(s):  
Sooa Hwang ◽  
Hyunah Park ◽  
Kyunghui Oh ◽  
Sangwoong Hwang ◽  
Jaewoo Joo

We investigated whether adding product information in mobile commerce improved consumers’ attitudes toward a product and whether this relationship was moderated by consumption goals. We conducted two field experiments in which we recruited parents in Korea and the USA and asked them to evaluate two childcare hybrid products (HPs) newly developed by Samsung Electronics designers. The results revealed that participants exposed to additional information about the HPs evaluated them more favorably than those who were not. However, this relationship disappeared when a consumption goal was activated. Our findings establish a dynamic relationship between information seeking and consumption goals, prompting designers to rethink their rule of thumb in the mobile commerce context.


2005 ◽  
Vol 01 (01) ◽  
pp. 129-145 ◽  
Author(s):  
XIAOBO ZHOU ◽  
XIAODONG WANG ◽  
EDWARD R. DOUGHERTY

In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables (gene expressions) and the small number of experimental conditions. Many gene-selection and classification methods have been proposed; however, most of these treat gene selection and classification separately, and not under the same model. We propose a Bayesian approach to gene selection using the logistic regression model. The Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the minimum description length (MDL) principle are used in constructing the posterior distribution of the chosen genes. The same logistic regression model is then used for cancer classification. Fast implementation issues for these methods are discussed. The proposed methods are tested on several data sets, including those arising from hereditary breast cancer, small round blue-cell tumors, lymphoma, and acute leukemia. The experimental results indicate that the proposed methods achieve high classification accuracies on these data sets. Some robustness and sensitivity properties of the proposed methods are also discussed. Finally, combining logistic-regression-based gene selection with other classification methods, and logistic-regression-based classification with other gene-selection methods, are considered.
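The model-selection step can be sketched on synthetic data. This is only a single-gene BIC scan under toy assumptions; the paper's Bayesian posterior construction over gene subsets is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, iters=500, lr=0.1):
    # Plain gradient-ascent fit of logistic regression with an intercept.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)
    p = np.clip(1.0 / (1.0 + np.exp(-Xb @ w)), 1e-9, 1 - 1e-9)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return w, loglik

def bic(loglik, n_params, n_samples):
    # BIC = k * ln(n) - 2 * ln(L); lower is better.
    return n_params * np.log(n_samples) - 2.0 * loglik

# Toy expression matrix: 40 samples x 10 genes; gene 3 drives the class.
X = rng.normal(size=(40, 10))
y = (X[:, 3] + 0.1 * rng.normal(size=40) > 0).astype(float)

# Score each single-gene logistic model and keep the lowest BIC.
scores = [bic(fit_logistic(X[:, [g]], y)[1], 2, len(y)) for g in range(10)]
best_gene = int(np.argmin(scores))
```

Because the same logistic model does both selection and classification, the gene subset that minimizes the criterion can be reused directly as the classifier, which is the unifying point of the approach.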


2003 ◽  
Vol 21 (1) ◽  
pp. 123-135 ◽  
Author(s):  
S. Vignudelli ◽  
P. Cipollini ◽  
F. Reseghetti ◽  
G. Fusco ◽  
G. P. Gasparini ◽  
...  

Abstract. From September 1999 to December 2000, eXpendable Bathy-Thermograph (XBT) profiles were collected along the Genova-Palermo shipping route in the framework of the Mediterranean Forecasting System Pilot Project (MFSPP). The route is virtually coincident with track 0044 of the TOPEX/Poseidon satellite altimeter, crossing the Ligurian and Tyrrhenian basins in an approximate N–S direction. This allows a direct comparison between XBT and altimetry, whose findings are presented in this paper. XBT sections reveal the presence of the major features of the regional circulation, namely the eastern boundary of the Ligurian gyre, the Bonifacio gyre and the Modified Atlantic Water inflow along the Sicily coast. Twenty-two comparisons of steric heights derived from the XBT data set with concurrent realizations of single-pass altimetric heights are made. The overall correlation is around 0.55 with an RMS difference of less than 3 cm. In the Tyrrhenian Sea the spectra are remarkably similar in shape, but in general the altimetric heights contain more energy. This difference is explained in terms of oceanographic signals, which are captured with a different intensity by the satellite altimeter and XBTs, as well as computational errors. On scales larger than 100 km, the data sets are also significantly coherent, with increasing coherence values at longer wavelengths. The XBTs were dropped every 18–20 km along the track: as a consequence, the spacing scale was unable to resolve adequately the internal radius of deformation (< 20 km). Furthermore, few XBT drops were carried out in the Ligurian Sea, due to the limited north-south extent of this basin, so the comparison is problematic there. On the contrary, the major features observed in the XBT data in the Tyrrhenian Sea are also detected by TOPEX/Poseidon. The manuscript is completed by a discussion on how to integrate the two data sets, in order to extract additional information. 
In particular, the results emphasize their complementarity in providing a dynamically complete description of the observed structures.
Key words. Oceanography: general (descriptive and regional oceanography); Oceanography: physical (sea level variations; instruments and techniques)
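The basic along-track comparison reported above (pass-by-pass correlation and RMS difference) can be sketched on synthetic profiles; the shared-signal model and all numbers below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic along-track heights (cm): a shared oceanographic signal plus
# independent noise, standing in for XBT-derived steric height and
# TOPEX/Poseidon altimetric height along the same track.
n = 200
signal = np.cumsum(rng.normal(size=n)) * 0.5
steric = signal + rng.normal(scale=2.0, size=n)
altimetric = signal + rng.normal(scale=2.0, size=n)

def compare_tracks(a, b):
    # Pearson correlation and RMS difference between the two profiles.
    r = np.corrcoef(a, b)[0, 1]
    rms = np.sqrt(np.mean((a - b) ** 2))
    return r, rms

r, rms = compare_tracks(steric, altimetric)
```

The paper additionally compares wavenumber spectra and coherence between the two data sets, which would require detrending and Fourier analysis of each pass on top of this point-wise comparison.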


2022 ◽  
Vol 9 (1) ◽  
Author(s):  
Marcos Fabietti ◽  
Mufti Mahmud ◽  
Ahmad Lotfi

Abstract. Acquisition of neuronal signals involves a wide range of devices with specific electrical properties. Combined with other physiological sources within the body, the signals sensed by the devices are often distorted. Sometimes these distortions are visually identifiable; at other times, they overlap with the signal characteristics, making them very difficult to detect. To remove these distortions, the recordings are visually inspected and manually processed. However, this manual annotation process is time-consuming, and automatic computational methods are needed to identify and remove these artefacts. Most existing artefact removal approaches rely on additional information from other recorded channels and fail when global artefacts are present or when the affected channels constitute the majority of the recording system. Addressing this issue, this paper reports a novel channel-independent machine learning model to accurately identify and replace the artefactual segments present in the signals. Discarding these artefactual segments, as existing approaches do, causes discontinuities in the reproduced signals, which may introduce errors in subsequent analyses. To avoid this, the proposed method predicts multiple values for the artefactual region using a long short-term memory (LSTM) network to recreate the temporal and spectral properties of the recorded signal. The method has been tested on two open-access data sets and incorporated into the open-access SANTIA (SigMate Advanced: a Novel Tool for Identification of Artefacts in Neuronal Signals) toolbox for community use.
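The paper's filler is an LSTM; as a library-free stand-in, the same multi-step idea of predicting each artefactual sample recursively from the preceding samples can be illustrated with a linear autoregressive model. Everything below is a toy assumption, not the SANTIA implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy recording: a sine wave with an artefactual segment at [100, 120).
t = np.arange(300)
signal = np.sin(2 * np.pi * t / 50)
recorded = signal.copy()
recorded[100:120] += rng.normal(scale=5.0, size=20)  # injected artefact

def ar_fit(x, order):
    # Least-squares fit of x[t] ~ sum_i a_i * x[t-i] on clean data.
    rows = [x[i:len(x) - order + i] for i in range(order)]
    A = np.stack(rows[::-1], axis=1)  # lagged values, most recent first
    b = x[order:]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

def fill_segment(x, start, stop, order=10):
    # Multi-step prediction: each artefactual sample is predicted from
    # the preceding (clean or already-filled) samples, keeping the
    # trace continuous instead of leaving a gap.
    coef = ar_fit(x[:start], order)
    filled = x.copy()
    for i in range(start, stop):
        filled[i] = coef @ filled[i - order:i][::-1]
    return filled

cleaned = fill_segment(recorded, 100, 120)
```

Replacing the segment with predicted values preserves continuity, which is exactly the property the paper highlights over simply discarding artefactual windows; the LSTM plays the role of the predictor here.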

