Categorical Metadata Representation for Customized Text Classification

2019 ◽  
Vol 7 ◽  
pp. 201-215 ◽  
Author(s):  
Jihyeok Kim ◽  
Reinald Kim Amplayo ◽  
Kyungjae Lee ◽  
Sua Sung ◽  
Minji Seo ◽  
...  

The performance of text classification has improved tremendously with intelligently engineered neural models, especially those injecting categorical metadata as additional information, e.g., using user/product information for sentiment classification. This information has been used to modify parts of the model (e.g., word embeddings, attention mechanisms) so that results can be customized according to the metadata. We observe that current representation methods for categorical metadata, which are devised for human consumption, are not as effective as claimed in popular classification methods, and are outperformed even by simple concatenation of categorical features in the final layer of the sentence encoder. We conjecture that categorical features are harder to represent for machine use, as the available context only indirectly describes the category, and even such context is often scarce (for tail categories). To this end, we propose using basis vectors to effectively incorporate categorical metadata into various parts of a neural model. This additionally decreases the number of parameters dramatically, especially when the number of categorical features is large. Extensive experiments on various data sets with different properties show that our method represents categorical metadata more effectively, customizes parts of the model (including previously unexplored ones), and greatly increases model performance.
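The basis-vector idea can be sketched as follows: instead of a full |C| × d embedding table, each category stores only k mixture weights over k shared basis vectors. This is a minimal sketch under illustrative assumptions (the sizes, the softmax mixing, and all names below are not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 1000  # e.g. distinct users or products
d = 64               # embedding dimension
k = 8                # number of shared basis vectors, k << n_categories

# Shared basis vectors and per-category mixture logits.
basis = rng.normal(size=(k, d))               # k x d parameters
logits = rng.normal(size=(n_categories, k))   # |C| x k parameters

def category_embedding(c):
    # Softmax over the category's logits gives its mixture weights;
    # the embedding is a convex combination of the shared bases.
    w = np.exp(logits[c] - logits[c].max())
    w /= w.sum()
    return w @ basis  # d-dimensional vector

emb = category_embedding(42)

# Parameter count drops from |C|*d to |C|*k + k*d.
full_table = n_categories * d            # 64,000
basis_params = n_categories * k + k * d  # 8,512
```

In the paper's setting the mixture weights would be learned jointly with the classifier and used to customize word embeddings, attention, or other parts of the model rather than computed from random logits.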

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

Abstract. This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with a lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second.
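A toy harness for this kind of pre-processing ablation might look as follows. The step names and regular expressions below are illustrative assumptions, not the 26 techniques the paper actually evaluates:

```python
import re

# Hypothetical Arabic pre-processing steps (illustrative only).
def remove_diacritics(text):
    # Strip harakat (fatha, damma, kasra, tanween, shadda, sukun).
    return re.sub(r'[\u064B-\u0652]', '', text)

def normalize_alef(text):
    # Map alef variants (U+0622, U+0623, U+0625) to bare alef (U+0627).
    return re.sub(r'[\u0622\u0623\u0625]', '\u0627', text)

PREPROCESSINGS = {
    'none': lambda t: t,
    'diacritics': remove_diacritics,
    'alef': normalize_alef,
    'both': lambda t: normalize_alef(remove_diacritics(t)),
}

def evaluate_variant(name, tweets, labels, train_and_score):
    # Apply one pre-processing variant, then train/score any classifier
    # (KNN, SVM, NB, ...) supplied by the caller.
    cleaned = [PREPROCESSINGS[name](t) for t in tweets]
    return train_and_score(cleaned, labels)
```

Each variant is scored with the same classifier and data split, so accuracy differences can be attributed to the pre-processing step alone.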


2021 ◽  
pp. 1-13
Author(s):  
Qingtian Zeng ◽  
Xishi Zhao ◽  
Xiaohui Hu ◽  
Hua Duan ◽  
Zhongying Zhao ◽  
...  

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. The resulting document vectors are then input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
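The abstract does not specify the two linear weighting methods, so as a hedged sketch, here is how pre-trained word vectors (toy random vectors below stand in for word2vec/GloVe) can be linearly combined into document vectors by uniform versus weighted averaging:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for pre-trained word vectors.
vocab = ['good', 'bad', 'movie', 'plot']
vectors = {w: rng.normal(size=8) for w in vocab}

def doc_vector_uniform(tokens):
    # Plain average of the word vectors present in the vocabulary.
    vs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vs, axis=0)

def doc_vector_weighted(tokens, weights):
    # Linearly weighted average, e.g. with TF-IDF-style weights.
    kept = [t for t in tokens if t in vectors]
    vs = np.array([vectors[t] for t in kept])
    ws = np.array([weights[t] for t in kept])
    return (ws[:, None] * vs).sum(axis=0) / ws.sum()

doc = ['good', 'movie']
v_uniform = doc_vector_uniform(doc)
v_weighted = doc_vector_weighted(doc, {'good': 2.0, 'movie': 1.0})
```

In the EWE setting such document vectors would then be fed to a neural sentiment classifier, and the classification loss back-propagated so that emotional polarity flows into the word vectors themselves.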


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Le Wang ◽  
Meng Han ◽  
Xiaojuan Li ◽  
Ni Zhang ◽  
Haodong Cheng

2021 ◽  
Vol 16 (5) ◽  
pp. 1631-1647
Author(s):  
Sooa Hwang ◽  
Hyunah Park ◽  
Kyunghui Oh ◽  
Sangwoong Hwang ◽  
Jaewoo Joo

We investigated whether adding product information in mobile commerce improved consumers’ attitudes toward a product and whether this relationship was moderated by consumption goals. We conducted two field experiments in which we recruited parents in Korea and the USA and asked them to evaluate two childcare hybrid products (HPs) newly developed by Samsung Electronics designers. The results revealed that participants exposed to additional information about the HPs evaluated them more favorably than those who were not. However, this relationship disappeared when a consumption goal was activated. Our findings establish a dynamic relationship between information seeking and consumption goals, prompting designers to rethink their rule of thumb in the mobile commerce context.


2005 ◽  
Vol 01 (01) ◽  
pp. 129-145 ◽  
Author(s):  
XIAOBO ZHOU ◽  
XIAODONG WANG ◽  
EDWARD R. DOUGHERTY

In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables (gene expressions) and the small number of experimental conditions. Many gene-selection and classification methods have been proposed; however, most of these treat gene selection and classification separately, and not under the same model. We propose a Bayesian approach to gene selection using the logistic regression model. The Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the minimum description length (MDL) principle are used in constructing the posterior distribution of the chosen genes. The same logistic regression model is then used for cancer classification. Fast implementation issues for these methods are discussed. The proposed methods are tested on several data sets, including those arising from hereditary breast cancer, small round blue-cell tumors, lymphoma, and acute leukemia. The experimental results indicate that the proposed methods achieve high classification accuracies on these data sets. Some robustness and sensitivity properties of the proposed methods are also discussed. Finally, combining logistic-regression-based gene selection with other classification methods, and logistic-regression-based classification with other gene-selection methods, are considered.
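The model-selection step can be sketched on synthetic data. This is only a single-gene BIC scan under toy assumptions; the paper's Bayesian posterior construction over gene subsets is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, iters=500, lr=0.1):
    # Plain gradient-ascent fit of logistic regression with an intercept.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)
    p = np.clip(1.0 / (1.0 + np.exp(-Xb @ w)), 1e-9, 1 - 1e-9)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return w, loglik

def bic(loglik, n_params, n_samples):
    # BIC = k * ln(n) - 2 * ln(L); lower is better.
    return n_params * np.log(n_samples) - 2.0 * loglik

# Toy expression matrix: 40 samples x 10 genes; gene 3 drives the class.
X = rng.normal(size=(40, 10))
y = (X[:, 3] + 0.1 * rng.normal(size=40) > 0).astype(float)

# Score each single-gene logistic model and keep the lowest BIC.
scores = [bic(fit_logistic(X[:, [g]], y)[1], 2, len(y)) for g in range(10)]
best_gene = int(np.argmin(scores))
```

Because the same logistic model does both selection and classification, the gene subset that minimizes the criterion can be reused directly as the classifier, which is the unifying point of the approach.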


2003 ◽  
Vol 21 (1) ◽  
pp. 123-135 ◽  
Author(s):  
S. Vignudelli ◽  
P. Cipollini ◽  
F. Reseghetti ◽  
G. Fusco ◽  
G. P. Gasparini ◽  
...  

Abstract. From September 1999 to December 2000, eXpendable Bathy-Thermograph (XBT) profiles were collected along the Genova-Palermo shipping route in the framework of the Mediterranean Forecasting System Pilot Project (MFSPP). The route is virtually coincident with track 0044 of the TOPEX/Poseidon satellite altimeter, crossing the Ligurian and Tyrrhenian basins in an approximate N–S direction. This allows a direct comparison between XBT and altimetry, whose findings are presented in this paper. XBT sections reveal the presence of the major features of the regional circulation, namely the eastern boundary of the Ligurian gyre, the Bonifacio gyre and the Modified Atlantic Water inflow along the Sicily coast. Twenty-two comparisons of steric heights derived from the XBT data set with concurrent realizations of single-pass altimetric heights are made. The overall correlation is around 0.55 with an RMS difference of less than 3 cm. In the Tyrrhenian Sea the spectra are remarkably similar in shape, but in general the altimetric heights contain more energy. This difference is explained in terms of oceanographic signals, which are captured with a different intensity by the satellite altimeter and XBTs, as well as computational errors. On scales larger than 100 km, the data sets are also significantly coherent, with increasing coherence values at longer wavelengths. The XBTs were dropped every 18–20 km along the track: as a consequence, the spacing scale was unable to resolve adequately the internal radius of deformation (< 20 km). Furthermore, few XBT drops were carried out in the Ligurian Sea, due to the limited north-south extent of this basin, so the comparison is problematic there. On the contrary, the major features observed in the XBT data in the Tyrrhenian Sea are also detected by TOPEX/Poseidon. The manuscript is completed by a discussion on how to integrate the two data sets, in order to extract additional information. 
In particular, the results emphasize their complementarity in providing a dynamically complete description of the observed structures.
Key words. Oceanography: general (descriptive and regional oceanography); Oceanography: physical (sea level variations; instruments and techniques)
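The basic along-track comparison reported above (pass-by-pass correlation and RMS difference) can be sketched on synthetic profiles; the shared-signal model and all numbers below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic along-track heights (cm): a shared oceanographic signal plus
# independent noise, standing in for XBT-derived steric height and
# TOPEX/Poseidon altimetric height along the same track.
n = 200
signal = np.cumsum(rng.normal(size=n)) * 0.5
steric = signal + rng.normal(scale=2.0, size=n)
altimetric = signal + rng.normal(scale=2.0, size=n)

def compare_tracks(a, b):
    # Pearson correlation and RMS difference between the two profiles.
    r = np.corrcoef(a, b)[0, 1]
    rms = np.sqrt(np.mean((a - b) ** 2))
    return r, rms

r, rms = compare_tracks(steric, altimetric)
```

The paper additionally compares wavenumber spectra and coherence between the two data sets, which would require detrending and Fourier analysis of each pass on top of this point-wise comparison.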


2022 ◽  
Vol 9 (1) ◽  
Author(s):  
Marcos Fabietti ◽  
Mufti Mahmud ◽  
Ahmad Lotfi

Abstract. Acquisition of neuronal signals involves a wide range of devices with specific electrical properties. Combined with other physiological sources within the body, the signals sensed by the devices are often distorted. Sometimes these distortions are visually identifiable; at other times, they overlap with the signal characteristics, making them very difficult to detect. To remove these distortions, the recordings are visually inspected and manually processed. However, this manual annotation process is time-consuming, and automatic computational methods are needed to identify and remove these artefacts. Most existing artefact removal approaches rely on additional information from other recorded channels and fail when global artefacts are present or when the affected channels constitute the majority of the recording system. Addressing this issue, this paper reports a novel channel-independent machine learning model to accurately identify and replace the artefactual segments present in the signals. Discarding these artefactual segments, as existing approaches do, causes discontinuities in the reproduced signals, which may introduce errors in subsequent analyses. To avoid this, the proposed method predicts multiple values for the artefactual region using a long short-term memory (LSTM) network to recreate the temporal and spectral properties of the recorded signal. The method has been tested on two open-access data sets and incorporated into the open-access SANTIA (SigMate Advanced: a Novel Tool for Identification of Artefacts in Neuronal Signals) toolbox for community use.
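The paper's filler is an LSTM; as a library-free stand-in, the same multi-step idea of predicting each artefactual sample recursively from the preceding samples can be illustrated with a linear autoregressive model. Everything below is a toy assumption, not the SANTIA implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy recording: a sine wave with an artefactual segment at [100, 120).
t = np.arange(300)
signal = np.sin(2 * np.pi * t / 50)
recorded = signal.copy()
recorded[100:120] += rng.normal(scale=5.0, size=20)  # injected artefact

def ar_fit(x, order):
    # Least-squares fit of x[t] ~ sum_i a_i * x[t-i] on clean data.
    rows = [x[i:len(x) - order + i] for i in range(order)]
    A = np.stack(rows[::-1], axis=1)  # lagged values, most recent first
    b = x[order:]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef

def fill_segment(x, start, stop, order=10):
    # Multi-step prediction: each artefactual sample is predicted from
    # the preceding (clean or already-filled) samples, keeping the
    # trace continuous instead of leaving a gap.
    coef = ar_fit(x[:start], order)
    filled = x.copy()
    for i in range(start, stop):
        filled[i] = coef @ filled[i - order:i][::-1]
    return filled

cleaned = fill_segment(recorded, 100, 120)
```

Replacing the segment with predicted values preserves continuity, which is exactly the property the paper highlights over simply discarding artefactual windows; the LSTM plays the role of the predictor here.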

