Learning Structural Kernels for Natural Language Processing

2015 ◽  
Vol 3 ◽  
pp. 461-473 ◽  
Author(s):  
Daniel Beck ◽  
Trevor Cohn ◽  
Christian Hardmeier ◽  
Lucia Specia

Structural kernels are a flexible learning paradigm that has been widely used in Natural Language Processing. However, the problem of model selection in kernel-based methods is usually overlooked. Previous approaches mostly rely on setting default values for kernel hyperparameters or using grid search, which is slow and coarse-grained. In contrast, Bayesian methods allow efficient model selection by maximizing the evidence on the training data through gradient-based methods. In this paper we show how to perform this in the context of structural kernels by using Gaussian Processes. Experimental results on tree kernels show that this procedure results in better prediction performance compared to hyperparameter optimization via grid search. The framework proposed in this paper can be adapted to other structures besides trees, e.g., strings and graphs, thereby extending the utility of kernel-based methods.
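The evidence maximization mentioned in the abstract can be written out for the standard Gaussian Process regression case. The notation below is assumed for illustration and is not taken from the abstract: $K_\theta$ is the Gram matrix of a kernel (for example a tree kernel) with hyperparameters $\theta$, $\sigma_n^2$ is a Gaussian noise variance, and $n$ is the number of training instances.

```latex
\begin{align}
\log p(\mathbf{y} \mid X, \theta)
  &= -\tfrac{1}{2}\,\mathbf{y}^{\top} K_y^{-1}\,\mathbf{y}
     - \tfrac{1}{2}\log\lvert K_y\rvert
     - \tfrac{n}{2}\log 2\pi,
  \qquad K_y = K_\theta + \sigma_n^2 I, \\
\frac{\partial}{\partial\theta_j}\log p(\mathbf{y} \mid X, \theta)
  &= \tfrac{1}{2}\,\mathbf{y}^{\top} K_y^{-1}
     \frac{\partial K_y}{\partial\theta_j} K_y^{-1}\,\mathbf{y}
     - \tfrac{1}{2}\operatorname{tr}\!\left(K_y^{-1}
     \frac{\partial K_y}{\partial\theta_j}\right).
\end{align}
```

Under these assumptions, a gradient-based optimizer only needs the kernel values and their derivatives with respect to each hyperparameter, which is what makes the approach applicable to any structure whose kernel is differentiable in its hyperparameters.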

Author(s):  
Warnia Nengsih ◽  
M. Mahrus Zein ◽  
Nazifa Hayati

Sentiment analysis is a method for extracting opinion data from the many platforms available on the internet. Advances in technology allow machines to recognize whether a term expresses a positive or a negative opinion. Such data and opinions serve as important feedback on products, services, and other topics. Without having to collect opinions from the public directly, providers obtain evaluations that are important for their own improvement. The hotel business is a service field centered on serving customers. The sustainability of this business also depends on customer feedback, which serves as a reference for strategic policy decisions. Sentiment analysis techniques based on Natural Language Processing can address this problem. In this paper, predictions are made using a Random Forest (RF) classifier, and the Receiver Operating Characteristic (ROC) curve is used to summarize the quality of the classifier. The higher the curve lies above the diagonal line, the better the prediction; the ROC curve value obtained is 0.90. The reviews show that customer opinions on the services provided by the hotel fall in the positive category more often than in the negative category. In terms of polarity, 68% of customer reviews are positive and 32% are negative.
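One common way to summarize an ROC curve in a single number is the area under it (AUC), which equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly. It is illustrative only: the abstract does not state that its reported value of 0.90 is an AUC, and the labels and scores here are made up.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six reviews (1 = positive review).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(roc_auc(labels, scores))
```

A value of 0.5 corresponds to the diagonal line (random guessing), and 1.0 to a perfect ranking.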


2020 ◽  
Vol 34 (05) ◽  
pp. 8504-8511
Author(s):  
Arindam Mitra ◽  
Ishan Shrivastava ◽  
Chitta Baral

Natural Language Inference (NLI) plays an important role in many natural language processing tasks such as question answering. However, existing NLI modules trained on existing NLI datasets have several drawbacks. For example, they do not capture the notions of entity and role well and often make mistakes such as concluding that “Peter signed a deal” can be inferred from “John signed a deal”. As part of this work, we have developed two datasets that help mitigate such issues and make systems better at understanding the notions of “entities” and “roles”. After training the existing models on the new dataset, we observe that they do not perform well on one of the new benchmarks. We then propose a modification to the “word-to-word” attention function, which has been uniformly reused across several popular NLI architectures. The resulting models perform as well as their unmodified counterparts on the existing benchmarks and perform significantly better on the new benchmarks that emphasize “roles” and “entities”.


2020 ◽  
pp. 016555152096278
Author(s):  
Rouzbeh Ghasemi ◽  
Seyed Arad Ashrafi Asli ◽  
Saeedeh Momtazi

With the advent of deep neural models in natural language processing tasks, having a large amount of training data plays an essential role in achieving accurate models. Creating valid training data, however, is challenging in many low-resource languages. This problem results in a significant gap between the accuracy of available natural language processing tools for low-resource languages and for resource-rich languages. To address this problem for sentiment analysis in the Persian language, we propose a cross-lingual deep learning framework that benefits from the available training data in English. We use cross-lingual embeddings to cast sentiment analysis as a transfer learning problem, transferring a model from a resource-rich language to low-resource ones. Our model can use any cross-lingual word embedding model and any deep architecture for text classification. Our experiments on the English Amazon dataset and the Persian Digikala dataset, using two different embedding models and four different classification networks, show the superiority of the proposed model over state-of-the-art monolingual techniques. In our experiments, the performance of Persian sentiment analysis improves by 22% with static embeddings and by 9% with dynamic embeddings. Our proposed model is general and language-independent; that is, it can be used for any low-resource language once a cross-lingual embedding is available for the source–target language pair. Moreover, when word-aligned cross-lingual embeddings are used, the only data required for a reliable cross-lingual embedding is a bilingual dictionary, which is available between almost all languages and English as a potential source language.


2019 ◽  
Author(s):  
Peng Su ◽  
Gang Li ◽  
Cathy Wu ◽  
K. Vijay-Shanker

Significant progress has recently been made in applying deep learning to natural language processing tasks. However, deep learning models typically require a large amount of annotated training data, while often only small labeled datasets are available for many natural language processing tasks in biomedical literature. Building large datasets for deep learning is expensive since it involves considerable human effort and usually requires domain expertise in specialized fields. In this work, we consider augmenting manually annotated data with large amounts of data obtained using distant supervision. Because data obtained by distant supervision is often noisy, we first apply heuristics to remove some of the incorrect annotations. Then, using methods inspired by transfer learning, we show that the resulting models outperform models trained on the original manually annotated sets.


2021 ◽  
Vol 50 (3) ◽  
pp. 27-28
Author(s):  
Immanuel Trummer

Introduction. We have seen significant advances in the state of the art in natural language processing (NLP) over the past few years [20]. These advances have been driven by new neural network architectures, in particular the Transformer model [19], as well as the successful application of transfer learning approaches to NLP [13]. Typically, training for specific NLP tasks starts from large language models that have been pre-trained on generic tasks (e.g., predicting obfuscated words in text [5]) for which large amounts of training data are available. Using such models as a starting point reduces task-specific training cost as well as the number of required training samples by orders of magnitude [7]. These advances motivate new use cases for NLP methods in the context of databases.


2021 ◽  
Vol 15 ◽  
Author(s):  
Nora Hollenstein ◽  
Cedric Renggli ◽  
Benjamin Glaus ◽  
Maria Barrett ◽  
Marius Troendle ◽  
...  

Until recently, human behavioral data from reading has mainly been of interest to researchers seeking to understand human cognition. However, these human language processing signals can also be beneficial in machine learning-based natural language processing tasks. Using EEG brain activity for this purpose remains largely unexplored. In this paper, we present the first large-scale study systematically analyzing the potential of EEG brain activity data for improving natural language processing tasks, with a special focus on which features of the signal are most beneficial. We present a multi-modal machine learning architecture that learns jointly from textual input and from EEG features. We find that filtering the EEG signals into frequency bands is more beneficial than using the broadband signal. Moreover, for a range of word embedding types, EEG data improves binary and ternary sentiment classification and outperforms multiple baselines. For more complex tasks such as relation detection, only the contextualized BERT embeddings outperform the baselines in our experiments, which indicates the need for further research. Finally, EEG data proves particularly promising when limited training data is available.


2021 ◽  
Vol 45 (10) ◽  
Author(s):  
A. W. Olthof ◽  
P. M. A. van Ooijen ◽  
L. J. Cornelissen

In radiology, natural language processing (NLP) allows the extraction of valuable information from radiology reports. It can be used for various downstream tasks such as quality improvement, epidemiological research, and monitoring guideline adherence. Class imbalance, variation in dataset size, variation in report complexity, and algorithm type all influence NLP performance but have not yet been systematically and interrelatedly evaluated. In this study, we investigate the effect of these factors on the performance of four types of deep learning-based NLP models: a fully connected neural network (Dense), a long short-term memory recurrent neural network (LSTM), a convolutional neural network (CNN), and Bidirectional Encoder Representations from Transformers (BERT). Two datasets consisting of radiologist-annotated reports of trauma radiographs (n = 2469) and of chest radiographs and computed tomography (CT) studies (n = 2255) were split into training sets (80%) and testing sets (20%). The training data was used to train all four model types in 84 experiments (Fracture-data) and 45 experiments (Chest-data) with variation in size and prevalence. Performance was evaluated on sensitivity, specificity, positive predictive value, negative predictive value, area under the curve, and F score. All four model architectures demonstrated high performance, with metrics reaching above 0.90. CNN, LSTM, and Dense were outperformed by the BERT algorithm because of its stable results despite variation in training size and prevalence. Awareness of variation in prevalence is warranted because it impacts sensitivity and specificity in opposite directions.
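Most of the evaluation metrics listed in the abstract follow directly from the four cells of a binary confusion matrix. The sketch below shows one way to compute them. The function name, the 0/1 label encoding, and the choice of F1 as the F score are assumptions made for illustration; the example labels are invented, and the area under the curve is left out because it needs scores rather than hard predictions.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return NaN instead of raising when a metric is undefined.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical labels for eight reports (1 = finding present).
truth = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 0, 1, 0]
for name, value in binary_metrics(truth, preds).items():
    print(f"{name}: {value:.3f}")
```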


2018 ◽  
Vol 6 ◽  
pp. 571-585 ◽  
Author(s):  
Silviu Paun ◽  
Bob Carpenter ◽  
Jon Chamberlain ◽  
Dirk Hovy ◽  
Udo Kruschwitz ◽  
...  

The analysis of crowdsourced annotations in natural language processing is concerned with identifying (1) gold standard labels, (2) annotator accuracies and biases, and (3) item difficulties and error patterns. Traditionally, majority voting was used for 1, and coefficients of agreement for 2 and 3. Lately, model-based analysis of corpus annotations has proven better at all three tasks, but there has been relatively little work comparing such models on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.
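The majority-voting baseline that the abstract contrasts with model-based approaches can be sketched in a few lines. The data layout (item, annotator, label triples), the function name, and the tie-breaking rule are assumptions for illustration, not details from the paper.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Pick the most frequent label per item.

    annotations: iterable of (item_id, annotator_id, label) triples.
    Ties go to the label seen first for that item.
    """
    labels_by_item = defaultdict(list)
    for item_id, _annotator_id, label in annotations:
        labels_by_item[item_id].append(label)
    return {
        item_id: Counter(labels).most_common(1)[0][0]
        for item_id, labels in labels_by_item.items()
    }


# Hypothetical crowdsourced labels for two sentences from three annotators.
crowd = [
    ("s1", "ann_a", "POS"), ("s1", "ann_b", "POS"), ("s1", "ann_c", "NEG"),
    ("s2", "ann_a", "NEG"), ("s2", "ann_b", "NEG"), ("s2", "ann_c", "POS"),
]
print(majority_vote(crowd))
```

Majority voting treats every annotator as equally reliable, which is the limitation that models of annotator ability and item difficulty are meant to address.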


Author(s):  
Piotr Bojanowski ◽  
Edouard Grave ◽  
Armand Joulin ◽  
Tomas Mikolov

Continuous word representations trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly, and it allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
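The idea of representing a word as the sum of its character n-gram vectors can be illustrated with a small sketch. Several details here are assumptions that the abstract does not specify: the `<` and `>` boundary markers, the n-gram length range, the hashing of n-grams into a fixed-size table, and the random (untrained) vectors standing in for learned ones.

```python
import random
import zlib


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def word_vector(word, table, n_min=3, n_max=6):
    """Sum the vectors of the word's n-grams, looked up by hashed bucket."""
    vec = [0.0] * len(table[0])
    for gram in char_ngrams(word, n_min, n_max):
        row = table[zlib.crc32(gram.encode("utf-8")) % len(table)]
        vec = [v + r for v, r in zip(vec, row)]
    return vec


# Hypothetical, untrained n-gram table: 1000 buckets of 8-dimensional vectors.
rng = random.Random(0)
table = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]

print(char_ngrams("where", 3, 3))
print(len(word_vector("whereabouts", table)))
```

Because the vector is built from n-grams rather than looked up by the whole word, any string gets a representation, including words that never appeared in the training data.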

