Cross-lingual Sentiment Lexicon Learning With Bilingual Word Graph Label Propagation

Dehong Gao; Furu Wei; Wenjie Li; Xiaohua Liu; Ming Zhou

doi:10.1162/coli_a_00207

Computational Linguistics ◽

10.1162/coli_a_00207 ◽

2015 ◽

Vol 41 (1) ◽

pp. 21-40 ◽

Cited By ~ 15

Author(s):

Dehong Gao ◽

Furu Wei ◽

Wenjie Li ◽

Xiaohua Liu ◽

Ming Zhou

Keyword(s):

Label Propagation ◽

Target Language ◽

Word Alignment ◽

Learning Problem ◽

Data Set ◽

Sentence Level ◽

Sentiment Lexicon ◽

Target Languages ◽

Cross Lingual ◽

Graph Label

In this article we address the task of cross-lingual sentiment lexicon learning, which aims to automatically generate sentiment lexicons for the target languages with available English sentiment lexicons. We formalize the task as a learning problem on a bilingual word graph, in which the intra-language relations among the words in the same language and the inter-language relations among the words between different languages are properly represented. With the words in the English sentiment lexicon as seeds, we propose a bilingual word graph label propagation approach to induce sentiment polarities of the unlabeled words in the target language. Particularly, we show that both synonym and antonym word relations can be used to build the intra-language relation, and that the word alignment information derived from bilingual parallel sentences can be effectively leveraged to build the inter-language relation. The evaluation of Chinese sentiment lexicon learning shows that the proposed approach outperforms existing approaches in both precision and recall. Experiments conducted on the NTCIR data set further demonstrate the effectiveness of the learned sentiment lexicon in sentence-level sentiment classification.
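The abstract does not give the update rule, so the following is only a minimal sketch of label propagation on a small bilingual word graph, not the authors' algorithm. It assumes a simple signed-averaging scheme: synonym and word-alignment edges carry positive weights, antonym edges carry negative weights, and English seed words stay clamped to their lexicon polarity. The function name `propagate`, the toy words, and all edge weights are hypothetical.

```python
from collections import defaultdict


def propagate(edges, seeds, iterations=20):
    """Spread seed polarities over a signed, weighted, undirected word graph.

    edges: iterable of (word_a, word_b, weight); a negative weight marks an
           antonym relation, a positive one a synonym or alignment relation.
    seeds: dict mapping seed words to a polarity (+1.0 or -1.0).
    Returns a dict mapping every word in the graph to a score in [-1, 1].
    """
    neighbors = defaultdict(list)
    for a, b, w in edges:
        neighbors[a].append((b, w))
        neighbors[b].append((a, w))

    scores = {word: seeds.get(word, 0.0) for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word, links in neighbors.items():
            if word in seeds:
                updated[word] = seeds[word]  # seeds keep their lexicon label
                continue
            total = sum(abs(w) for _, w in links)
            # Weighted average of neighbor scores; antonym edges flip the sign.
            updated[word] = sum(w * scores[n] for n, w in links) / total if total else 0.0
        scores = updated
    return scores


# Toy graph (assumed data): English seeds linked to Chinese words.
edges = [
    ("good", "great", 1.0),   # intra-language synonym (English)
    ("good", "bad", -1.0),    # intra-language antonym (English)
    ("good", "好", 1.0),      # inter-language word alignment
    ("bad", "坏", 1.0),       # inter-language word alignment
    ("好", "坏", -1.0),       # intra-language antonym (Chinese)
    ("好", "棒", 1.0),        # intra-language synonym (Chinese)
]
scores = propagate(edges, seeds={"good": 1.0, "bad": -1.0})
```

In this toy graph the unlabeled Chinese words receive polarity only through the alignment and synonym/antonym edges, which is the role the abstract assigns to the inter-language and intra-language relations.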

Enhanced Meta-Learning for Cross-Lingual Named Entity Recognition with Minimal Resources

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6466 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9274-9281

Author(s):

Qianhui Wu ◽

Zijia Lin ◽

Guoxin Wang ◽

Hui Chen ◽

Börje F. Karlsson ◽

...

Keyword(s):

Learning Algorithm ◽

Named Entity Recognition ◽

Entity Recognition ◽

Target Language ◽

Test Case ◽

Named Entity ◽

Meta Learning ◽

Target Languages ◽

Cross Lingual ◽

The Given

For languages with no annotated resources, transferring knowledge from rich-resource languages is an effective solution for named entity recognition (NER). While all existing methods directly transfer the source-learned model to a target language, in this paper we propose to fine-tune the learned model with a few similar examples given a test case, which could benefit the prediction by leveraging the structural and semantic information conveyed in such similar examples. To this end, we present a meta-learning algorithm to find a good model parameter initialization that can adapt quickly to the given test case, and we propose to construct multiple pseudo-NER tasks for meta-training by computing sentence similarities. To further improve the model's generalization ability across different languages, we introduce a masking scheme and augment the loss function with an additional maximum term during meta-training. We conduct extensive experiments on cross-lingual named entity recognition with minimal resources over five target languages. The results show that our approach significantly outperforms existing state-of-the-art methods across the board.

Multilingual Dependency Parsing: Using Machine Translated Texts Instead of Parallel Corpora

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0017 ◽

2014 ◽

Vol 102 (1) ◽

pp. 93-104

Author(s):

Ramasamy Loganathan ◽

Mareček David ◽

Žabokrtský Zdeněk

Keyword(s):

Machine Translation ◽

The Other ◽

Target Language ◽

Grammar Induction ◽

Language Resources ◽

Parallel Corpora ◽

Similar Performance ◽

Part Of Speech ◽

Target Languages ◽

Cross Lingual

This paper revisits the projection-based approach to the dependency grammar induction task. Traditional cross-lingual dependency induction approaches depend, one way or another, on the existence of bitexts or target language tools such as part-of-speech (POS) taggers to obtain reasonable parsing accuracy. In this paper, we transfer dependency parsers using only approximate resources, i.e., machine translated bitexts instead of manually created bitexts. We do this by obtaining the source side of the text from a machine translation (MT) system and then applying transfer approaches to induce a parser for the target languages. We further reduce the need for labeled target language resources by using an unsupervised target tagger. We show that our approach consistently outperforms unsupervised parsers by a wide margin (8.2% absolute) and yields performance similar to that of delexicalized transfer parsers.

A classification approach for detecting cross-lingual biomedical term translations

Natural Language Engineering ◽

10.1017/s1351324915000431 ◽

2015 ◽

Vol 23 (1) ◽

pp. 31-51 ◽

Cited By ~ 3

Author(s):

H. HAKAMI ◽

D. BOLLEGALA

Keyword(s):

Machine Translation ◽

Feature Space ◽

Target Language ◽

Average Precision ◽

Common Features ◽

Target Languages ◽

Cross Lingual ◽

Translation Accuracy ◽

Translation Systems ◽

Technical Terms

Finding translations for technical terms is an important problem in machine translation. In particular, in highly specialized domains such as biology or medicine, it is difficult to find bilingual experts to annotate sufficient cross-lingual texts in order to train machine translation systems. Moreover, new terms are constantly being generated in the biomedical community, which makes it difficult to keep the translation dictionaries up to date for all language pairs of interest. Given a biomedical term in one language (source language), we propose a method for detecting its translations in a different language (target language). Specifically, we train a binary classifier to determine whether two biomedical terms written in two languages are translations. Training such a classifier is often complicated due to the lack of common features between the source and target languages. We propose several feature space concatenation methods to successfully overcome this problem. Moreover, we study the effectiveness of contextual and character n-gram features for detecting term translations. Experiments conducted using a standard dataset for biomedical term translation show that the proposed method outperforms several competitive baseline methods in terms of mean average precision and top-k translation accuracy.

Reinforced Transformer with Cross-Lingual Distillation for Cross-Lingual Aspect Sentiment Classification

Electronics ◽

10.3390/electronics10030270 ◽

2021 ◽

Vol 10 (3) ◽

pp. 270

Author(s):

Hanqian Wu ◽

Zhike Wang ◽

Feng Qing ◽

Shoushan Li

Keyword(s):

General Purpose ◽

Sentiment Classification ◽

Training Data ◽

Target Language ◽

Source Language ◽

Domain Specific ◽

Novel Approach ◽

The Rich ◽

Target Languages ◽

Cross Lingual

Though great progress has been made on the Aspect-Based Sentiment Analysis (ABSA) task, most previous work focuses on English-based ABSA problems, and there are few efforts on other languages, mainly due to the lack of training data. In this paper, we propose an approach for performing a Cross-Lingual Aspect Sentiment Classification (CLASC) task which leverages the rich resources in one language (source language) for aspect sentiment classification in an under-resourced language (target language). Specifically, we first build a bilingual lexicon for domain-specific training data to translate the aspect category annotated in the source-language corpus and then translate sentences from the source language to the target language via Machine Translation (MT) tools. However, most MT systems are general-purpose, so they unavoidably introduce translation ambiguities which would degrade the performance of CLASC. In this context, we propose a novel approach called Reinforced Transformer with Cross-Lingual Distillation (RTCLD) combined with target-sensitive adversarial learning to minimize the undesirable effects of translation ambiguities in sentence translation. We conduct experiments on different language combinations, treating English as the source language and Chinese, Russian, and Spanish as target languages. The experimental results show that our proposed approach outperforms the state-of-the-art methods on different target languages.

Embedding Projection for Targeted Cross-lingual Sentiment: Model Comparisons and a Real-World Study

Journal of Artificial Intelligence Research ◽

10.1613/jair.1.11561 ◽

2019 ◽

Vol 66 ◽

Author(s):

Jeremy Barnes ◽

Roman Klinger

Keyword(s):

Sentiment Analysis ◽

State Of The Art ◽

Target Language ◽

Test Machine ◽

Fine Grained ◽

Sentence Level ◽

Level Information ◽

Cross Lingual ◽

Multiple Domains ◽

Embedding Methods

Sentiment analysis benefits from large, hand-annotated resources in order to train and test machine learning models, which are often data hungry. While some languages, e.g., English, have a vast array of these resources, most under-resourced languages do not, especially for fine-grained sentiment tasks, such as aspect-level or targeted sentiment analysis. To improve this situation, we propose a cross-lingual approach to sentiment analysis that is applicable to under-resourced languages and takes into account target-level information. This model incorporates sentiment information into bilingual distributional representations, by jointly optimizing them for semantics and sentiment, showing state-of-the-art performance at the sentence level when combined with machine translation. The adaptation to targeted sentiment analysis on multiple domains shows that our model outperforms other projection-based bilingual embedding methods on binary targeted sentiment tasks. Our analysis on ten languages demonstrates that the amount of unlabeled monolingual data has surprisingly little effect on the sentiment results. As expected, the choice of an annotated source language for projection to a target leads to better results for source-target language pairs which are similar. Therefore, our results suggest that more efforts should be spent on the creation of resources for less similar languages to those which are resource-rich already. Finally, a domain mismatch leads to a decreased performance. This suggests resources in any language should ideally cover varieties of domains.

Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons

Journal of Information Science ◽

10.1177/0165551517703514 ◽

2017 ◽

Vol 44 (4) ◽

pp. 491-511 ◽

Cited By ~ 32

Author(s):

Christopher SG Khoo ◽

Sathik Basha Johnkhan

Keyword(s):

Question Answering ◽

Pearson Correlation ◽

National Research Council ◽

General Purpose ◽

Product Review ◽

Data Set ◽

Sentence Level ◽

Sentiment Lexicon ◽

News Headlines ◽

Document Level

This article introduces a new general-purpose sentiment lexicon called WKWSCI Sentiment Lexicon and compares it with five existing lexicons: Hu & Liu Opinion Lexicon, Multi-perspective Question Answering (MPQA) Subjectivity Lexicon, General Inquirer, National Research Council Canada (NRC) Word-Sentiment Association Lexicon and Semantic Orientation Calculator (SO-CAL) lexicon. The effectiveness of the sentiment lexicons for sentiment categorisation at the document level and sentence level was evaluated using an Amazon product review data set and a news headlines data set. WKWSCI, MPQA, Hu & Liu and SO-CAL lexicons are equally good for product review sentiment categorisation, obtaining accuracy rates of 75%–77% when appropriate weights are used for different categories of sentiment words. However, when a training corpus is not available, Hu & Liu obtained the best accuracy with a simple-minded approach of counting positive and negative words for both document-level and sentence-level sentiment categorisation. The WKWSCI lexicon obtained the best accuracy of 69% on the news headlines sentiment categorisation task, and the sentiment strength values obtained a Pearson correlation of 0.57 with human-assigned sentiment values. It is recommended that the Hu & Liu lexicon be used for product review texts and the WKWSCI lexicon for non-review texts.
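The "simple-minded approach of counting positive and negative words" mentioned above can be sketched as follows. This is an illustrative assumption about how such a counter might look, not the evaluation code from the article; the function name `classify` and the tiny word sets are hypothetical and are not entries taken from any of the six lexicons.

```python
def classify(text, positive, negative):
    """Label a text by comparing counts of positive and negative lexicon words."""
    # Lowercase, split on whitespace, and strip common punctuation from tokens.
    tokens = [t.strip(".,!?;:\"'()") for t in text.lower().split()]
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Toy lexicon (assumed entries, for illustration only).
positive_words = {"great", "excellent", "good"}
negative_words = {"poor", "bad", "terrible"}

label = classify("Great phone, excellent battery but poor screen.",
                 positive_words, negative_words)
```

Because this variant needs no training corpus, it corresponds to the unweighted setting described in the abstract; the weighted setting would replace the raw counts with per-category weights learned from labeled data.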

Monolingual and Cross-Lingual Intent Detection without Training Data in Target Languages

Electronics ◽

10.3390/electronics10121412 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1412

Author(s):

Jurgita Kapočiūtė-Dzikienė ◽

Askars Salimbajevs ◽

Raivis Skadiņš

Keyword(s):

Experimental Investigation ◽

Training Data ◽

Fine Tuning ◽

Target Language ◽

Learning Approach ◽

Lazy Learning ◽

Detection Problem ◽

Target Languages ◽

Cross Lingual ◽

Similar Accuracy

Due to recent DNN advancements, many NLP problems can be effectively solved using transformer-based models and supervised data. Unfortunately, such data is not available in some languages. This research is based on the assumptions that (1) training data can be obtained by machine translating it from another language; (2) there are cross-lingual solutions that work without training data in the target language. Consequently, in this research, we use the English dataset and solve the intent detection problem for five target languages (German, French, Lithuanian, Latvian, and Portuguese). When seeking the most accurate solutions, we investigate BERT-based word and sentence transformers together with eager learning classifiers (CNN, BERT fine-tuning, FFNN) and a lazy learning approach (cosine similarity as the memory-based method). We offer and evaluate several strategies to overcome the data scarcity problem with machine translation, cross-lingual models, and a combination of the previous two. The experimental investigation revealed the robustness of sentence transformers under various cross-lingual conditions. The accuracy of ~0.842 achieved on the English dataset with completely monolingual models is considered our top-line. However, cross-lingual approaches demonstrate similar accuracy levels, reaching ~0.831, ~0.829, ~0.853, ~0.831, and ~0.813 on German, French, Lithuanian, Latvian, and Portuguese, respectively.
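As a rough illustration of the memory-based (lazy learning) method named in this abstract, the sketch below assigns a query the intent of its most cosine-similar stored example. It assumes sentence embeddings are already available as plain lists of floats; the toy three-dimensional vectors, the intent labels, and the function names `cosine` and `predict_intent` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_intent(query_vec, memory):
    """Return the label of the stored example closest to the query.

    memory: list of (embedding, intent_label) pairs kept from the training data.
    """
    _, best_label = max(memory, key=lambda item: cosine(query_vec, item[0]))
    return best_label


# Toy memory (assumed embeddings; a real system would use sentence-transformer outputs).
memory = [
    ([1.0, 0.0, 0.1], "greeting"),
    ([0.0, 1.0, 0.2], "booking"),
]
intent = predict_intent([0.9, 0.1, 0.0], memory)
```

Nothing is trained here: all work happens at query time, which is what distinguishes this lazy approach from the eager classifiers listed in the abstract.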

New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies

Natural Language Engineering ◽

10.1017/s1351324917000377 ◽

2017 ◽

Vol 24 (1) ◽

pp. 91-122 ◽

Cited By ~ 2

Author(s):

MARCOS GARCIA ◽

CARLOS GÓMEZ-RODRÍGUEZ ◽

MIGUEL A. ALONSO

Keyword(s):

Direct Application ◽

Lessons Learned ◽

Target Language ◽

Romance Languages ◽

Manual Annotation ◽

Training Corpus ◽

Target Languages ◽

Cross Lingual ◽

The Impact ◽

Selection Of

This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, analyzing its performance when compared to the use of manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks or the adaptation of resources to the target language, among others. The results of these evaluations show that the direct application of a parser from one Romance language to another reaches similar labeled attachment score (LAS) values to those obtained with a manual annotation of about 3,000 tokens in the target language, and unlabeled attachment score (UAS) results equivalent to the use of around 7,000 tokens, depending on the case. These numbers can noticeably increase by performing a focused selection of the source treebanks. Furthermore, the removal of the words in the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned with the performed experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations in this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than using more than 16,000 (for LAS results) and more than 20,000 (UAS) manually labeled tokens of Galician.
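For readers unfamiliar with the two metrics reported above, the sketch below computes them under their standard definitions: UAS is the share of tokens whose predicted head is correct, and LAS additionally requires the correct dependency label. The function name `attachment_scores` and the three-token example are hypothetical; details such as punctuation handling vary between evaluation scripts and are ignored here.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head_index, relation) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    # UAS counts tokens with the right head; LAS also requires the right label.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    return correct_heads / len(gold), correct_both / len(gold)


# Toy sentence with three tokens (assumed annotations).
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
predicted = [(2, "nsubj"), (0, "root"), (2, "nmod")]  # last label is wrong
uas, las = attachment_scores(gold, predicted)
```

In the example every head is attached correctly but one label is wrong, so UAS is 1.0 while LAS is 2/3, which shows why LAS can never exceed UAS.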

Synthetic Treebanking for Cross-Lingual Dependency Parsing

Journal of Artificial Intelligence Research ◽

10.1613/jair.4785 ◽

2016 ◽

Vol 55 ◽

pp. 209-248 ◽

Cited By ~ 7

Author(s):

Jörg Tiedemann ◽

Zeljko Agić

Keyword(s):

Machine Translation ◽

Target Language ◽

Dependency Parsing ◽

Practical Applications ◽

Source Language ◽

Part Of Speech ◽

Statistical Dependency ◽

Target Languages ◽

Cross Lingual ◽

The Impact

How do we parse the languages for which no treebanks are available? This contribution addresses the cross-lingual viewpoint on statistical dependency parsing, in which we attempt to make use of resource-rich source language treebanks to build and adapt models for the under-resourced target languages. We outline the benefits, and indicate the drawbacks of the current major approaches. We emphasize synthetic treebanking: the automatic creation of target language treebanks by means of annotation projection and machine translation. We present competitive results in cross-lingual dependency parsing using a combination of various techniques that contribute to the overall success of the method. We further include a detailed discussion about the impact of part-of-speech label accuracy on parsing results that provide guidance in practical applications of cross-lingual methods for truly under-resourced languages.

Emoji-Powered Representation Learning for Cross-Lingual Sentiment Classification (Extended Abstract)

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/649 ◽

2020 ◽

Author(s):

Zhenpeng Chen ◽

Sheng Shen ◽

Ziniu Hu ◽

Xuan Lu ◽

Qiaozhu Mei ◽

...

Keyword(s):

Machine Translation ◽

Representation Learning ◽

Sentiment Classification ◽

Target Language ◽

Learning Method ◽

Source Language ◽

Translation Tools ◽

Target Languages ◽

Cross Lingual ◽

Cross Language

Sentiment classification typically relies on a large amount of labeled data. In practice, the availability of labels is highly imbalanced among different languages. To tackle this problem, cross-lingual sentiment classification approaches aim to transfer knowledge learned from one language that has abundant labeled examples (i.e., the source language, usually English) to another language with fewer labels (i.e., the target language). The source and the target languages are usually bridged through off-the-shelf machine translation tools. Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages. This approach, however, often fails to capture sentiment knowledge specific to the target language. In this paper, we employ emojis, which are widely available in many languages, as a new channel to learn both the cross-language and the language-specific sentiment patterns. We propose a novel representation learning method that uses emoji prediction as an instrument to learn respective sentiment-aware representations for each language. The learned representations are then integrated to facilitate cross-lingual sentiment classification.
