UPC: An Open Word-Sense Annotated Parallel Corpora for Machine Translation Study

2020, Vol 10 (11), pp. 3904
Author(s): Van-Hai Vu, Quang-Phuoc Nguyen, Joon-Choul Shin, Cheol-Young Ock

Machine translation (MT) has recently attracted much research on advanced techniques (e.g., statistical and deep learning-based approaches) and has achieved strong results for popular languages. However, research involving low-resource languages such as Korean often suffers from a lack of openly available bilingual language resources. In this research, we built open, extensive parallel corpora for training MT models, named the Ulsan Parallel Corpora (UPC). Currently, UPC contains two parallel corpora: a Korean-English and a Korean-Vietnamese dataset. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes word-ambiguity issues in MT. To address this problem, we developed a powerful word-sense annotation system named UTagger, based on a combination of sub-word conditional probabilities and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical and deep learning-based neural MT systems. The experimental results demonstrate that high-quality MT systems can be built using UPC, as measured by Bilingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores. Both UPC and UTagger are available for free download and use.
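The idea of sense-annotating a corpus before MT training can be illustrated with a minimal sketch. This is not UTagger's actual algorithm or interface: the sense inventory, the context-cue heuristic, and the `word__NN` code format below are hypothetical stand-ins for the sub-word probability and knowledge-based components described above.

```python
# Minimal sketch (not UTagger's actual algorithm or interface): tag Korean
# homographs with sense codes before the corpus is used for MT training.
# The sense inventory, context-cue heuristic and "word__NN" code format
# are illustrative assumptions.

SENSE_DICT = {
    # homograph -> {context cue -> sense code}
    "배": {"바다": "배__01",    # ship
           "과일": "배__02",    # pear
           "아프다": "배__03"},  # belly
}

def annotate_senses(tokens, window=3):
    """Append a sense code to each known homograph, chosen by a simple
    context-cue lookup within a small window around the word."""
    tagged = []
    for i, tok in enumerate(tokens):
        senses = SENSE_DICT.get(tok)
        if senses is None:
            tagged.append(tok)
            continue
        context = tokens[max(0, i - window): i + window + 1]
        code = next((c for cue, c in senses.items() if cue in context),
                    tok + "__00")   # fall back to an "unknown sense" code
        tagged.append(code)
    return tagged

print(" ".join(annotate_senses("나는 바다 에서 배 를 탔다".split())))
# -> 나는 바다 에서 배__01 를 탔다
```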

Algorithms, 2019, Vol 12 (1), pp. 26
Author(s): Despoina Mouratidis, Katia Kermanidis

Machine translation is used in many applications in everyday life. Due to the growing number of translated documents that need to be classified as useful or not (for building a translation model), the automated categorization of texts (classification) is a popular research field in machine learning. This kind of information can be quite helpful for machine translation. Our parallel corpora (English-Greek and English-Italian) are based on educational data, which are quite difficult to translate. We apply two state-of-the-art architectures, Random Forest (RF) and Deeplearning4j (DL4J), to our data (which comprise three translation outputs). To our knowledge, this is the first time that deep learning architectures have been applied to the automatic selection of parallel data. We also propose new string-based features that appear to be effective for the classifier, and we investigate whether an attribute selection method could improve classification accuracy. Experimental results indicate an increase of up to 4% (compared to our previous work) using RF and rather satisfactory results using DL4J.
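As a rough illustration of the approach (not the authors' exact feature set or data), the sketch below derives a few string-based features from a candidate MT output and a reference translation and trains a scikit-learn Random Forest to label the pair as useful or not. The feature choices, toy sentences, and labels are assumptions made for the example.

```python
# A hedged sketch: simple string-based features over a (reference, candidate)
# pair, fed to a Random Forest that labels the pair as useful (1) or not (0).
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def string_features(reference, candidate):
    r_toks, c_toks = reference.split(), candidate.split()
    return [
        len(c_toks) / max(len(r_toks), 1),                    # length ratio
        SequenceMatcher(None, reference, candidate).ratio(),  # character similarity
        len(set(r_toks) & set(c_toks)),                       # shared tokens
    ]

# Tiny illustrative training set: (reference, candidate, label)
pairs = [
    ("the exam starts at nine", "the exam starts at nine am", 1),
    ("the exam starts at nine", "cats are nice animals", 0),
    ("open your textbook to page ten", "open your textbook at page ten", 1),
    ("open your textbook to page ten", "ten", 0),
]
X = [string_features(r, c) for r, c, _ in pairs]
y = [label for _, _, label in pairs]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([string_features("the exam starts at nine",
                                   "the exam begins at nine")]))
```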


2014, Vol 102 (1), pp. 93-104
Author(s): Ramasamy Loganathan, Mareček David, Žabokrtský Zdeněk

Abstract This paper revisits the projection-based approach to the dependency grammar induction task. Traditional cross-lingual dependency induction approaches depend, in one way or another, on the existence of bitexts or target-language tools such as part-of-speech (POS) taggers to obtain reasonable parsing accuracy. In this paper, we transfer dependency parsers using only approximate resources, i.e., machine-translated bitexts instead of manually created ones. We do this by obtaining the source side of the text from a machine translation (MT) system and then applying transfer approaches to induce parsers for the target languages. We further reduce the need for labeled target-language resources by using an unsupervised target-language tagger. We show that our approach consistently outperforms unsupervised parsers by a large margin (8.2% absolute) and achieves performance similar to that of delexicalized transfer parsers.
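The core projection step can be sketched briefly. The following is a simplified illustration under strong assumptions (one-to-one word alignments, no handling of unaligned or conflicting links), not the authors' full transfer procedure: each dependency edge from the source-side parse is copied onto the aligned target words.

```python
# Simplified annotation projection for dependency transfer: given a bitext
# (possibly machine translated) with word alignments and a source-side parse,
# copy each dependency edge onto the aligned target words.
# The alignment format and example data are illustrative.

def project_heads(source_heads, alignment, target_len):
    """source_heads[i] = index of the head of source word i (-1 for root).
    alignment: dict mapping source index -> target index (assumed one-to-one).
    Returns projected target heads; unaligned target words get -2 (unknown)."""
    target_heads = [-2] * target_len
    for s_dep, s_head in enumerate(source_heads):
        t_dep = alignment.get(s_dep)
        if t_dep is None:
            continue                               # source word unaligned
        if s_head == -1:
            target_heads[t_dep] = -1               # projected root
        elif s_head in alignment:
            target_heads[t_dep] = alignment[s_head]
    return target_heads

# English "She reads books" projected onto a target word order 0-1-2
source_heads = [1, -1, 1]        # She <- reads (root), books <- reads
alignment = {0: 0, 1: 2, 2: 1}   # She->0, reads->2, books->1
print(project_heads(source_heads, alignment, 3))   # [2, 2, -1]
```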


2014, Vol 981, pp. 153-156
Author(s): Chun Xiang Zhang, Long Deng, Xue Yao Gao, Li Li Guo

Word sense disambiguation is key to many application problems in natural language processing. In this paper, a dedicated word sense disambiguation classifier is introduced into a machine translation system in order to improve the quality of the output translations. First, the translation of the ambiguous word is deleted from the machine translation of the Chinese sentence. Second, the ambiguous word is disambiguated, with the classification labels being its candidate translations. Third, the two results are combined. Fifty Chinese sentences containing ambiguous words are collected for test experiments. Experimental results show that translation quality is improved after the proposed method is applied.
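A minimal sketch of the three steps follows. A keyword rule stands in for the actual WSD classifier, the insertion position is handled naively, and the sense inventory and sentences are hypothetical.

```python
# A minimal sketch of the three steps above. A keyword rule stands in for the
# WSD classifier, and the position where the chosen translation is re-inserted
# is handled naively; the sense inventory and sentences are hypothetical.

CANDIDATES = {"打": ["hit", "play", "make"]}   # illustrative candidate translations

def disambiguate(word, source_sentence):
    """Stand-in WSD classifier: the label it returns is an English translation."""
    if word == "打" and "篮球" in source_sentence:   # "basketball" context
        return "play"
    return CANDIDATES.get(word, [word])[0]

def retranslate(source, mt_output, ambiguous_word, wrong_translation):
    # Step 1: delete the ambiguous word's translation from the MT output.
    stripped = [w for w in mt_output.split() if w != wrong_translation]
    # Step 2: disambiguate; the classification label is a candidate translation.
    chosen = disambiguate(ambiguous_word, source)
    # Step 3: combine the two results (insertion position handled naively).
    return " ".join(stripped[:1] + [chosen] + stripped[1:])

print(retranslate("他 打 篮球", "he hit basketball", "打", "hit"))
# -> he play basketball
```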


2021, pp. 1-41
Author(s): Panagiotis Kouris, Georgios Alexandridis, Andreas Stafylopatis

Abstract Nowadays, most research in the field of abstractive text summarization focuses on neural models alone, without considering their combination with knowledge-based approaches that could further enhance their performance. In this direction, this work presents a novel framework that combines sequence-to-sequence neural text summarization with structure- and semantics-based methodologies. The proposed framework is capable of dealing with the problem of out-of-vocabulary or rare words, improving the performance of the deep learning models. The overall methodology is based on a well-defined theoretical model of knowledge-based content generalization and deep learning predictions for generating abstractive summaries. The framework comprises three key elements: (i) a pre-processing task, (ii) a machine learning methodology, and (iii) a post-processing task. The pre-processing task is a knowledge-based approach, based on ontological knowledge resources, word-sense disambiguation and named-entity recognition, along with content generalization, that transforms ordinary text into a generalized form. A deep learning model with an attentive encoder-decoder architecture, expanded with a copying and coverage mechanism as well as reinforcement learning and transformer-based architectures, is trained on a generalized version of text-summary pairs, learning to predict summaries in a generalized form. The post-processing task utilizes knowledge resources, word embeddings, word-sense disambiguation and heuristic algorithms based on text similarity methods in order to transform the generalized version of a predicted summary into a final, human-readable form. An extensive experimental procedure on three popular datasets evaluates key aspects of the proposed framework, and the obtained results exhibit promising performance, validating the robustness of the proposed approach.
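The pre-processing (content generalization) step can be illustrated with a minimal sketch. The vocabulary, the hypernym mapping, and the tag names below are toy stand-ins for the ontological resources, word-sense disambiguation and named-entity recognition the framework actually relies on.

```python
# Toy content generalization: replace rare or out-of-vocabulary words with a
# coarser concept label, so the summarizer sees a smaller vocabulary.
# VOCAB and HYPERNYMS are illustrative stand-ins for real knowledge resources.

VOCAB = {"the", "company", "released", "a", "new", "phone", "in"}
HYPERNYMS = {"cupertino": "LOCATION", "iphone": "DEVICE", "apple": "ORGANIZATION"}

def generalize(tokens):
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in VOCAB:
            out.append(low)
        else:
            out.append(HYPERNYMS.get(low, "CONCEPT"))  # generic fallback tag
    return out

sentence = "Apple released a new iPhone in Cupertino".split()
print(generalize(sentence))
# ['ORGANIZATION', 'released', 'a', 'new', 'DEVICE', 'in', 'LOCATION']
```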


2019, Vol 55 (2), pp. 469-490
Author(s): Krzysztof Wołk, Agnieszka Wołk, Krzysztof Marasek

Abstract Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. The manual creation of parallel corpora by humans is rather expensive and time-consuming, while the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to be used to initiate the research process. On the other hand, applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is very complicated and produces rather noisy output, which later needs to be further processed and requires in-domain adaptation. To optimize the performance of comparable corpora mining algorithms, it is essential to use a quality parallel corpus for training a good data classifier. In this research, we have developed a methodology for generating an accurate parallel corpus (Czech-English, Polish-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We have created translations of large, single-language resources by applying multiple translation systems and strictly measuring translation compatibility using rules based on the Levenshtein distance. The results produced by this approach were very favorable. The generated corpora successfully improved the quality of SMT systems and seem to be useful for many other natural language processing tasks.
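The compatibility check can be sketched as follows. The acceptance rule and the 0.7 threshold are illustrative assumptions, not the authors' exact Levenshtein-based rules; the sketch only shows how pairwise edit-distance similarity between the outputs of several MT systems could gate which sentences enter the generated corpus.

```python
# Hedged sketch: translate a monolingual sentence with three MT systems,
# measure pairwise Levenshtein similarity, and accept the sentence only when
# the outputs agree closely enough. Threshold and rule are illustrative.
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

def compatible(outputs, threshold=0.7):
    """Accept when every pair of system outputs is similar enough."""
    return all(similarity(x, y) >= threshold for x, y in combinations(outputs, 2))

outputs = ["the cat sat on the mat",
           "the cat sat on a mat",
           "the cat is sitting on the mat"]
print(compatible(outputs))
```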


Author(s): Lieve Macken, Els Lefever

In this paper, we will describe the current state of the art in Statistical Machine Translation (SMT) and reflect on how SMT handles meaning. Statistical Machine Translation is a corpus-based approach to MT: it derives the knowledge required to generate new translations from corpora. General-purpose SMT systems do not use any formal semantic representation. Instead, they directly extract translationally equivalent words or word sequences (expressions with the same meaning) from bilingual parallel corpora. All statistical translation models are based on the idea of word alignment, i.e., the automatic linking of corresponding words in parallel texts. The first-generation SMT systems were word-based. From a linguistic point of view, the major problem with word-based systems is that the meaning of a word is often ambiguous and is determined by its context. Current state-of-the-art SMT systems try to capture local contextual dependencies by using phrases instead of words as units of translation. In order to solve more complex ambiguity problems (where a broader text scope or even domain information is needed), a Word Sense Disambiguation (WSD) module is integrated into the Machine Translation environment.
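Word alignment, the idea all these models build on, can be illustrated with a compact sketch: a few EM iterations of an IBM Model 1-style lexical translation model on a toy parallel corpus. Production systems (e.g., GIZA++) add null alignments, distortion and fertility models; the toy corpus and iteration count below are illustrative only.

```python
# Compact IBM Model 1-style EM on a toy German-English corpus, showing how
# co-occurrence statistics yield word translation probabilities t(e | f).
from collections import defaultdict

corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

# Uniform initialization of the lexical translation probabilities t(e | f).
e_vocab = {e for _, es in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))

for _ in range(10):                         # EM iterations
    count = defaultdict(float)              # expected counts c(e, f)
    total = defaultdict(float)              # expected counts c(f)
    for fs, es in corpus:
        for e in es:                        # E-step: collect expected alignments
            norm = sum(t[(e, f)] for f in fs)
            for f in fs:
                c = t[(e, f)] / norm
                count[(e, f)] += c
                total[f] += c
    for (e, f), c in count.items():         # M-step: re-estimate t(e | f)
        t[(e, f)] = c / total[f]

print(round(t[("house", "haus")], 3))       # converges towards 1.0
print(round(t[("the", "das")], 3))          # "das" ends up aligned to "the"
```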

