Reusing Monolingual Pre-Trained Models by Cross-Connecting Seq2seq Models for Machine Translation

2021 ◽  
Vol 11 (18) ◽  
pp. 8737
Author(s):  
Jiun Oh ◽  
Yong-Suk Choi

This work uses sequence-to-sequence (seq2seq) models pre-trained on monolingual corpora for machine translation. We pre-train two seq2seq models with monolingual corpora for the source and target languages, then combine the encoder of the source-language model with the decoder of the target-language model, i.e., the cross-connection. Since the two modules are pre-trained completely independently, we add an intermediate layer between the pre-trained encoder and decoder to help them map to each other. These monolingual pre-trained models can act as a multilingual pre-trained model, because any one model can be cross-connected with a model pre-trained on any other language, while its capacity remains unaffected by the number of languages. We demonstrate that our method improves translation performance significantly over a randomly initialized baseline. Moreover, we analyze the appropriate choice of the intermediate layer, the importance of each part of a pre-trained model, and how performance changes with the size of the bitext.
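The cross-connection idea can be sketched minimally as follows. This is not the authors' code; all names, dimensions, and the stand-in encoder/decoder are illustrative. The only newly trained component is the intermediate layer that maps the frozen source-language encoder's states into the frozen target-language decoder's input space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_dec = 512, 512

# Stand-ins for the frozen, independently pre-trained modules.
def frozen_encoder(token_ids):
    # Returns one hidden state per source token.
    return rng.standard_normal((len(token_ids), d_enc))

def frozen_decoder(cross_states):
    # A real decoder would attend over cross_states during generation;
    # here we only check that the bridged states have the right width.
    assert cross_states.shape[1] == d_dec
    return cross_states.mean(axis=0)

# The intermediate layer: the one part trained on the bitext,
# bridging the two independently pre-trained modules.
W = rng.standard_normal((d_enc, d_dec)) * 0.02
b = np.zeros(d_dec)

def intermediate_layer(enc_states):
    return enc_states @ W + b

src_ids = [3, 14, 15, 9, 2, 6, 5]
bridged = intermediate_layer(frozen_encoder(src_ids))
out = frozen_decoder(bridged)
print(bridged.shape, out.shape)  # (7, 512) (512,)
```

Because the bridge is the only trained part, swapping in a decoder pre-trained on a different target language requires no retraining of either monolingual model.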

2017 ◽  
Vol 108 (1) ◽  
pp. 257-269 ◽  
Author(s):  
Nasser Zalmout ◽  
Nizar Habash

Abstract Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, as well as for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes, and that a context-variable tokenization scheme can outperform a context-constant one, with a statistically significant improvement of about 1.4 BLEU points.
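A toy sketch of the idea of varying the tokenization scheme rather than fixing one for the whole text: the scheme is picked per target language here, and each scheme applies its own per-word splitting rule. The scheme names echo common Arabic tokenization conventions, but the selection rule and splitting rules are invented placeholders, not the paper's trained model.

```python
# Hypothetical scheme table: D0 keeps words whole, D3 aggressively
# splits a clitic-like prefix.  Both rules are toy illustrations.
SCHEMES = {
    "D0": lambda w: [w],
    "D3": lambda w: ["w+", w[1:]] if w.startswith("w") else [w],
}

def tokenize(words, target_lang):
    # Toy selection rule: morphologically rich targets keep source
    # words whole; analytic targets get aggressive segmentation.
    scheme = "D0" if target_lang in {"ru", "zh"} else "D3"
    out = []
    for w in words:
        out.extend(SCHEMES[scheme](w))
    return out

print(tokenize(["wktb", "ktab"], "en"))  # ['w+', 'ktb', 'ktab']
print(tokenize(["wktb", "ktab"], "ru"))  # ['wktb', 'ktab']
```

The paper's context-variable setting goes further and lets the choice vary word by word within one text; the table-plus-selector structure above is the part that stays the same.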


2016 ◽  
Vol 55 ◽  
pp. 209-248 ◽  
Author(s):  
Jörg Tiedemann ◽  
Zeljko Agić

How do we parse languages for which no treebanks are available? This contribution addresses the cross-lingual viewpoint on statistical dependency parsing, in which we attempt to make use of resource-rich source-language treebanks to build and adapt models for under-resourced target languages. We outline the benefits and indicate the drawbacks of the current major approaches. We emphasize synthetic treebanking: the automatic creation of target-language treebanks by means of annotation projection and machine translation. We present competitive results in cross-lingual dependency parsing using a combination of various techniques that contribute to the overall success of the method. We further include a detailed discussion of the impact of part-of-speech label accuracy on parsing results, providing guidance for practical applications of cross-lingual methods to truly under-resourced languages.


Author(s):  
Zhenpeng Chen ◽  
Sheng Shen ◽  
Ziniu Hu ◽  
Xuan Lu ◽  
Qiaozhu Mei ◽  
...  

Sentiment classification typically relies on a large amount of labeled data. In practice, the availability of labels is highly imbalanced among different languages. To tackle this problem, cross-lingual sentiment classification approaches aim to transfer knowledge learned from one language that has abundant labeled examples (i.e., the source language, usually English) to another language with fewer labels (i.e., the target language). The source and the target languages are usually bridged through off-the-shelf machine translation tools. Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages. This approach, however, often fails to capture sentiment knowledge specific to the target language. In this paper, we employ emojis, which are widely available in many languages, as a new channel to learn both the cross-language and the language-specific sentiment patterns. We propose a novel representation learning method that uses emoji prediction as an instrument to learn respective sentiment-aware representations for each language. The learned representations are then integrated to facilitate cross-lingual sentiment classification.
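The emoji-prediction instrument starts from naturally occurring data: sentences containing emojis are turned into (text, emoji-label) pairs, and a model pre-trained to predict the removed emoji learns sentiment-aware representations. A minimal sketch of that pair-extraction step (the regex covers only the Unicode "emoticons" block for brevity; the function name is illustrative):

```python
import re

# Matches the U+1F600..U+1F64F emoticons block only; a fuller system
# would cover all emoji ranges.
EMOJI = re.compile(r"[\U0001F600-\U0001F64F]")

def emoji_prediction_pairs(sentences):
    """Turn raw sentences into (text-without-emoji, emoji-label) pairs
    usable as self-supervised training data for emoji prediction."""
    pairs = []
    for s in sentences:
        emojis = EMOJI.findall(s)
        if emojis:
            text = EMOJI.sub("", s).strip()
            pairs.append((text, emojis[0]))
    return pairs

data = ["great movie \U0001F600", "no label here", "so sad \U0001F622"]
print(emoji_prediction_pairs(data))
```

Because emojis occur natively in many languages, the same extraction runs unchanged on the target language, which is what lets the method capture language-specific sentiment patterns that machine-translated supervision misses.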


Author(s):  
Namrata G Kharate ◽  
Varsha H Patil

Machine translation is an important application of natural language processing. Machine translation means translating from a source language to a target language while preserving the meaning of the sentence. A large amount of research is going on in the area of machine translation. However, research in machine translation remains highly localized to particular source and target languages, as languages differ syntactically and morphologically. Appropriate inflections result in correct translation. This paper elaborates the rules for inflecting parts of speech and implements the inflection for Marathi-to-English translation. The inflection of nouns, pronouns, verbs and adjectives is carried out on the basis of the semantics of the sentence. The results are discussed with examples.


2020 ◽  
pp. 333-355
Author(s):  
Joanna Szerszunowicz ◽  

The aim of this paper is to discuss the usefulness and reliability of the onomasiological approach in the cross-linguistic analysis of fixed multiword expressions, based on the example of Polish phrases coined according to the model ADJECTIVE(NOM.FEM.SING) + GŁOWA 'HEAD' and their English and Italian counterparts. The three corpora are constituted by expressions registered in general and phraseological dictionaries of the respective languages, to ensure that the units belong to the canon of the Polish, English and Italian phraseological stock. The analysis of the units collected for the purpose of the study clearly shows that, in order to determine the true picture of cross-linguistic equivalence, the study should focus on the semantics of the analysed phrases. Furthermore, the formal aspect may be of minor significance in some cases, due to the similarity of imagery between a source-language idiom and the target-language lexical item. On the other hand, stylistic value may have a great impact on the relation of cross-linguistic correspondence of the analysed units.


2020 ◽  
Vol 34 (05) ◽  
pp. 8568-8575
Author(s):  
Xing Niu ◽  
Marine Carpuat

This work aims to produce translations that convey source language content at a formality level that is appropriate for a particular audience. Framing this problem as a neural sequence-to-sequence task ideally requires training triplets consisting of a bilingual sentence pair labeled with target language formality. However, in practice, available training examples are limited to English sentence pairs of different styles, and bilingual parallel sentences of unknown formality. We introduce a novel training scheme for multi-task models that automatically generates synthetic training triplets by inferring the missing element on the fly, thus enabling end-to-end training. Comprehensive automatic and human assessments show that our best model outperforms existing models by producing translations that better match desired formality levels while preserving the source meaning.
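The on-the-fly triplet completion can be sketched as follows: a bilingual pair of unknown formality gets a label from a formality classifier, and a monolingual English style pair gets a synthetic translation. Here `classify_formality` and `translate` are stubs standing in for the paper's trained components, and the rule inside the classifier is invented for illustration.

```python
def classify_formality(en_sentence):
    # Stub: a real system would use a trained style classifier.
    return "informal" if "hello" in en_sentence else "formal"

def translate(en_sentence):
    # Stub: a real system would use the MT model itself.
    return f"<fr> {en_sentence}"

def complete_triplet(example):
    """Fill in whichever element of (source, target, formality) is
    missing, yielding a full synthetic training triplet."""
    src, tgt, formality = example
    if formality is None:          # bilingual pair, unknown style
        formality = classify_formality(src)
    if tgt is None:                # monolingual style pair, no translation
        tgt = translate(src)
    return (src, tgt, formality)

print(complete_triplet(("good day", "bonne journee", None)))
print(complete_triplet(("hello there", None, "informal")))
```

Generating the missing element during training, rather than in a preprocessing pass, is what makes the multi-task model trainable end to end.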


2019 ◽  
Vol 28 (3) ◽  
pp. 447-453 ◽  
Author(s):  
Sainik Kumar Mahata ◽  
Dipankar Das ◽  
Sivaji Bandyopadhyay

Abstract Machine translation (MT) is the automatic translation of the source language to its target language by a computer system. In the current paper, we propose an approach of using recurrent neural networks (RNNs) over traditional statistical MT (SMT). We compare the performance of the phrase table of SMT to the performance of the proposed RNN and in turn improve the quality of the MT output. This work has been done as a part of the shared task problem provided by the MTIL2017. We have constructed the traditional MT model using Moses toolkit and have additionally enriched the language model using external data sets. Thereafter, we have ranked the phrase tables using an RNN encoder-decoder module created originally as a part of the GroundHog project of LISA lab.


2018 ◽  
Vol 6 (3) ◽  
pp. 79-92
Author(s):  
Sahar A. El-Rahman ◽  
Tarek A. El-Shishtawy ◽  
Raafat A. El-Kammar

This article presents a realistic technique for a machine-aided translation system. In this technique, the system dictionary is partitioned into a multi-module structure for fast retrieval of the Arabic features of English words. Each module is accessed through an interface that includes the necessary morphological rules, which direct the search toward the proper sub-dictionary. Another factor that aids fast retrieval of the Arabic features of words is predicting the word category and accessing its sub-dictionary to retrieve the corresponding attributes. The system consists of three main parts: the source-language analysis, the transfer rules between the source language (English) and the target language (Arabic), and the generation of the target language. The proposed system is able to translate some negative forms, demonstratives, and conjunctions, and can also adjust nouns, verbs, and adjectives according to their attributes. It then adds the appropriate markers to the Arabic words to generate a correct sentence.
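The partitioned-dictionary idea amounts to routing each lookup through a predicted word category so that only one sub-dictionary is searched. A toy version, with invented entries and a trivial category predictor standing in for the paper's morphological-rule interfaces:

```python
# Sub-dictionaries keyed by word category; entries are illustrative.
SUB_DICTS = {
    "noun": {"book": {"arabic": "kitab", "gender": "masc"}},
    "verb": {"write": {"arabic": "yaktubu", "tense": "present"}},
}

def predict_category(word, context):
    # A real system would apply the morphological rules in each
    # module's interface; here a single contextual cue suffices.
    return "verb" if context and context[-1] == "to" else "noun"

def lookup(word, context=()):
    """Route the search to one sub-dictionary via the predicted
    category, instead of scanning the whole system dictionary."""
    cat = predict_category(word, context)
    return SUB_DICTS.get(cat, {}).get(word)

print(lookup("book"))            # found in the noun sub-dictionary
print(lookup("write", ("to",)))  # routed to the verb sub-dictionary
```

The speed-up comes from the routing step: each query touches one small sub-dictionary rather than the full lexicon.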


2014 ◽  
Vol 102 (1) ◽  
pp. 93-104
Author(s):  
Loganathan Ramasamy ◽  
David Mareček ◽  
Zdeněk Žabokrtský

Abstract This paper revisits the projection-based approach to the dependency grammar induction task. Traditional cross-lingual dependency induction approaches depend, one way or another, on the existence of bitexts or of target-language tools such as part-of-speech (POS) taggers to obtain reasonable parsing accuracy. In this paper, we transfer dependency parsers using only approximate resources, i.e., machine-translated bitexts instead of manually created ones. We do this by obtaining the source side of the text from a machine translation (MT) system and then applying transfer approaches to induce parsers for the target languages. We further reduce the need for labeled target-language resources by using an unsupervised target tagger. We show that our approach consistently outperforms unsupervised parsers by a large margin (8.2% absolute) and achieves similar performance when compared with delexicalized transfer parsers.
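At the core of all projection-based transfer is one step: carrying source-side dependency heads over to target tokens through word alignments. A minimal illustration (indices are 0-based, -1 marks the root; the alignment is assumed given, in this setting extracted from the machine-translated bitext):

```python
def project_heads(src_heads, alignment):
    """Project source dependency heads onto the target sentence via
    one-to-one word alignments given as (src_pos, tgt_pos) pairs.
    Unaligned target tokens keep head None."""
    s2t = dict(alignment)
    tgt_len = max(t for _, t in alignment) + 1
    tgt_heads = [None] * tgt_len
    for s, t in alignment:
        h = src_heads[s]
        tgt_heads[t] = -1 if h == -1 else s2t.get(h)
    return tgt_heads

# "dogs bark" -> "les chiens aboient": bark is root, dogs depends on it;
# the article "les" is unaligned.
src_heads = [1, -1]               # dogs -> bark, bark is root
alignment = [(0, 1), (1, 2)]      # dogs~chiens, bark~aboient
print(project_heads(src_heads, alignment))  # [None, 2, -1]
```

Real systems must additionally resolve many-to-one alignments and fill in unaligned tokens before training a parser on the projected trees; this sketch shows only the core projection.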


2020 ◽  
Vol 6 (1) ◽  
pp. 63
Author(s):  
Kadek Putri Yamayanti

This descriptive qualitative study investigates the translation equivalence of Balinese cultural terms into English. It is based on the understanding that cultural terms are a salient part of translation, owing to the cultural gap between source and target languages. Therefore, this study is conducted in order to find out the degree of equivalence between Balinese cultural terms and their English translations in the book entitled Memahami Roh Bali "Desa Adat sebagai Ikon Tri Hita Karana" and its translation Discovering the Spirit of Bali "Customary Village as Icon of Tri Hita Karana". In determining the degree of equivalence, componential analysis, especially binary features, was applied to confirm the semantic features. The result showed that none of the translated cultural terms has an exact synonym in the source language. Some semantic features do not occur in the target language as a result of lexical gaps in the target language. The translator tends to replace cultural terms of the source language with appropriate terms in the target language based on his knowledge and experience, and in some cases this results in loss or gain of information. Overall, however, the translated cultural terms can still share some basic semantic features of the source language.
Keywords: cultural term; semantic features; equivalence.
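Componential analysis with binary features reduces, computationally, to set comparison: the degree of equivalence is read off from which semantic features survive the translation and which are lost or gained. A sketch, with invented features for one source-target pair:

```python
def shared_and_lost(src_features, tgt_features):
    """Compare two binary-feature sets: features preserved by the
    translation, features lost, and features spuriously gained."""
    shared = src_features & tgt_features
    lost = src_features - tgt_features
    gained = tgt_features - src_features
    return shared, lost, gained

# Hypothetical feature sets for "desa adat" vs. "customary village";
# the features themselves are illustrative, not the study's analysis.
desa_adat = {"+village", "+customary", "+religious-duty"}
customary_village = {"+village", "+customary"}

shared, lost, gained = shared_and_lost(desa_adat, customary_village)
print(sorted(shared))  # ['+customary', '+village']
print(sorted(lost))    # ['+religious-duty']
```

A nonempty `lost` set with a nonempty `shared` set corresponds to the study's finding: no exact synonym, yet basic semantic features still carried over.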

