Reusing Monolingual Pre-Trained Models by Cross-Connecting Seq2seq Models for Machine Translation

2021 ◽  
Vol 11 (18) ◽  
pp. 8737
Author(s):  
Jiun Oh ◽  
Yong-Suk Choi

This work uses sequence-to-sequence (seq2seq) models pre-trained on monolingual corpora for machine translation. We pre-train two seq2seq models with monolingual corpora for the source and target languages, then combine the encoder of the source-language model with the decoder of the target-language model, i.e., the cross-connection. Since the two modules are pre-trained completely independently, we add an intermediate layer between the pre-trained encoder and decoder to help them map to each other. These monolingual pre-trained models can act as a multilingual pre-trained model, because any one model can be cross-connected with a model pre-trained on any other language, while its capacity remains unaffected by the number of languages. We demonstrate that our method improves translation performance significantly over a randomly initialized baseline. Moreover, we analyze the appropriate choice of the intermediate layer, the importance of each part of a pre-trained model, and how performance changes with the size of the bitext.
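The cross-connection idea can be sketched minimally as follows. This is not the authors' code; all names, dimensions, and the stand-in encoder/decoder are illustrative. The only newly trained component is the intermediate layer that maps the frozen source-language encoder's states into the frozen target-language decoder's input space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_dec = 512, 512

# Stand-ins for the frozen, independently pre-trained modules.
def frozen_encoder(token_ids):
    # Returns one hidden state per source token.
    return rng.standard_normal((len(token_ids), d_enc))

def frozen_decoder(cross_states):
    # A real decoder would attend over cross_states during generation;
    # here we only check that the bridged states have the right width.
    assert cross_states.shape[1] == d_dec
    return cross_states.mean(axis=0)

# The intermediate layer: the one part trained on the bitext,
# bridging the two independently pre-trained modules.
W = rng.standard_normal((d_enc, d_dec)) * 0.02
b = np.zeros(d_dec)

def intermediate_layer(enc_states):
    return enc_states @ W + b

src_ids = [3, 14, 15, 9, 2, 6, 5]
bridged = intermediate_layer(frozen_encoder(src_ids))
out = frozen_decoder(bridged)
print(bridged.shape, out.shape)  # (7, 512) (512,)
```

Because the bridge is the only trained part, swapping in a decoder pre-trained on a different target language requires no retraining of either monolingual model.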

2017 ◽  
Vol 108 (1) ◽  
pp. 257-269 ◽  
Author(s):  
Nasser Zalmout ◽  
Nizar Habash

Abstract Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, as well as for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes, and that a context-variable tokenization scheme can outperform a context-constant one, with a statistically significant improvement of about 1.4 BLEU points.
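A toy sketch of the idea of varying the tokenization scheme rather than fixing one for the whole text: the scheme is picked per target language here, and each scheme applies its own per-word splitting rule. The scheme names echo common Arabic tokenization conventions, but the selection rule and splitting rules are invented placeholders, not the paper's trained model.

```python
# Hypothetical scheme table: D0 keeps words whole, D3 aggressively
# splits a clitic-like prefix.  Both rules are toy illustrations.
SCHEMES = {
    "D0": lambda w: [w],
    "D3": lambda w: ["w+", w[1:]] if w.startswith("w") else [w],
}

def tokenize(words, target_lang):
    # Toy selection rule: morphologically rich targets keep source
    # words whole; analytic targets get aggressive segmentation.
    scheme = "D0" if target_lang in {"ru", "zh"} else "D3"
    out = []
    for w in words:
        out.extend(SCHEMES[scheme](w))
    return out

print(tokenize(["wktb", "ktab"], "en"))  # ['w+', 'ktb', 'ktab']
print(tokenize(["wktb", "ktab"], "ru"))  # ['wktb', 'ktab']
```

The paper's context-variable setting goes further and lets the choice vary word by word within one text; the table-plus-selector structure above is the part that stays the same.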


2016 ◽  
Vol 55 ◽  
pp. 209-248 ◽  
Author(s):  
Jörg Tiedemann ◽  
Zeljko Agić

How do we parse languages for which no treebanks are available? This contribution addresses the cross-lingual viewpoint on statistical dependency parsing, in which we attempt to make use of resource-rich source-language treebanks to build and adapt models for under-resourced target languages. We outline the benefits and indicate the drawbacks of the current major approaches. We emphasize synthetic treebanking: the automatic creation of target-language treebanks by means of annotation projection and machine translation. We present competitive results in cross-lingual dependency parsing using a combination of various techniques that contribute to the overall success of the method. We further include a detailed discussion of the impact of part-of-speech label accuracy on parsing results, providing guidance for practical applications of cross-lingual methods to truly under-resourced languages.


Author(s):  
Zhenpeng Chen ◽  
Sheng Shen ◽  
Ziniu Hu ◽  
Xuan Lu ◽  
Qiaozhu Mei ◽  
...  

Sentiment classification typically relies on a large amount of labeled data. In practice, the availability of labels is highly imbalanced among different languages. To tackle this problem, cross-lingual sentiment classification approaches aim to transfer knowledge learned from one language that has abundant labeled examples (i.e., the source language, usually English) to another language with fewer labels (i.e., the target language). The source and the target languages are usually bridged through off-the-shelf machine translation tools. Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages. This approach, however, often fails to capture sentiment knowledge specific to the target language. In this paper, we employ emojis, which are widely available in many languages, as a new channel to learn both the cross-language and the language-specific sentiment patterns. We propose a novel representation learning method that uses emoji prediction as an instrument to learn respective sentiment-aware representations for each language. The learned representations are then integrated to facilitate cross-lingual sentiment classification.
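The emoji-prediction instrument starts from naturally occurring data: sentences containing emojis are turned into (text, emoji-label) pairs, and a model pre-trained to predict the removed emoji learns sentiment-aware representations. A minimal sketch of that pair-extraction step (the regex covers only the Unicode "emoticons" block for brevity; the function name is illustrative):

```python
import re

# Matches the U+1F600..U+1F64F emoticons block only; a fuller system
# would cover all emoji ranges.
EMOJI = re.compile(r"[\U0001F600-\U0001F64F]")

def emoji_prediction_pairs(sentences):
    """Turn raw sentences into (text-without-emoji, emoji-label) pairs
    usable as self-supervised training data for emoji prediction."""
    pairs = []
    for s in sentences:
        emojis = EMOJI.findall(s)
        if emojis:
            text = EMOJI.sub("", s).strip()
            pairs.append((text, emojis[0]))
    return pairs

data = ["great movie \U0001F600", "no label here", "so sad \U0001F622"]
print(emoji_prediction_pairs(data))
```

Because emojis occur natively in many languages, the same extraction runs unchanged on the target language, which is what lets the method capture language-specific sentiment patterns that machine-translated supervision misses.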


Author(s):  
Namrata G Kharate ◽  
Varsha H Patil

Machine translation is an important application of natural language processing. Machine translation means translating from a source language to a target language while preserving the meaning of the sentence. A large amount of research is going on in the area of machine translation. However, research in machine translation remains highly localized to particular source and target languages, as languages differ syntactically and morphologically. Appropriate inflections result in correct translation. This paper elaborates the rules for inflecting parts of speech and implements the inflection for Marathi-to-English translation. The inflection of nouns, pronouns, verbs and adjectives is carried out on the basis of the semantics of the sentence. The results are discussed with examples.


2020 ◽  
pp. 333-355
Author(s):  
Joanna Szerszunowicz ◽  

The aim of this paper is to discuss the usefulness and reliability of the onomasiological approach in the cross-linguistic analysis of fixed multiword expressions, based on the example of Polish phrases coined according to the model ADJECTIVE(NOM.FEM.SING) + GŁOWA 'HEAD' and their English and Italian counterparts. The three corpora are constituted by expressions registered in general and phraseological dictionaries of the respective languages, to ensure that the units belong to the canon of the Polish, English and Italian phraseological stock. The analysis of the units collected for the purpose of the study clearly shows that, in order to determine the true picture of cross-linguistic equivalence, the study should focus on the semantics of the analysed phrases. Furthermore, the formal aspect may be of minor significance in some cases, due to the similarity of imagery between a source-language idiom and the target-language lexical item. On the other hand, stylistic value may have a great impact on the relation of cross-linguistic correspondence of the analysed units.


2020 ◽  
Vol 34 (05) ◽  
pp. 8568-8575
Author(s):  
Xing Niu ◽  
Marine Carpuat

This work aims to produce translations that convey source language content at a formality level that is appropriate for a particular audience. Framing this problem as a neural sequence-to-sequence task ideally requires training triplets consisting of a bilingual sentence pair labeled with target language formality. However, in practice, available training examples are limited to English sentence pairs of different styles, and bilingual parallel sentences of unknown formality. We introduce a novel training scheme for multi-task models that automatically generates synthetic training triplets by inferring the missing element on the fly, thus enabling end-to-end training. Comprehensive automatic and human assessments show that our best model outperforms existing models by producing translations that better match desired formality levels while preserving the source meaning.
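The on-the-fly triplet completion can be sketched as follows: a bilingual pair of unknown formality gets a label from a formality classifier, and a monolingual English style pair gets a synthetic translation. Here `classify_formality` and `translate` are stubs standing in for the paper's trained components, and the rule inside the classifier is invented for illustration.

```python
def classify_formality(en_sentence):
    # Stub: a real system would use a trained style classifier.
    return "informal" if "hello" in en_sentence else "formal"

def translate(en_sentence):
    # Stub: a real system would use the MT model itself.
    return f"<fr> {en_sentence}"

def complete_triplet(example):
    """Fill in whichever element of (source, target, formality) is
    missing, yielding a full synthetic training triplet."""
    src, tgt, formality = example
    if formality is None:          # bilingual pair, unknown style
        formality = classify_formality(src)
    if tgt is None:                # monolingual style pair, no translation
        tgt = translate(src)
    return (src, tgt, formality)

print(complete_triplet(("good day", "bonne journee", None)))
print(complete_triplet(("hello there", None, "informal")))
```

Generating the missing element during training, rather than in a preprocessing pass, is what makes the multi-task model trainable end to end.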


2019 ◽  
Vol 28 (3) ◽  
pp. 447-453 ◽  
Author(s):  
Sainik Kumar Mahata ◽  
Dipankar Das ◽  
Sivaji Bandyopadhyay

Abstract Machine translation (MT) is the automatic translation of the source language to its target language by a computer system. In the current paper, we propose an approach of using recurrent neural networks (RNNs) over traditional statistical MT (SMT). We compare the performance of the phrase table of SMT to the performance of the proposed RNN and in turn improve the quality of the MT output. This work has been done as a part of the shared task problem provided by the MTIL2017. We have constructed the traditional MT model using Moses toolkit and have additionally enriched the language model using external data sets. Thereafter, we have ranked the phrase tables using an RNN encoder-decoder module created originally as a part of the GroundHog project of LISA lab.


2018 ◽  
Vol 6 (3) ◽  
pp. 79-92
Author(s):  
Sahar A. El-Rahman ◽  
Tarek A. El-Shishtawy ◽  
Raafat A. El-Kammar

This article presents a realistic technique for a machine-aided translation system. In this technique, the system dictionary is partitioned into a multi-module structure for fast retrieval of the Arabic features of English words. Each module is accessed through an interface that includes the necessary morphological rules, which direct the search toward the proper sub-dictionary. Another factor that aids fast retrieval of the Arabic features of words is predicting the word category and accessing its sub-dictionary to retrieve the corresponding attributes. The system consists of three main parts: the source-language analysis, the transfer rules between the source language (English) and the target language (Arabic), and the generation of the target language. The proposed system is able to translate some negative forms, demonstratives, and conjunctions, and can also adjust nouns, verbs, and adjectives according to their attributes. It then adds the appropriate markers to the Arabic words to generate a correct sentence.
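The partitioned-dictionary idea amounts to routing each lookup through a predicted word category so that only one sub-dictionary is searched. A toy version, with invented entries and a trivial category predictor standing in for the paper's morphological-rule interfaces:

```python
# Sub-dictionaries keyed by word category; entries are illustrative.
SUB_DICTS = {
    "noun": {"book": {"arabic": "kitab", "gender": "masc"}},
    "verb": {"write": {"arabic": "yaktubu", "tense": "present"}},
}

def predict_category(word, context):
    # A real system would apply the morphological rules in each
    # module's interface; here a single contextual cue suffices.
    return "verb" if context and context[-1] == "to" else "noun"

def lookup(word, context=()):
    """Route the search to one sub-dictionary via the predicted
    category, instead of scanning the whole system dictionary."""
    cat = predict_category(word, context)
    return SUB_DICTS.get(cat, {}).get(word)

print(lookup("book"))            # found in the noun sub-dictionary
print(lookup("write", ("to",)))  # routed to the verb sub-dictionary
```

The speed-up comes from the routing step: each query touches one small sub-dictionary rather than the full lexicon.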


2014 ◽  
Vol 102 (1) ◽  
pp. 93-104
Author(s):  
Loganathan Ramasamy ◽  
David Mareček ◽  
Zdeněk Žabokrtský

Abstract This paper revisits the projection-based approach to the dependency grammar induction task. Traditional cross-lingual dependency induction approaches depend, one way or another, on the existence of bitexts or of target-language tools such as part-of-speech (POS) taggers to obtain reasonable parsing accuracy. In this paper, we transfer dependency parsers using only approximate resources, i.e., machine-translated bitexts instead of manually created ones. We do this by obtaining the source side of the text from a machine translation (MT) system and then applying transfer approaches to induce parsers for the target languages. We further reduce the need for labeled target-language resources by using an unsupervised target tagger. We show that our approach consistently outperforms unsupervised parsers by a large margin (8.2% absolute) and achieves similar performance when compared with delexicalized transfer parsers.
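At the core of all projection-based transfer is one step: carrying source-side dependency heads over to target tokens through word alignments. A minimal illustration (indices are 0-based, -1 marks the root; the alignment is assumed given, in this setting extracted from the machine-translated bitext):

```python
def project_heads(src_heads, alignment):
    """Project source dependency heads onto the target sentence via
    one-to-one word alignments given as (src_pos, tgt_pos) pairs.
    Unaligned target tokens keep head None."""
    s2t = dict(alignment)
    tgt_len = max(t for _, t in alignment) + 1
    tgt_heads = [None] * tgt_len
    for s, t in alignment:
        h = src_heads[s]
        tgt_heads[t] = -1 if h == -1 else s2t.get(h)
    return tgt_heads

# "dogs bark" -> "les chiens aboient": bark is root, dogs depends on it;
# the article "les" is unaligned.
src_heads = [1, -1]               # dogs -> bark, bark is root
alignment = [(0, 1), (1, 2)]      # dogs~chiens, bark~aboient
print(project_heads(src_heads, alignment))  # [None, 2, -1]
```

Real systems must additionally resolve many-to-one alignments and fill in unaligned tokens before training a parser on the projected trees; this sketch shows only the core projection.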


2020 ◽  
Vol 6 (1) ◽  
pp. 63
Author(s):  
Kadek Putri Yamayanti

This descriptive qualitative study investigates the translation equivalence of Balinese cultural terms into English. It is based on the understanding that cultural terms are a salient part of translation, owing to the cultural gap between source and target languages. Therefore, this study is conducted in order to find out the degree of equivalence between Balinese cultural terms and their English translations in the book entitled Memahami Roh Bali "Desa Adat sebagai Ikon Tri Hita Karana" and its translation Discovering the Spirit of Bali "Customary Village as Icon of Tri Hita Karana". In determining the degree of equivalence, componential analysis, especially binary features, was applied to confirm the semantic features. The result showed that none of the translated cultural terms has an exact synonym in the source language. Some semantic features do not occur in the target language as a result of lexical gaps in the target language. The translator tends to replace cultural terms of the source language with appropriate terms in the target language based on his knowledge and experience, and in some cases this results in loss or gain of information. Overall, however, the translated cultural terms can still share some basic semantic features of the source language.
Keywords: cultural term; semantic features; equivalence.
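Componential analysis with binary features reduces, computationally, to set comparison: the degree of equivalence is read off from which semantic features survive the translation and which are lost or gained. A sketch, with invented features for one source-target pair:

```python
def shared_and_lost(src_features, tgt_features):
    """Compare two binary-feature sets: features preserved by the
    translation, features lost, and features spuriously gained."""
    shared = src_features & tgt_features
    lost = src_features - tgt_features
    gained = tgt_features - src_features
    return shared, lost, gained

# Hypothetical feature sets for "desa adat" vs. "customary village";
# the features themselves are illustrative, not the study's analysis.
desa_adat = {"+village", "+customary", "+religious-duty"}
customary_village = {"+village", "+customary"}

shared, lost, gained = shared_and_lost(desa_adat, customary_village)
print(sorted(shared))  # ['+customary', '+village']
print(sorted(lost))    # ['+religious-duty']
```

A nonempty `lost` set with a nonempty `shared` set corresponds to the study's finding: no exact synonym, yet basic semantic features still carried over.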

