Scheduled Multi-Task Learning: From Syntax to Translation

2018 ◽  
Vol 6 ◽  
pp. 225-240 ◽  
Author(s):  
Eliyahu Kiperwasser ◽  
Miguel Ballesteros

Neural encoder-decoder models of machine translation have achieved impressive results while learning linguistic knowledge of both the source and target languages in an implicit, end-to-end manner. We propose a framework in which our model begins by learning syntax and translation in an interleaved fashion, gradually shifting its focus toward translation. Using this approach, we achieve considerable improvements in BLEU score on a relatively large parallel corpus (WMT14 English-to-German) and in a low-resource setup (WIT German-to-English).
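
The scheduling idea can be illustrated with a minimal sketch. The linear decay, the 0.5 starting ratio, and the `train_step` callable (one optimization step of a shared encoder-decoder with a task-specific head) are illustrative assumptions, not the authors' exact recipe.

```python
import random

def syntax_probability(step, total_steps, start=0.5):
    """Chance of drawing a syntax (parsing) batch at this training step.
    Decays linearly to zero, so the focus gradually shifts to translation."""
    return max(0.0, start * (1.0 - step / total_steps))

def scheduled_training(translation_batches, syntax_batches, train_step, total_steps):
    """Interleave the two tasks according to the schedule above.
    `train_step(batch, task)` stands in for one optimization step of a
    shared encoder-decoder model with a task-specific output head."""
    for step in range(total_steps):
        if random.random() < syntax_probability(step, total_steps):
            train_step(next(syntax_batches), task="parse")
        else:
            train_step(next(translation_batches), task="translate")
```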

Author(s):  
Rashmini Naranpanawa ◽  
Ravinga Perera ◽  
Thilakshi Fonseka ◽  
Uthayasanker Thayasivam

Neural machine translation (NMT) is a remarkable approach that performs much better than statistical machine translation (SMT) models when an abundant parallel corpus is available. However, vanilla NMT operates primarily at the word level with a fixed vocabulary, so low-resource, morphologically rich languages such as Sinhala are heavily affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advancements in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system based on the transformer and explore standard subword techniques on top of it to identify which subword approach has the greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies, combined with state-of-the-art NMT, can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.
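
For illustration, byte-pair encoding (BPE), the most widely used of the standard subword techniques such work compares, can be sketched as a toy learner over a word-frequency dictionary. This is a simplified illustration; production systems use optimized implementations (e.g., subword-nmt or SentencePiece).

```python
from collections import Counter

def merge_pair(word, pair):
    """Apply a single learned merge to one word (a tuple of symbols)."""
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary by repeatedly
    merging the most frequent adjacent symbol pair."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(w, best): f for w, f in vocab.items()}
    return merges

# e.g. learn_bpe({"lower": 5, "lowest": 2, "newer": 6}, num_merges=10)
```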


2016 ◽  
Vol 22 (4) ◽  
pp. 517-548 ◽  
Author(s):  
ANN IRVINE ◽  
CHRIS CALLISON-BURCH

Abstract We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present a detailed analysis of the accuracy of bilingual lexicon induction and show how a discriminative model can be used to combine various signals of translation equivalence (such as contextual similarity, temporal similarity, orthographic similarity, and topic similarity). Our discriminative model produces higher-accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features in a phrase-based SMT system. These monolingually estimated features enhance low-resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.
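
A hedged sketch of how such signals might be combined: the feature set mirrors the four named signals, but the embedding dictionaries (`ctx`, `topic`, `temporal`) and the logistic-regression combiner are illustrative assumptions rather than the authors' exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cosine(u, v):
    """Cosine similarity between two signal vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def orthographic_sim(src, tgt):
    """Normalized edit-distance similarity (plain DP Levenshtein)."""
    m, n = len(src), len(tgt)
    if max(m, n) == 0:
        return 1.0
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(m + 1), np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (src[i - 1] != tgt[j - 1]))
    return 1.0 - d[m, n] / max(m, n)

def features(src, tgt, ctx, topic, temporal):
    """One feature per signal of translation equivalence, for a word pair."""
    return [cosine(ctx[src], ctx[tgt]),            # contextual similarity
            cosine(topic[src], topic[tgt]),        # topic similarity
            cosine(temporal[src], temporal[tgt]),  # temporal similarity
            orthographic_sim(src, tgt)]            # orthographic similarity

def train_combiner(X_seed, y_seed):
    """Discriminatively weight the signals: positives come from a small
    seed dictionary, negatives from random word pairings."""
    return LogisticRegression().fit(X_seed, y_seed)
```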


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Michael Adjeisah ◽  
Guohua Liu ◽  
Douglas Omwenga Nyabuga ◽  
Richard Nuetey Nortey ◽  
Jinling Song

Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains a challenge. This research contributes to the field with a low-resource English-Twi translation system based on filtered synthetic parallel corpora. It is often difficult to determine what a good-quality corpus looks like in low-resource conditions, especially when the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we made extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse amount of available parallel corpora demonstrate that injecting a pseudo-parallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits substantial gains in BLEU and TER scores.
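
The Mahalanobis filtering step can be illustrated as follows. The sentence-embedding inputs and the keep-fraction threshold are assumptions, since the abstract specifies only the squared-Mahalanobis criterion, not the embedding model or cutoff.

```python
import numpy as np

def mahalanobis_filter(src_emb, tgt_emb, keep_fraction=0.8):
    """Keep sentence pairs whose embedding-difference vectors lie close
    to the bulk of the data under squared Mahalanobis distance.
    src_emb, tgt_emb: (n, d) arrays of sentence embeddings.
    Returns a boolean mask over the n pairs."""
    diff = src_emb - tgt_emb
    mu = diff.mean(axis=0)
    cov = np.cov(diff, rowvar=False) + 1e-6 * np.eye(diff.shape[1])
    inv = np.linalg.inv(cov)
    centered = diff - mu
    d2 = np.einsum("ij,jk,ik->i", centered, inv, centered)  # squared distances
    return d2 <= np.quantile(d2, keep_fraction)
```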


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Yanbo Zhang

Under the current artificial intelligence boom, machine translation is a research direction of natural language processing with important scientific and practical value. In practical applications, the variability of language, the limited capacity for representing semantic information, and the scarcity of parallel corpus resources all constrain machine translation from becoming practical and widespread. In this paper, we conduct deep mining of source-language text data to express complex, high-level, and abstract semantic information with an appropriate text representation model. For machine translation tasks with a large parallel corpus, we exploit annotated datasets to build a more effective transfer-learning-based end-to-end neural machine translation model with a supervised algorithm. For language pairs whose parallel corpus resources are scarce, transfer learning techniques are used to prevent overfitting of neural networks during training and to improve the generalization ability of end-to-end neural machine translation models under low-resource conditions. Finally, for translation tasks where the parallel corpus is extremely scarce but monolingual corpora are plentiful, the research focuses on unsupervised machine translation techniques, which will be a future research trend.
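
One common way to realize such transfer (a sketch under assumptions; the abstract does not spell out the exact procedure) is parent-child transfer: pretrain on the rich-resource pair, then freeze shared components and fine-tune on the low-resource pair at a reduced learning rate.

```python
import torch
import torch.nn as nn

class TinyNMT(nn.Module):
    """Toy stand-in for an encoder-decoder; real NMT models are far larger."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.encoder(self.embed(x))
        return self.out(h)

def fine_tune_low_resource(model):
    """After pretraining on the high-resource pair, freeze the embeddings
    and fine-tune the rest at a small learning rate, limiting overfitting
    on the tiny child corpus."""
    for p in model.embed.parameters():
        p.requires_grad = False
    return torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```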


2020 ◽  
pp. 1-22
Author(s):  
Sukanta Sen ◽  
Mohammed Hasanuzzaman ◽  
Asif Ekbal ◽  
Pushpak Bhattacharyya ◽  
Andy Way

Abstract Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and building a sufficiently large corpus is not always possible, as it demands time, money, and professional expertise. Hence, many existing large-scale parallel corpora are limited to specific languages and domains. In this paper, we propose an effective approach to improving an NMT system in a low-resource scenario without using any additional data. Our approach augments the original training data by means of parallel phrases extracted from the original training data itself using a statistical machine translation (SMT) system. Our proposed approach is based on the gated recurrent unit (GRU) and transformer networks. We choose the Hindi–English and Hindi–Bengali datasets for the Health, Tourism, and Judicial (Hindi–English only) domains. We train our NMT models for 10 translation directions, each using only 5–23k parallel sentences. Experiments show improvements in the range of 1.38–15.36 BiLingual Evaluation Understudy (BLEU) points over the baseline systems and show that transformer models perform better than GRU models in low-resource scenarios. In addition, we find that our proposed method outperforms SMT, which is known to work better than neural models in low-resource scenarios, for some translation directions. To further demonstrate the effectiveness of our proposed model, we also apply our approach to another interesting NMT task, old-to-modern English translation, using a tiny parallel corpus of only 2.7k sentences. For this task, we use publicly available old-modern English text that is approximately 1,000 years old. Evaluation on this task shows significant improvement over the baseline NMT.
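
A sketch of the augmentation step, assuming a Moses-style phrase table ("src ||| tgt ||| scores ..." lines) produced by the SMT system; the probability threshold, length cap, and choice of score field are illustrative assumptions.

```python
def augment_with_phrases(phrase_table_path, min_prob=0.5, max_len=6):
    """Turn high-confidence phrase-table entries into extra pseudo
    sentence pairs to append to the original bitext before NMT training."""
    extra = []
    with open(phrase_table_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split(" ||| ")
            if len(parts) < 3:
                continue
            src, tgt, scores = parts[0].strip(), parts[1].strip(), parts[2]
            # Which score field is p(tgt|src) depends on the table's config.
            if float(scores.split()[0]) >= min_prob and len(src.split()) <= max_len:
                extra.append((src, tgt))
    return extra
```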


2020 ◽  
pp. 1-11
Author(s):  
Lin Lin ◽  
Jie Liu ◽  
Xuebing Zhang ◽  
Xiufang Liang

Due to the complexity of English machine translation technology and its broad application prospects, many experts and scholars have devoted considerable effort to analyzing it. Given the complex and variable forms of English, the large difference between Chinese and English word order, and the insufficiency of Chinese-English parallel corpus resources, this paper uses deep learning to perform translation between Chinese and English. The research focus of this paper is how to use language pairs with rich parallel corpus resources to improve the performance of Chinese-English neural machine translation, that is, to use multi-task learning to train neural machine translation models. Moreover, this research proposes a low-resource neural machine translation method based on weight sharing, which improves the performance of Chinese-English low-resource neural machine translation. In addition, this study designs a controlled experiment to analyze the effectiveness of the proposed model. The results show that the proposed model is effective.
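
One possible weight-sharing layout is sketched below: a single encoder (and source embedding) serves both the rich-resource and the low-resource pair, while each direction keeps its own decoder. The layer types and sizes are illustrative assumptions, not the paper's actual architecture.

```python
import torch.nn as nn

class SharedEncoderNMT(nn.Module):
    """Multi-task NMT sketch: the encoder and source embeddings are
    shared across language pairs; decoders and output layers are not."""
    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(vocab, dim)            # shared
        self.encoder = nn.GRU(dim, dim, batch_first=True)    # shared
        self.decoders = nn.ModuleDict({
            "rich_resource": nn.GRU(dim, dim, batch_first=True),
            "low_resource": nn.GRU(dim, dim, batch_first=True),
        })
        self.outputs = nn.ModuleDict(
            {k: nn.Linear(dim, vocab) for k in self.decoders})

    def forward(self, x, pair):
        h, _ = self.encoder(self.src_embed(x))
        d, _ = self.decoders[pair](h)
        return self.outputs[pair](d)
```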


Author(s):  
Rui Wang ◽  
Xu Tan ◽  
Renqian Luo ◽  
Tao Qin ◽  
Tie-Yan Liu

Neural approaches have achieved state-of-the-art accuracy on machine translation but suffer from the high cost of collecting large-scale parallel data. Thus, a great deal of research has been conducted on neural machine translation (NMT) with very limited parallel data, i.e., the low-resource setting. In this paper, we provide a survey of low-resource NMT and classify related work into three categories according to the auxiliary data used: (1) exploiting monolingual data of the source and/or target languages, (2) exploiting data from auxiliary languages, and (3) exploiting multi-modal data. We hope that our survey helps researchers better understand this field and inspires them to design better algorithms, and helps industry practitioners choose appropriate algorithms for their applications.


2021 ◽  
Vol 11 (22) ◽  
pp. 10860
Author(s):  
Mengtao Sun ◽  
Hao Wang ◽  
Mark Pasquine ◽  
Ibrahim A. Hameed

Existing Sequence-to-Sequence (Seq2Seq) Neural Machine Translation (NMT) shows strong capability with High-Resource Languages (HRLs). However, this approach poses serious challenges when processing Low-Resource Languages (LRLs), because the model's expressiveness is limited by the scale of parallel sentence pairs available for training. This study utilizes adversarial and transfer learning techniques to mitigate the lack of sentence pairs in LRL corpora. We propose a new Low-resource, Adversarial, Cross-lingual (LAC) model for NMT. On the adversarial side, the LAC model consists of a generator and a discriminator: the generator is a Seq2Seq model that produces translations from the source to the target language, while the discriminator measures the gap between machine and human translations. In addition, we introduce transfer learning into the LAC model to help capture features in rare-resource settings, because some languages share the same subject-verb-object grammatical structure. Rather than using the entire pretrained LAC model, we separately utilize the pretrained generator and discriminator; the pretrained discriminator exhibited better performance in all experiments. Experimental results demonstrate that the LAC model achieves higher Bilingual Evaluation Understudy (BLEU) scores and has good potential to augment LRL translations.
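
A hedged sketch of the discriminator side; the abstract does not specify the architecture, so the pair-scoring network, the fixed-size sentence representations, and the hinge objective below are assumptions.

```python
import torch
import torch.nn as nn

class TranslationDiscriminator(nn.Module):
    """Scores whether a (source, translation) representation pair looks
    human-produced. Inputs are fixed-size sentence encodings."""
    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, src_repr, trans_repr):
        return self.scorer(torch.cat([src_repr, trans_repr], dim=-1))

def discriminator_loss(disc, src, human, machine):
    """Hinge-style objective: push human translations above machine ones,
    giving the generator a measurable 'gap' to close."""
    return (torch.relu(1.0 - disc(src, human)).mean()
            + torch.relu(1.0 + disc(src, machine)).mean())
```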

