Pre-Training on Mixed Data for Low-Resource Neural Machine Translation

Information, 2021, Vol. 12 (3), pp. 133
Author(s): Wenbo Zhang, Xiao Li, Yating Yang, Rui Dong

The pre-training fine-tuning mode has been shown to be effective for low-resource neural machine translation. In this mode, models pre-trained on monolingual data are used to initialize translation models, transferring knowledge from the monolingual data into them. In recent years, pre-training models have usually taken sentences with randomly masked words as input and been trained to predict these masked words based on the unmasked ones. In this paper, we propose a new pre-training method that still predicts masked words, but randomly replaces some of the unmasked words in the input with their translations in another language. The translation words come from bilingual data, so that the data for pre-training contains both monolingual and bilingual data. We conduct experiments on a Uyghur-Chinese corpus to evaluate our method. The experimental results show that our method gives the pre-trained model better generalization ability and helps the translation model achieve better performance. Through a word translation task, we also demonstrate that our method enables the embeddings of the translation model to acquire more alignment knowledge.
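The masking-plus-replacement scheme described above can be sketched as a data-augmentation step. This is a minimal illustration, not the paper's implementation: the rates, the `[MASK]` token, and the `lexicon` dictionary (mapping source words to their translations from bilingual data) are all hypothetical.

```python
import random

def make_mixed_example(sentence, lexicon, mask_rate=0.15, replace_rate=0.15, seed=0):
    """Build one pre-training example: mask some words (to be predicted),
    and replace some of the remaining words with their translation from a
    bilingual lexicon, producing code-switched monolingual+bilingual input."""
    rng = random.Random(seed)
    tokens = sentence.split()
    inputs, masked_positions = [], []
    for i, tok in enumerate(tokens):
        r = rng.random()
        if r < mask_rate:
            inputs.append("[MASK]")              # prediction target
            masked_positions.append(i)
        elif r < mask_rate + replace_rate and tok in lexicon:
            inputs.append(lexicon[tok])          # translation-word replacement
        else:
            inputs.append(tok)
    labels = [tokens[i] for i in masked_positions]
    return inputs, labels
```

The model would then be trained to predict `labels` at the masked positions, so the unmasked context it conditions on mixes both languages.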

2021, pp. 503-514
Author(s): Nier Wu, Hongxu Hou, Ziyue Guo, Wei Zheng

2021
Author(s): Idris Abdulmumin, Bashir Shehu Galadanci, Aliyu Garba

Abstract An effective method to generate a large number of parallel sentences for training improved neural machine translation (NMT) systems is the back-translation of target-side monolingual data. The standard back-translation method has been shown to make inefficient use of the huge amount of existing monolingual data because translation models cannot differentiate between authentic and synthetic parallel data during training. Tagging, or using gates, has been used to enable translation models to distinguish between synthetic and authentic data, improving standard back-translation and also enabling iterative back-translation on language pairs that underperformed with standard back-translation. In this work, we approach back-translation as a domain adaptation problem, eliminating the need for explicit tagging. In this approach, tag-less back-translation, the synthetic and authentic parallel data are treated as out-of-domain and in-domain data, respectively, and through pre-training and fine-tuning the translation model is shown to learn from them more efficiently during training. Experimental results show that the approach outperforms the standard and tagged back-translation approaches on low-resource English-Vietnamese and English-German neural machine translation.
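The two-stage regime described above can be sketched as a training schedule: synthetic (out-of-domain) pairs first, authentic (in-domain) pairs for fine-tuning afterwards. This is a hedged sketch of the data flow only; the function name, the epoch counts, and the representation of data sets as lists of sentence pairs are illustrative, not from the paper.

```python
def tagless_backtranslation_schedule(synthetic, authentic,
                                     pretrain_epochs=3, finetune_epochs=3):
    """Yield (phase, epoch, pair) in the order tag-less back-translation
    consumes the data: pre-train on synthetic pairs, then fine-tune on
    authentic pairs. No tags are attached to either data set."""
    for epoch in range(pretrain_epochs):
        for pair in synthetic:
            yield ("pretrain", epoch, pair)
    for epoch in range(finetune_epochs):
        for pair in authentic:
            yield ("finetune", epoch, pair)
```

The point of the design is that domain adaptation happens through the ordering alone, so the model never needs an explicit synthetic/authentic marker in its input.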


2021, Vol. 2021, pp. 1-7
Author(s): Zhiwang Xu, Huibin Qin, Yongzhu Hua

In recent years, machine translation based on neural networks has become the mainstream method in the field, but the challenges of insufficient parallel corpora and data sparsity remain in low-resource translation. Existing machine translation models are usually trained on word-granularity segmented datasets. However, different segmentation granularities carry different grammatical and semantic features and information, and considering word granularity alone restricts the efficient training of neural machine translation systems. Aiming at the data sparsity caused by the lack of Uyghur-Chinese parallel corpora and the complex morphology of Uyghur, this paper proposes a multi-strategy training method over several segmentation granularities (syllables, marked syllables, words, and syllable-word fusion) and, to overcome the shortcomings of traditional recurrent and convolutional neural networks, builds a Transformer-based Uyghur-Chinese neural machine translation model that relies entirely on the multi-head self-attention mechanism. Evaluation results on the CCMT2019 Uyghur-Chinese bilingual datasets show that the multi-granularity training method is significantly better than the single-granularity segmentation translation systems, while the Transformer model obtains a higher BLEU score than a Uyghur-Chinese translation model based on Self-Attention-RNN.


2020, Vol. 30 (01), pp. 2050001
Author(s): Takumi Maruyama, Kazuhide Yamamoto

Inspired by the machine translation task, recent text simplification approaches regard the task as monolingual text-to-text generation, and neural machine translation models have significantly improved the performance of simplification tasks. Although such models require a large-scale parallel corpus, corpora for text simplification are few in number and smaller in size than those for machine translation. Therefore, we have attempted to facilitate the training of simplification rewritings using pre-training on a large-scale monolingual corpus such as Wikipedia articles. In addition, we propose a translation language model to seamlessly conduct fine-tuning of text simplification from the pre-training of the language model. The experimental results show that the translation language model substantially outperforms a state-of-the-art model in a low-resource setting. In addition, a pre-trained translation language model with only 3,000 supervised examples can achieve performance comparable to that of the state-of-the-art model trained on 30,000 supervised examples.


2020, Vol. 12 (12), pp. 215
Author(s): Wenbo Zhang, Xiao Li, Yating Yang, Rui Dong, Gongxu Luo

Recently, the pretraining of models has been successfully applied to unsupervised and semi-supervised neural machine translation. A cross-lingual language model uses a pretrained masked language model to initialize the encoder and decoder of the translation model, which greatly improves translation quality. However, because of a mismatch in the number of layers, the pretrained model can initialize only part of the decoder’s parameters. In this paper, we use a layer-wise coordination transformer and a consistent-pretraining translation transformer instead of a vanilla transformer as the translation model. The former has only an encoder, and the latter has an encoder and a decoder, but the encoder and decoder share exactly the same parameters. Both models guarantee that all parameters in the translation model can be initialized by the pretrained model. Experiments on the Chinese–English and English–German datasets show that, compared with the vanilla transformer baseline, our models achieve better performance with fewer parameters when the parallel corpus is small.
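The parameter-sharing idea above can be illustrated without any deep-learning library: if the decoder reuses the very same parameter objects as the encoder, a pretrained stack initializes both at once. The class and the scalar "weight" below are toy stand-ins, not the paper's architecture.

```python
class SharedLayer:
    """Toy layer with a single scalar parameter, standing in for a
    transformer layer whose weights can be tied across encoder/decoder."""
    def __init__(self, weight):
        self.weight = weight          # one parameter object, shared by reference
    def __call__(self, xs):
        return [self.weight * x for x in xs]

# One pretrained stack serves as both encoder and decoder, so every
# parameter of the translation model is covered by the pretrained model.
pretrained_stack = [SharedLayer(0.5) for _ in range(2)]
encoder_layers = pretrained_stack
decoder_layers = pretrained_stack     # same objects, not copies
```

Because the layers are shared by reference, updating a weight during fine-tuning changes it for the encoder and the decoder simultaneously, which is the sense in which "the encoder and decoder have exactly the same parameters."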


2020, Vol. 34 (4), pp. 325-346
Author(s): John E. Ortega, Richard Castro Mamani, Kyunghyun Cho

Author(s): Rashmini Naranpanawa, Ravinga Perera, Thilakshi Fonseka, Uthayasanker Thayasivam

Neural machine translation (NMT) is a remarkable approach that performs much better than statistical machine translation (SMT) models when there is an abundance of parallel corpora. However, vanilla NMT primarily operates at the word level with a fixed vocabulary, so low-resource, morphologically rich languages such as Sinhala are badly affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advancements in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system based on the transformer and explore standard subword techniques on top of it to identify which subword approach has the greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.
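One standard subword technique the abstract alludes to is byte-pair encoding (BPE), which segments an OOV word into known subword units by greedily applying a learned merge table. The sketch below shows only the segmentation step; the merge table passed in is a made-up example, and a real system would learn thousands of merges from the training corpus.

```python
def bpe_segment(word, merges):
    """Segment a word into subwords by applying BPE merges greedily.
    `merges` is an ordered list of symbol pairs, highest priority first."""
    symbols = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break                     # no learned merge applies anymore
        i = pairs.index(best)
        symbols[i:i + 2] = [best[0] + best[1]]
    return symbols
```

An unseen word is thus never mapped to a single OOV token: it decomposes into subwords the model has embeddings for, which is what makes open-vocabulary translation possible for morphologically rich languages.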


2021, pp. 1-10
Author(s): Zhiqiang Yu, Yuxin Huang, Junjun Guo

It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions. Thai-Lao is a typical low-resource language pair with a tiny parallel corpus, leading to suboptimal NMT performance. However, Thai and Lao have considerable similarities in linguistic morphology, and a bilingual lexicon is relatively easy to obtain for them. To exploit this feature, we first build a bilingual similarity lexicon composed of pairs of similar words. Then we propose a novel NMT architecture to leverage the similarity between Thai and Lao. Specifically, besides the prevailing sentence encoder, we introduce an extra similarity-lexicon encoder into the conventional encoder-decoder architecture, by which the semantic information carried by the similarity lexicon can be represented. We further provide a simple mechanism in the decoder to balance the information delivered by the input sentence and the similarity lexicon. Our approach can fully exploit the linguistic similarity captured by the similarity lexicon to improve translation quality. Experimental results demonstrate that our approach achieves significant improvements over the state-of-the-art Transformer baseline and previous similar works.
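A common way to realize the "simple balancing mechanism" the abstract mentions is a learned scalar gate that interpolates between the two encoders' representations. The sketch below is one plausible form under that assumption; the function, the weight vectors `w_s`/`w_l`, and the use of plain Python lists are all illustrative, not the paper's actual formulation.

```python
import math

def gated_fusion(sentence_repr, lexicon_repr, w_s, w_l, bias):
    """Blend the sentence-encoder and similarity-lexicon-encoder vectors
    with a sigmoid gate: g = sigmoid(w_s . h_s + w_l . h_l + bias),
    output = g * h_s + (1 - g) * h_l."""
    score = sum(w * h for w, h in zip(w_s, sentence_repr)) \
          + sum(w * h for w, h in zip(w_l, lexicon_repr)) + bias
    g = 1.0 / (1.0 + math.exp(-score))
    return [g * s + (1.0 - g) * l
            for s, l in zip(sentence_repr, lexicon_repr)]
```

With learned weights, the decoder can lean on the lexicon encoder when the input sentence gives weak evidence (e.g. a rare word with a known similar counterpart) and on the sentence encoder otherwise.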

