Inferring Shallow-Transfer Machine Translation Rules from Small Parallel Corpora

Journal of Artificial Intelligence Research ◽

10.1613/jair.2735 ◽

2009 ◽

Vol 34 ◽

pp. 605-635 ◽

Cited By ~ 11

Author(s):

F. Sánchez-Martínez ◽

M. L. Forcada

Keyword(s):

Open Source ◽

Machine Translation ◽

Parallel Corpora ◽

Bilingual Dictionary ◽

Translation Quality ◽

Statistical Mt ◽

Transfer Rules ◽

Word Translation ◽


Free Open Source

This paper describes a method for automatically inferring, from small parallel corpora, the structural transfer rules used in a shallow-transfer machine translation (MT) system. The structural transfer rules are based on alignment templates, like those used in statistical MT. Alignment templates are extracted from sentence-aligned parallel corpora and extended with a set of restrictions which are derived from the bilingual dictionary of the MT system and control their application as transfer rules. The experiments conducted using three different language pairs in the free/open-source MT platform Apertium show that translation quality is improved as compared to word-for-word translation (when no transfer rules are used), and that the resulting translation quality is close to that obtained using hand-coded transfer rules. The method we present is entirely unsupervised and benefits from information in the other modules of the MT system in which the inferred rules are applied.


RuLearn: an Open-source Toolkit for the Automatic Inference of Shallow-transfer Rules for Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2016-0018 ◽

2016 ◽

Vol 106 (1) ◽

pp. 193-204

Author(s):

Víctor M. Sánchez-Cartagena ◽

Juan Antonio Pérez-Ortiz ◽

Felipe Sánchez-Martínez

Keyword(s):

Open Source ◽

Machine Translation ◽

Statistical Machine Translation ◽

Rule Based ◽

Parallel Corpora ◽

Linguistic Resources ◽

Data Sparseness ◽

Translation Quality ◽

Transfer Rules ◽

Automatic Inference

This paper presents ruLearn, an open-source toolkit for the automatic inference of rules for shallow-transfer machine translation from scarce parallel corpora and morphological dictionaries. ruLearn will make rule-based machine translation a very appealing alternative for under-resourced language pairs because it avoids the need for human experts to handcraft transfer rules and requires, in contrast to statistical machine translation, only a small amount of parallel data (a few hundred parallel sentences proved to be sufficient). The inference algorithm implemented by ruLearn was recently published by the same authors in Computer Speech & Language (volume 32). It is able to produce rules whose translation quality is similar to that obtained by using hand-crafted rules. ruLearn generates rules that are ready for use in the Apertium platform, although they can be easily adapted to other platforms. When the rules produced by ruLearn are used together with a hybridisation strategy for integrating linguistic resources from shallow-transfer rule-based machine translation into phrase-based statistical machine translation (published by the same authors in Journal of Artificial Intelligence Research, volume 55), they help to mitigate data sparseness. This paper also shows how to use ruLearn and describes its implementation.


Recurrent Stacking of Layers for Compact Neural Machine Translation Models

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33016292 ◽

2019 ◽

Vol 33 ◽

pp. 6292-6299 ◽

Cited By ~ 2

Author(s):

Raj Dabre ◽

Atsushi Fujita

Keyword(s):

Machine Translation ◽

Single Layer ◽

Training Data ◽

Neural Machine Translation ◽

Parallel Corpora ◽

Translation Quality ◽

Sequence Generation ◽

Sequence Modeling ◽

Back Translation

In encoder-decoder based sequence-to-sequence modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in the encoder and decoder. While the addition of each new layer improves the sequence generation quality, this also leads to a significant increase in the number of parameters. In this paper, we propose to share parameters across all layers, thereby leading to a recurrently stacked sequence-to-sequence model. We report on an extensive case study on neural machine translation (NMT) using our proposed method, experimenting with a variety of datasets. We empirically show that the translation quality of a model that recurrently stacks a single layer 6 times, despite having significantly fewer parameters, approaches that of a model that stacks 6 different layers. We also show how our method can benefit from a prevalent way of improving NMT, i.e., extending training data with pseudo-parallel corpora generated by back-translation. We then analyze the effects of recurrently stacked layers by visualizing the attentions of models that use recurrently stacked layers and models that do not. Finally, we explore the limits of parameter sharing, where we share even the parameters between the encoder and decoder in addition to recurrent stacking of layers.
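The parameter-sharing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the toy feed-forward layer, the dimension of 8, and the helper names (`make_layer`, `apply_layer`, `count_parameters`, `run_stack`) are all assumptions made for illustration. It only shows that reusing one layer 6 times keeps the depth of computation while storing the parameters of a single layer.

```python
import math
import random


def make_layer(dim, rng):
    """Create one toy feed-forward layer (hypothetical): dim x dim weights plus a bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return {"weights": weights, "bias": bias}


def apply_layer(layer, x):
    """Compute tanh(W x + b) for a single input vector x."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(layer["weights"], layer["bias"])
    ]


def count_parameters(layers):
    """Count parameters, counting each distinct layer object only once."""
    unique = {id(layer): layer for layer in layers}
    return sum(
        len(l["weights"]) * len(l["weights"][0]) + len(l["bias"])
        for l in unique.values()
    )


def run_stack(layers, x):
    """Apply the layers in order; a shared layer is simply applied repeatedly."""
    for layer in layers:
        x = apply_layer(layer, x)
    return x


rng = random.Random(0)
DIM, DEPTH = 8, 6

# Conventional stacking: 6 layers, each with its own parameters.
vanilla_stack = [make_layer(DIM, rng) for _ in range(DEPTH)]

# Recurrent stacking: the same layer object is reused at every depth.
shared_layer = make_layer(DIM, rng)
recurrent_stack = [shared_layer] * DEPTH

vanilla_params = count_parameters(vanilla_stack)
recurrent_params = count_parameters(recurrent_stack)
output = run_stack(recurrent_stack, [1.0] * DIM)
```

In this toy setting the recurrently stacked model stores one sixth of the parameters of the conventional stack, while both apply six transformations to the input.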


Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages

Machine Translation ◽

10.1007/s10590-021-09260-6 ◽

2021 ◽

Author(s):

Tanmai Khanna ◽

Jonathan N. Washington ◽

Francis M. Tyers ◽

Sevilay Bayatlı ◽

Daniel G. Swanson ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

Lexical Selection ◽

Rule Based ◽

Low Resource ◽

Language Technology ◽

Language Data ◽

Recursive Structures ◽

Platform Translation ◽

Free Open Source

This paper presents an overview of Apertium, a free and open-source rule-based machine translation platform. Translation in Apertium happens through a pipeline of modular tools, and the platform continues to be improved as more language pairs are added. Several advances have been implemented since the last publication, including some new optional modules: a module that allows rules to process recursive structures at the structural transfer stage, a module that deals with contiguous and discontiguous multi-word expressions, and a module that resolves anaphora to aid translation. Also highlighted is the hybridisation of Apertium through statistical modules that augment the pipeline, and statistical methods that augment existing modules. This includes morphological disambiguation, weighted structural transfer, and lexical selection modules that learn from limited data. The paper also discusses how a platform like Apertium can be a critical part of access to language technology for so-called low-resource languages, which might be ignored or deemed unapproachable by popular corpus-based translation technologies. Finally, the paper presents some of the released and unreleased language pairs, concluding with a brief look at some supplementary Apertium tools that prove valuable to users as well as language developers. All Apertium-related code, including language data, is free/open-source and available at https://github.com/apertium.


Apertium: a free/open-source platform for rule-based machine translation

Machine Translation ◽

10.1007/s10590-011-9090-0 ◽

2011 ◽

Vol 25 (2) ◽

pp. 127-144 ◽

Cited By ~ 53

Author(s):

Mikel L. Forcada ◽

Mireia Ginestí-Rosell ◽

Jacob Nordfalk ◽

Jim O’Regan ◽

Sergio Ortiz-Rojas ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

Rule Based ◽

Free Open Source


Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora

Applied Sciences ◽

10.3390/app9102036 ◽

2019 ◽

Vol 9 (10) ◽

pp. 2036

Author(s):

Jinyi Zhang ◽

Tadahiro Matsumoto

Keyword(s):

Machine Translation ◽

Scientific Paper ◽

Training Data ◽

Word Alignment ◽

Sentence Pair ◽

Neural Machine Translation ◽

Parallel Corpora ◽

Translation Quality ◽

Parallel Data ◽

Source Sentence

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the training data size. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method, which has two variations: one is for all language pairs, and the other is for the Chinese-Japanese language pair. The method uses both source and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with “shared Chinese character rates” in segments of the sentence pairs. Experimental results for Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply the code (see Supplementary Materials) that can reproduce our proposed method.
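Steps (1) to (3) can be sketched roughly as follows. This is a simplified illustration, not the authors' released code: it assumes the sentence pair has already been split into aligned partial sentences (the alignment-based choice of split points is omitted), it assumes each pseudo-source sentence is paired with the unchanged target sentence, and `augment_pair` and the stand-in back-translator are hypothetical names.

```python
def augment_pair(source_parts, target_parts, back_translate):
    """Generate pseudo-parallel pairs from one pre-split sentence pair.

    source_parts / target_parts: aligned partial sentences (step 1, assumed done).
    back_translate: callable mapping a target partial sentence to the source
                    language (step 2); here a caller-supplied stand-in.
    Returns one (pseudo_source, target) pair per partial sentence (step 3).
    """
    if len(source_parts) != len(target_parts):
        raise ValueError("partial sentences must be aligned one-to-one")

    target_sentence = " ".join(target_parts)
    pseudo_pairs = []
    for index, target_part in enumerate(target_parts):
        # Replace exactly one source partial sentence with its back-translation.
        pseudo_source_parts = list(source_parts)
        pseudo_source_parts[index] = back_translate(target_part)
        pseudo_pairs.append((" ".join(pseudo_source_parts), target_sentence))
    return pseudo_pairs


def toy_back_translate(text):
    """Stand-in for a real target-to-source MT system; it only marks the text."""
    return "bt(" + text + ")"


pairs = augment_pair(
    ["src one,", "src two."],
    ["tgt one,", "tgt two."],
    toy_back_translate,
)
```

With two partial sentences, the sketch yields two pseudo-source sentences, each differing from the original source in exactly one segment.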


Bridging the “gApp”: improving neural machine translation systems for multiword expression detection

Yearbook of Phraseology ◽

10.1515/phras-2020-0005 ◽

2020 ◽

Vol 11 (1) ◽

pp. 61-80

Author(s):

Carlos Manuel Hidalgo-Ternero ◽

Gloria Corpas Pastor

Keyword(s):

Open Source ◽

Machine Translation ◽

Automatic Identification ◽

Neural Machine Translation ◽

Multiword Expressions ◽

Continuous Form ◽

Text Preprocessing ◽

Translation Systems ◽

Free Open Source

The present research introduces the tool gApp, a Python-based text preprocessing system for the automatic identification and conversion of discontinuous multiword expressions (MWEs) into their continuous form in order to enhance neural machine translation (NMT). To this end, an experiment with semi-fixed verb–noun idiomatic combinations (VNICs) will be carried out in order to evaluate to what extent gApp can optimise the performance of the two main free open-source NMT systems (Google Translate and DeepL) under the challenge of MWE discontinuity in the Spanish into English directionality. In the light of our promising results, the study concludes with suggestions on how to further optimise MWE-aware NMT systems.


Sulis: An Open Source Transfer Decoder for Deep Syntactic Statistical Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.2478/v10108-010-0005-7 ◽

2010 ◽

Vol 93 (1) ◽

pp. 17-26 ◽

Cited By ~ 1

Author(s):

Yvette Graham

Keyword(s):

Open Source ◽

Linear Combination ◽

Machine Translation ◽

Statistical Machine Translation ◽

Language Model ◽

Beam Search ◽

Translation Model ◽

Transfer Rules ◽

Log Linear

In this paper, we describe an open source transfer decoder for Deep Syntactic Transfer-Based Statistical Machine Translation. Transfer decoding involves the application of transfer rules to a source-language (SL) structure. The N-best target-language (TL) structures are found via a beam search of TL hypothesis structures, which are ranked via a log-linear combination of feature scores, such as translation model and dependency-based language model.
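The ranking scheme can be sketched in a few lines. This is a toy illustration and not the Sulis decoder: it searches over flat sequences of choices rather than dependency structures, assumes the feature values are log-scores that add up across steps, and uses hypothetical names (`log_linear_score`, `beam_search`) and made-up feature values and weights.

```python
import heapq


def log_linear_score(features, weights):
    """Weighted sum of feature scores: sum_i lambda_i * h_i."""
    return sum(weights[name] * value for name, value in features.items())


def beam_search(options_per_step, weights, beam_size):
    """Keep the beam_size best partial hypotheses after each step.

    options_per_step: list of steps; each step is a list of
                      (label, feature_dict) alternatives.
    Returns the surviving (score, labels) hypotheses, best first.
    """
    beam = [(0.0, [])]
    for options in options_per_step:
        candidates = []
        for score, labels in beam:
            for label, features in options:
                new_score = score + log_linear_score(features, weights)
                candidates.append((new_score, labels + [label]))
        # Prune: retain only the highest-scoring hypotheses.
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beam


# Made-up weights for a translation-model and a language-model feature.
weights = {"tm": 1.0, "lm": 0.5}
steps = [
    [("a", {"tm": -1.0, "lm": -2.0}), ("b", {"tm": -2.0, "lm": -0.5})],
    [("c", {"tm": -0.5, "lm": -1.0}), ("d", {"tm": -3.0, "lm": -0.2})],
]
n_best = beam_search(steps, weights, beam_size=2)
```

A larger `beam_size` explores more hypotheses at higher cost; the weights would normally be tuned on held-out data rather than set by hand as here.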


AN EVALUATION OF THE TRANSLATION OF THE FILM “RIO” BASED ON NEWMARK’S MODEL

VNU Journal of Foreign Studies ◽

10.25073/2525-2445/vnufs.4247 ◽

2018 ◽

Vol 34 (2) ◽

Author(s):

Tran Thi Ngan

Keyword(s):

Open Source ◽

Observational Methods ◽

Research Instrument ◽

Translation Quality ◽

Proper Nouns ◽

Research Findings ◽

Cross Platform ◽

Free Open Source

The study evaluated the translation quality of the Vietnamese version of the film “Rio”, which was translated and dubbed in the project of MegaStar Media Ltd. Company, Vietnam in April 2011. To reach its aim, the study used four methods including analysis and comparison, which were based on Newmark’s model. In addition, statistical and observational methods were also applied to examine the synchronisation of each utterance and its translated version. The research instrument was “Aegisub”, a free open-source cross-platform subtitle editing program designed for timing and styling of subtitles. The researcher’s purpose was to see how well utterances in both versions of the film are synchronised with each other. The research findings showed that in general, the film was well-translated in terms of structures, proper nouns, hierarchical pronouns, borrowed words and puns. Weaknesses of the translation were found in the title and several mistranslations. The study also revealed that the translated utterances were synchronised with the original ones quite well, especially in terms of duration.


Free/open-source machine translation: preface

Machine Translation ◽

10.1007/s10590-011-9113-x ◽

2011 ◽

Vol 25 (2) ◽

pp. 83-86 ◽

Cited By ~ 1

Author(s):

Felipe Sánchez-Martínez ◽

Mikel L. Forcada

Keyword(s):

Open Source ◽

Machine Translation ◽

Free Open Source


Margin Infused Relaxed Algorithm for Moses

Prague Bulletin of Mathematical Linguistics ◽

10.2478/v10108-011-0012-3 ◽

2011 ◽

Vol 96 (1) ◽

pp. 69-78 ◽

Cited By ~ 5

Author(s):

Eva Hasler ◽

Barry Haddow ◽

Philipp Koehn

Keyword(s):

Open Source ◽

Machine Translation ◽

Error Rate ◽

Statistical Machine Translation ◽

Experimental Results ◽

Minimum Error ◽

Feature Sets ◽

Translation Quality ◽

Core Feature ◽

Minimum Error Rate Training

We describe an open-source implementation of the Margin Infused Relaxed Algorithm (MIRA) for statistical machine translation (SMT). The implementation is part of the Moses toolkit and can be used as an alternative to standard minimum error rate training (MERT). A description of the implementation and its usage on core feature sets as well as large, sparse feature sets is given and we report experimental results comparing the performance of MIRA with MERT in terms of translation quality and stability.
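For readers unfamiliar with MIRA, the sketch below shows a common single-constraint form of the update, in which the weights move just far enough to separate a better translation from the model's current choice by a margin equal to its loss. It is an assumption-laden illustration, not the Moses implementation, which may differ in how it selects translations and handles multiple constraints; the function name `mira_update`, the feature names, and the clipping constant `C` are hypothetical.

```python
def mira_update(weights, feats_oracle, feats_hyp, loss, C=0.01):
    """One single-constraint MIRA-style update (illustrative form).

    weights:      current feature weights (dict name -> float)
    feats_oracle: features of a good (low-loss) translation
    feats_hyp:    features of the model's current translation
    loss:         how much worse the hypothesis is than the oracle
    C:            cap on the step size (assumed regularisation constant)
    """
    names = set(feats_oracle) | set(feats_hyp)
    diff = {n: feats_oracle.get(n, 0.0) - feats_hyp.get(n, 0.0) for n in names}

    # Current margin: model score of the oracle minus score of the hypothesis.
    margin = sum(weights.get(n, 0.0) * d for n, d in diff.items())
    norm_sq = sum(d * d for d in diff.values())

    violation = loss - margin
    if violation <= 0 or norm_sq == 0:
        return dict(weights)  # constraint already satisfied; no change

    # Smallest step that fixes the violation, clipped at C.
    tau = min(C, violation / norm_sq)
    updated = dict(weights)
    for n, d in diff.items():
        updated[n] = updated.get(n, 0.0) + tau * d
    return updated


new_weights = mira_update(
    {"lm": 0.0, "tm": 0.0},
    feats_oracle={"lm": 1.0, "tm": 2.0},
    feats_hyp={"lm": 2.0, "tm": 0.0},
    loss=1.0,
    C=1.0,
)
```

Unlike MERT, which re-optimises all weights in batch over an n-best list, an update of this kind is applied online, one example at a time, which is what makes it practical for large, sparse feature sets.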
