Preordering using a Target-Language Parser via Cross-Language Syntactic Projection for Statistical Machine Translation

Isao Goto; Masao Utiyama; Eiichiro Sumita; Sadao Kurohashi

doi:10.1145/2699925

An Experimental Platform for Cross-Language Document Retrieval

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.284-287.3325 ◽

2013 ◽

Vol 284-287 ◽

pp. 3325-3329

Author(s):

Long Yue Wang ◽

Derek F. Wong ◽

Lidia S. Chao

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Document Retrieval ◽

Training Data ◽

Target Language ◽

Source Language ◽

Experimental Platform ◽

Precision Evaluation ◽

Query Generation ◽

Cross Language

This paper presents a Cross-Language Document Retrieval experimental platform that integrates modules for training-data preprocessing, document translation, query generation, document retrieval, and precision evaluation. Given a document in the source language, a statistical machine translation module, trained on selected training data, translates it into the target language. The query generation module then selects the most relevant words in the translated document as the search query. After the document retrieval module ranks all documents in the target language, the system chooses the N-best documents as the target-language versions of the source document. Finally, a precision evaluator scores the results, which reflects the merits of the strategies used. Experimental results showed that this platform was effective and achieved very good performance.
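The query generation, ranking, and precision steps described above can be illustrated with a minimal sketch. It assumes the document has already been translated by an external SMT system, and it uses a plain TF-IDF weighting that the abstract does not specify. All function names (`idf_table`, `generate_query`, `rank_documents`, `precision_at_n`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real platform would use language-specific tools.
    return text.lower().split()


def idf_table(docs):
    # Smoothed inverse document frequency over the target-language collection.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def generate_query(translated_doc, idf, k=3):
    # Keep the k terms with the highest TF-IDF weight (assumed relevance measure).
    # Terms absent from the collection get weight 0 because they cannot match anything.
    tf = Counter(tokenize(translated_doc))
    scored = {term: count * idf.get(term, 0.0) for term, count in tf.items()}
    return sorted(scored, key=lambda term: (-scored[term], term))[:k]


def rank_documents(query, docs, idf, n_best=1):
    # Score each document by the summed TF-IDF weight of the query terms it contains.
    scores = []
    for index, doc in enumerate(docs):
        tf = Counter(tokenize(doc))
        scores.append((sum(tf[term] * idf.get(term, 0.0) for term in query), index))
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scores[:n_best]]


def precision_at_n(retrieved, relevant):
    # Fraction of the retrieved documents that are true target-language versions.
    if not retrieved:
        return 0.0
    return sum(1 for index in retrieved if index in relevant) / len(retrieved)
```

A caller would build the IDF table once over the target-language collection, generate a query from each translated document, and compare the N-best indices against the known parallel document to obtain precision.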


Analysis Accuracy of Similar Word Based Clustering (EWSB) Algorithm on Machine Translator Bahasa Indonesia-Minang

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽

10.22219/kinetik.v3i3.241 ◽

2018 ◽

Vol 3 (3) ◽

Author(s):

Herry Sujaini

Keyword(s):

Machine Translation ◽

Clustering Algorithm ◽

Statistical Machine Translation ◽

Target Language ◽

Word Similarity ◽

Similar Word ◽

Word Clustering ◽

Translation Accuracy ◽

Bahasa Indonesia

Extended Word Similarity Based (EWSB) Clustering is a word clustering algorithm based on word similarity values computed from a corpus. One benefit of clustering with this algorithm is improved translation quality in statistical machine translation. Previous research showed that the EWSB algorithm could improve an Indonesian-English translator, where the algorithm was applied to Indonesian as the target language. This paper discusses the results of applying the EWSB algorithm to an Indonesian-to-Minang statistical machine translator, where the algorithm is applied to Minang as the target language. The results show that the EWSB algorithm is quite effective when applied to Minang as the target language: it improves translation accuracy by 6.36%.


Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0025 ◽

2017 ◽

Vol 108 (1) ◽

pp. 257-269 ◽

Cited By ~ 4

Author(s):

Nasser Zalmout ◽

Nizar Habash

Keyword(s):

Machine Translation ◽

Performance Enhancement ◽

Statistical Machine Translation ◽

Target Language ◽

Source Language ◽

Context Variable ◽

Significant Performance ◽

Morphologically Rich Languages ◽

Target Languages ◽

Language Text

Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within the same text, and also for different target languages. We apply this approach to Arabic as a source language, with five target languages of varying morphological complexity: English, French, Spanish, Russian and Chinese. Our results show that different target languages indeed require different source-language schemes, and that a context-variable tokenization scheme can outperform a context-constant scheme with a statistically significant performance enhancement of about 1.4 BLEU points.


MTIL2017: Machine Translation Using Recurrent Neural Network on Statistical Machine Translation

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0016 ◽

2019 ◽

Vol 28 (3) ◽

pp. 447-453 ◽

Cited By ~ 5

Author(s):

Sainik Kumar Mahata ◽

Dipankar Das ◽

Sivaji Bandyopadhyay

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Language Model ◽

Target Language ◽

Data Sets ◽

Shared Task ◽

Automatic Translation ◽

External Data ◽

Statistical Mt

Machine translation (MT) is the automatic translation of the source language to its target language by a computer system. In the current paper, we propose an approach of using recurrent neural networks (RNNs) over traditional statistical MT (SMT). We compare the performance of the phrase table of SMT to the performance of the proposed RNN and in turn improve the quality of the MT output. This work was done as part of the shared task provided by MTIL2017. We constructed the traditional MT model using the Moses toolkit and additionally enriched the language model using external data sets. Thereafter, we ranked the phrase tables using an RNN encoder-decoder module created originally as part of the GroundHog project of the LISA lab.


Predicting and Using a Pragmatic Component of Lexical Aspect

Linguistic Issues in Language Technology ◽

10.33011/lilt.v13i.1389 ◽

2016 ◽

Vol 13 ◽

Author(s):

Sharid Loáiciga ◽

Cristina Grisot

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Target Language ◽

Translation System ◽

Automatic Annotation ◽

Parallel Corpora ◽

Linguistic Data ◽

Lexical Aspect ◽

Machine Translation System ◽

Automatic Systems

This paper proposes a method for improving the results of a statistical Machine Translation system using boundedness, a pragmatic component of the verbal phrase’s lexical aspect. First, the paper presents manual and automatic annotation experiments for lexical aspect in English-French parallel corpora. It will be shown that this aspectual property is identified and classified with ease both by humans and by automatic systems. Second, Statistical Machine Translation experiments using the boundedness annotations are presented. These experiments show that the information regarding lexical aspect is useful to improve the output of a Machine Translation system in terms of better choices of verbal tenses in the target language, as well as better lexical choices. Ultimately, this work aims at providing a method for the automatic annotation of data with boundedness information and at contributing to Machine Translation by taking into account linguistic data.


Topic-Based Dissimilarity and Sensitivity Models for Translation Rule Selection

Journal of Artificial Intelligence Research ◽

10.1613/jair.4265 ◽

2014 ◽

Vol 50 ◽

pp. 1-30 ◽

Cited By ~ 3

Author(s):

M. Zhang ◽

X. Xiao ◽

D. Xiong ◽

Q. Liu

Keyword(s):

Machine Translation ◽

Topic Model ◽

Statistical Machine Translation ◽

Model Space ◽

Target Language ◽

Translation Quality ◽

Rule Selection ◽

Translation Rule ◽

Selection Experiments ◽

Target Side

Translation rule selection is the task of selecting appropriate translation rules for an ambiguous source-language segment. As translation ambiguities are pervasive in statistical machine translation, we introduce two topic-based models for translation rule selection that incorporate global topic information into translation disambiguation. We associate each synchronous translation rule with source- and target-side topic distributions. With these topic distributions, we propose a topic dissimilarity model to select desirable (less dissimilar) rules by imposing penalties for rules with a large value of dissimilarity of their topic distributions to those of given documents. In order to encourage the use of non-topic-specific translation rules, we also present a topic sensitivity model to balance translation rule selection between generic rules and topic-specific rules. Furthermore, we project target-side topic distributions onto the source-side topic model space so that we can benefit from topic information of both the source and target language. We integrate the proposed topic dissimilarity and sensitivity models into hierarchical phrase-based machine translation for synchronous translation rule selection. Experiments show that our topic-based translation rule selection model can substantially improve translation quality.
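As a rough illustration of the two ideas in this abstract, the sketch below penalizes a rule by the distance between its topic distribution and the document's, and uses entropy as a proxy for how topic-specific a rule is. The abstract does not name the distance or the sensitivity measure, so the Hellinger distance and Shannon entropy used here are assumptions, and the function names are hypothetical.

```python
import math


def hellinger(p, q):
    # Hellinger distance between two topic distributions of equal length (range 0 to 1).
    # Assumed dissimilarity measure; the abstract does not specify one.
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def entropy(p):
    # Shannon entropy: high for generic rules whose mass is spread over topics,
    # low for topic-specific rules. Assumed sensitivity proxy.
    return -sum(a * math.log(a) for a in p if a > 0.0)


def pick_rule(rule_topics, doc_topics):
    # Choose the candidate rule whose topic distribution is least dissimilar to the document's.
    return min(rule_topics, key=lambda name: hellinger(rule_topics[name], doc_topics))
```

In a decoder, such scores would typically enter as additional features alongside the usual translation and language model features rather than as a hard selection, so `pick_rule` should be read only as a way to see the dissimilarity penalty in isolation.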


Language Models for Machine Translation: Original vs. Translated Texts

Computational Linguistics ◽

10.1162/coli_a_00111 ◽

2012 ◽

Vol 38 (4) ◽

pp. 799-825 ◽

Cited By ~ 14

Author(s):

Gennadi Lembersky ◽

Noam Ordan ◽

Shuly Wintner

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Translation Studies ◽

Language Models ◽

Target Language ◽

Reference Set ◽

Original Target

We investigate the differences between language models compiled from original target-language texts and those compiled from texts manually translated to the target language. Corroborating established observations of Translation Studies, we demonstrate that the latter are significantly better predictors of translated sentences than the former, and hence fit the reference set better. Furthermore, translated texts yield better language models for statistical machine translation than original texts.
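One way to make "better predictors of translated sentences" concrete is to compare the perplexity that two language models assign to the same reference text. The sketch below assumes perplexity as the fit measure and uses an add-one-smoothed unigram model purely for brevity; practical systems use higher-order n-gram models. The function names and the toy sentences in the usage note are illustrative only.

```python
import math
from collections import Counter


def train_unigram(sentences):
    # Count word occurrences; returns (counts, total number of tokens).
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return counts, sum(counts.values())


def perplexity(model, sentences, vocab_size):
    # Per-token perplexity under add-one smoothing over a shared vocabulary size.
    counts, total = model
    log_sum = 0.0
    n_tokens = 0
    for sentence in sentences:
        for word in sentence.split():
            log_sum += math.log((counts[word] + 1) / (total + vocab_size))
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)
```

To compare two corpora, train one model on each, fix a shared vocabulary size, and evaluate both on the same reference set; the model with the lower perplexity fits that set better.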


The Alignment Template Approach to Statistical Machine Translation

Computational Linguistics ◽

10.1162/0891201042544884 ◽

2004 ◽

Vol 30 (4) ◽

pp. 417-449 ◽

Cited By ~ 212

Author(s):

Franz Josef Och ◽

Hermann Ney

Keyword(s):

Machine Translation ◽

Word Order ◽

Search Algorithm ◽

Statistical Machine Translation ◽

Target Language ◽

Linear Modeling ◽

Translation Model ◽

Log Linear ◽

Translation Systems ◽

English Canadian

A phrase-based statistical machine translation approach — the alignment template approach — is described. This translation approach allows for general many-to-many relations between words. Thereby, the context of words is taken into account in the translation model, and local changes in word order from source to target language can be learned explicitly. The model is described using a log-linear modeling approach, which is a generalization of the often used source-channel approach. Thereby, the model is easier to extend than classical statistical machine translation systems. We describe in detail the process for learning phrasal translations, the feature functions used, and the search algorithm. The evaluation of this approach is performed on three different tasks. For the German-English speech Verbmobil task, we analyze the effect of various system components. On the French-English Canadian Hansards task, the alignment template system obtains significantly better results than a single-word-based translation model. In the Chinese-English 2002 National Institute of Standards and Technology (NIST) machine translation evaluation it yields statistically significantly better NIST scores than all competing research and commercial translation systems.
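For readers unfamiliar with the log-linear formulation mentioned above, it can be written as follows. The notation is an assumption made for illustration: $f$ is the source sentence, $e$ a candidate target sentence, $h_m$ the feature functions, and $\lambda_m$ their weights.

```latex
\Pr(e \mid f) =
  \frac{\exp\left( \sum_{m=1}^{M} \lambda_m h_m(e, f) \right)}
       {\sum_{e'} \exp\left( \sum_{m=1}^{M} \lambda_m h_m(e', f) \right)},
\qquad
\hat{e} = \operatorname*{arg\,max}_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)
```

The source-channel approach is recovered as the special case $M = 2$ with $h_1(e, f) = \log \Pr(e)$, $h_2(e, f) = \log \Pr(f \mid e)$, and $\lambda_1 = \lambda_2 = 1$, which is why adding further feature functions extends the model without changing the search criterion.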


Exploring cross-language statistical machine translation for closely related South Slavic languages

10.3115/v1/w14-4210 ◽

2014 ◽

Cited By ~ 1

Author(s):

Maja Popović ◽

Nikola Ljubešić

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Slavic Languages ◽

Cross Language


Improving Statistical Machine Translation by Adapting Translation Models to Translationese

Computational Linguistics ◽

10.1162/coli_a_00159 ◽

2013 ◽

Vol 39 (4) ◽

pp. 999-1023 ◽

Cited By ~ 5

Author(s):

Gennadi Lembersky ◽

Noam Ordan ◽

Shuly Wintner

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Target Language ◽

Common Assumption ◽

Parallel Corpora ◽

The Common ◽

The Right ◽

Translation Systems ◽

Parallel Texts

Translation models used for statistical machine translation are compiled from parallel corpora that are manually translated. The common assumption is that parallel texts are symmetrical: The direction of translation is deemed irrelevant and is consequently ignored. Much research in Translation Studies indicates that the direction of translation matters, however, as translated language (translationese) has many unique properties. It has already been shown that phrase tables constructed from parallel corpora translated in the same direction as the translation task outperform those constructed from corpora translated in the opposite direction. We reconfirm that this is indeed the case, but emphasize the importance of also using texts translated in the “wrong” direction. We take advantage of information pertaining to the direction of translation in constructing phrase tables by adapting the translation model to the special properties of translationese. We explore two adaptation techniques: First, we create a mixture model by interpolating phrase tables trained on texts translated in the “right” and the “wrong” directions. The weights for the interpolation are determined by minimizing perplexity. Second, we define entropy-based measures that estimate the correspondence of target-language phrases to translationese, thereby eliminating the need to annotate the parallel corpus with information pertaining to the direction of translation. We show that incorporating these measures as features in the phrase tables of statistical machine translation systems results in consistent, statistically significant improvement in the quality of the translation.
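The first adaptation technique, interpolating two phrase tables with a weight chosen to minimize perplexity, can be sketched as below. This is a simplified illustration under stated assumptions: each table is a dictionary mapping a hypothetical (source phrase, target phrase) pair to a conditional probability, there is a single interpolation weight, and the weight is found by grid search over a held-out list of phrase pairs. None of these details are given in the abstract.

```python
import math


def interpolate(p_right, p_wrong, lam):
    # Linear mixture: lam * p_right + (1 - lam) * p_wrong for every phrase pair.
    pairs = set(p_right) | set(p_wrong)
    return {pair: lam * p_right.get(pair, 0.0) + (1.0 - lam) * p_wrong.get(pair, 0.0)
            for pair in pairs}


def perplexity(table, dev_pairs, floor=1e-9):
    # Perplexity of the held-out phrase pairs; the floor avoids log(0) for unseen pairs.
    log_sum = sum(math.log(max(table.get(pair, 0.0), floor)) for pair in dev_pairs)
    return math.exp(-log_sum / len(dev_pairs))


def best_weight(p_right, p_wrong, dev_pairs, steps=20):
    # Grid search over lam in [0, 1] for the mixture with the lowest held-out perplexity.
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda lam: perplexity(interpolate(p_right, p_wrong, lam), dev_pairs))
```

A grid search is used only because it is easy to verify; an expectation-maximization style update would be a common alternative for fitting mixture weights.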
