Integrating a Discriminative Classifier into Phrase-based and Hierarchical Decoding

Aleš Tamchyna; Fabienne Braune; Alexander Fraser; Marine Carpuat; Hal Daumé III; Chris Quirk

doi:10.2478/pralin-2014-0002

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0002 ◽

2014 ◽

Vol 101 (1) ◽

pp. 29-41

Author(s):

Aleš Tamchyna ◽

Fabienne Braune ◽

Alexander Fraser ◽

Marine Carpuat ◽

Hal Daumé III ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Sentence Context ◽

Discriminative Models ◽

Current State ◽

Source Sentence ◽

Independence Assumptions

Abstract Current state-of-the-art statistical machine translation (SMT) relies on simple feature functions which make independence assumptions at the level of phrases or hierarchical rules. However, it is well known that discriminative models can benefit from rich features extracted from the source sentence context outside of the applied phrase or hierarchical rule, which is available at decoding time. We present a framework for the open-source decoder Moses that allows discriminative models over source context to be trained easily on a large number of examples and then included as feature functions in decoding.

Otedama: Fast Rule-Based Pre-Ordering for Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2016-0015 ◽

2016 ◽

Vol 106 (1) ◽

pp. 159-168 ◽

Cited By ~ 1

Author(s):

Julian Hitschler ◽

Laura Jehl ◽

Sariya Karimova ◽

Mayumi Ohta ◽

Benjamin Körner ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Training Data ◽

Translation System ◽

Rule Based ◽

Machine Translation System ◽

Target Languages ◽

Established Technique

Abstract We present Otedama, a fast, open-source tool for rule-based syntactic pre-ordering, a well-established technique in statistical machine translation. Otedama implements both a learner for pre-ordering rules and a component for applying these rules to parsed sentences. Our system is compatible with several external parsers and capable of accommodating many source and all target languages in any machine translation paradigm which uses parallel training data. We demonstrate improvements on a patent translation task over a state-of-the-art English-Japanese hierarchical phrase-based machine translation system. We compare Otedama with an existing syntax-based pre-ordering system, showing comparable translation performance at a runtime speedup of a factor of 4.5-10.

Hierarchical Phrase-Based Translation with Jane 2

Prague Bulletin of Mathematical Linguistics ◽

10.2478/v10108-012-0007-8 ◽

2012 ◽

Vol 98 (1) ◽

pp. 37-50

Author(s):

Matthias Huck ◽

Jan-Thorsten Peter ◽

Markus Freitag ◽

Stephan Peitz ◽

Hermann Ney

Keyword(s):

Open Source ◽

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Experimental Results ◽

Insertion And Deletion

In this paper, we give a survey of several recent extensions to hierarchical phrase-based machine translation that have been implemented in version 2 of Jane, RWTH's open-source statistical machine translation toolkit. We focus on the following techniques: insertion and deletion models, lexical scoring variants, reordering extensions with non-lexicalized reordering rules and with a discriminative lexicalized reordering model, and soft string-to-dependency hierarchical machine translation. We describe the fundamentals of each of these techniques and present experimental results obtained with Jane 2 to confirm their usefulness in state-of-the-art hierarchical phrase-based translation (HPBT).

Translational equivalence in Statistical Machine Translation or meaning as co-occurrence

Linguistica Antverpiensia, New Series – Themes in Translation Studies ◽

10.52034/lanstts.v7i.215 ◽

2021 ◽

Vol 7 ◽

Author(s):

Lieve Macken ◽

Els Lefever

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Word Sense Disambiguation ◽

Statistical Machine Translation ◽

General Purpose ◽

Point Of View ◽

Word Alignment ◽

Word Sense ◽

Parallel Corpora ◽

Current State

In this paper, we will describe the current state of the art of Statistical Machine Translation (SMT), and reflect on how SMT handles meaning. Statistical Machine Translation is a corpus-based approach to MT: it derives the required knowledge to generate new translations from corpora. General-purpose SMT systems do not use any formal semantic representation. Instead, they directly extract translationally equivalent words or word sequences – expressions with the same meaning – from bilingual parallel corpora. All statistical translation models are based on the idea of word alignment, i.e., the automatic linking of corresponding words in parallel texts. The first-generation SMT systems were word-based. From a linguistic point of view, the major problem with word-based systems is that the meaning of a word is often ambiguous, and is determined by its context. Current state-of-the-art SMT systems try to capture the local contextual dependencies by using phrases instead of words as units of translation. In order to solve more complex ambiguity problems (where a broader text scope or even domain information is needed), a Word Sense Disambiguation (WSD) module is integrated in the Machine Translation environment.

Analyzing Subword Techniques to Improve English to Sinhala Neural Machine Translation

International Journal of Asian Language Processing ◽

10.1142/s2717554520500174 ◽

2021 ◽

pp. 2050017

Author(s):

Rashmini Naranpanawa ◽

Ravinga Perera ◽

Thilakshi Fonseka ◽

Uthayasanker Thayasivam

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Translation System ◽

Rare Word ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Word Level ◽

Morphologically Rich Languages

Neural machine translation (NMT) performs much better than statistical machine translation (SMT) models when parallel data is abundant. However, vanilla NMT operates primarily at the word level with a fixed vocabulary. Therefore, low-resource, morphologically rich languages such as Sinhala are strongly affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advancements in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the transformer and explore standard subword techniques on top of it to identify which subword approach has a greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.

An Open-Source Web-Based Tool for Resource-Agnostic Interactive Translation Prediction

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0015 ◽

2014 ◽

Vol 102 (1) ◽

pp. 69-80 ◽

Cited By ~ 2

Author(s):

Daniel Torregrosa ◽

Mikel L. Forcada ◽

Juan Antonio Pérez-Ortiz

Keyword(s):

Open Source ◽

Machine Translation ◽

Web Application ◽

Statistical Machine Translation ◽

Black Box ◽

Translation System ◽

Web Tool ◽

Web Based ◽

Strongly Coupled ◽

Machine Translation System

Abstract We present a web-based open-source tool for interactive translation prediction (ITP) and describe its underlying architecture. ITP systems assist human translators by making context-based computer-generated suggestions as they type. Most of the ITP systems in the literature are strongly coupled with a statistical machine translation system that is conveniently adapted to provide the suggestions. Our system, however, follows a resource-agnostic approach, and suggestions are obtained from any unmodified black-box bilingual resource. This paper reviews our ITP method and describes the architecture of Forecat, a web tool, partly based on the recent technology of web components, that eases the use of our ITP approach in any web application requiring this kind of translation assistance. We also evaluate the performance of our method when using an unmodified Moses-based statistical machine translation system as the bilingual resource.

Multi-engine machine translation with an open-source decoder for statistical machine translation

10.3115/1626355.1626381 ◽

2007 ◽

Cited By ~ 2

Author(s):

Yu Chen ◽

Andreas Eisele ◽

Christian Federmann ◽

Eva Hasler ◽

Michael Jellinghaus ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

Statistical Machine Translation

Integration of a Multilingual Preordering Component into a Commercial SMT Platform

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0009 ◽

2017 ◽

Vol 108 (1) ◽

pp. 61-72

Author(s):

Anita Ramm ◽

Riccardo Superbo ◽

Dimitar Shterionov ◽

Tony O’Dowd ◽

Alexander Fraser

Keyword(s):

Open Source ◽

Machine Translation ◽

Long Range ◽

Significant Role ◽

Processing Speed ◽

Statistical Machine Translation ◽

Neural Machine Translation ◽

Open Source Tool

Abstract We present a multilingual preordering component tailored for a commercial Statistical Machine Translation platform. In commercial settings, issues such as processing speed as well as the ability to adapt models to the customers’ needs play a significant role and have a big impact on the choice of approaches that are added to the custom pipeline to deal with specific problems such as long-range reorderings. We developed a fast and customisable preordering component, also available as an open-source tool, which comes along with a generic implementation that is restricted neither to the translation platform nor to the Machine Translation paradigm. We test preordering on three language pairs: English→Japanese/German/Chinese for both Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). Our experiments confirm previously reported improvements in the SMT output when the models are trained on preordered data, but they also show that preordering does not improve NMT.

Translation of Medical Texts using Neural Networks

International Journal of Reliable and Quality E-Healthcare ◽

10.4018/ijrqeh.2016100104 ◽

2016 ◽

Vol 5 (4) ◽

pp. 51-66 ◽

Cited By ~ 5

Author(s):

Krzysztof Wolk ◽

Krzysztof P. Marasek

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

European Medicines Agency ◽

Translation System ◽

Training Methods ◽

Neural Machine Translation ◽

Machine Translation System ◽

Source Sentence ◽

Parallel Text ◽

Translation Systems

The quality of machine translation is rapidly evolving. Today one can find several machine translation systems on the web that provide reasonable translations, although the systems are not perfect. In some specific domains, the quality may decrease. A recently proposed approach to this domain is neural machine translation. It aims at building a jointly-tuned single neural network that maximizes translation performance, a very different approach from traditional statistical machine translation. Recently proposed neural machine translation models often belong to the encoder-decoder family in which a source sentence is encoded into a fixed length vector that is, in turn, decoded to generate a translation. The present research examines the effects of different training methods on a Polish-English Machine Translation system used for medical data. The European Medicines Agency parallel text corpus was used as the basis for training of neural and statistical network-based translation systems. A comparison and implementation of a medical translator is the main focus of our experiments.

Named entity recognition for Polish

Poznan Studies in Contemporary Linguistics ◽

10.1515/psicl-2019-0010 ◽

2019 ◽

Vol 55 (2) ◽

pp. 239-269

Author(s):

Michał Marcińczuk ◽

Aleksander Wawer

Keyword(s):

Open Source ◽

State Of The Art ◽

Proper Names ◽

Named Entity Recognition ◽

Entity Recognition ◽

Coarse Grained ◽

Named Entity ◽

Current State ◽

Annotated Corpora ◽

Available Resources

Abstract In this article we discuss the current state of the art in named entity recognition for Polish. We present publicly available resources and open-source tools for named entity recognition. The overview includes various kinds of resources, i.e. guidelines, annotated corpora (NKJP, KPWr, CEN, PST) and lexicons (NELexiconS, PNET, Gazetteer). We present the major NER tools for Polish (Sprout, NERF, Liner2, Parallel LSTM-CRFs and PolDeepNer) and discuss their performance on the reference datasets. In the article we cover identification of named entity mentions in running text, local and global entity categorization, fine- and coarse-grained categorization, and lemmatization of proper names.

Joshua 6: A phrase-based and hierarchical statistical machine translation system

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2015-0009 ◽

2015 ◽

Vol 104 (1) ◽

pp. 5-16 ◽

Cited By ~ 1

Author(s):

Matt Post ◽

Yuan Cao ◽

Gaurav Kumar

Keyword(s):

Open Source ◽

Machine Translation ◽

Large Scale ◽

Statistical Machine Translation ◽

End Users ◽

Translation System ◽

Tight Coupling ◽

Single Function ◽

Black Boxes ◽

Machine Translation System

Abstract We describe the version six release of Joshua, an open-source statistical machine translation toolkit. The main difference from release five is the introduction of a simple, unlexicalized, phrase-based stack decoder. This phrase-based decoder shares a hypergraph format with the syntax-based systems, permitting a tight coupling with the existing codebase of feature functions and hypergraph tools. Joshua 6 also includes a number of large-scale discriminative tuners and a simplified sparse feature function interface with reflection-based loading, which allows new features to be used by writing a single function. Finally, Joshua includes a number of simplifications and improvements focused on usability for both researchers and end-users, including the release of language packs — precompiled models that can be run as black boxes.
