A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

2014 ◽  
Vol 2014 ◽  
pp. 1-10 ◽  
Author(s):  
Longyue Wang ◽  
Derek F. Wong ◽  
Lidia S. Chao ◽  
Yi Lu ◽  
Junwen Xing

Data selection has shown significant improvements in the effective use of training data: sentences extracted from large general-domain corpora are used to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three sentence selection techniques. The first is cosine tf-idf, which comes from the realm of information retrieval (IR). The second is a perplexity-based approach, which originates in language modeling. These two data selection techniques have already been applied to SMT in the literature; edit distance, however, is proposed for this task here for the first time. After investigating the individual models, we propose a combination of all three techniques at both the corpus level and the model level. Comparative experiments conducted on the Hong Kong law Chinese-English corpus indicate the following: (i) the constraint degree of the similarity measure is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; (iii) bilingual resources and combination methods help balance out-of-vocabulary (OOV) and irrelevant data; and (iv) our method consistently boosts overall translation performance, ensuring the quality a real-life SMT system requires.
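As a concrete illustration of the three criteria under comparison, the following minimal Python sketch scores general-domain sentences against a tiny in-domain sample using cosine tf-idf, a Moore-Lewis-style cross-entropy (perplexity) difference with add-one-smoothed unigram models, and word-level edit distance. The toy corpora, the unigram language models, and the normalizations are illustrative assumptions, not the paper's experimental setup.

```python
import math
from collections import Counter

# hypothetical in-domain (legal) and general-domain sentences
in_domain = ["the court finds the defendant liable",
             "the ordinance applies to the territory"]
general = ["i love sunny weather",
           "the court shall hear the appeal",
           "buy two get one free"]

def tfidf(sent, docs):
    tf = Counter(sent.split())
    n = len(docs)
    df = lambda w: sum(w in d.split() for d in docs)
    return {w: c * math.log((1 + n) / (1 + df(w))) for w, c in tf.items()}

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    return dot / (nu * nv) if nu and nv else 0.0

def xent(sent, corpus):
    # add-one-smoothed unigram cross-entropy of sent under corpus
    counts = Counter(w for d in corpus for w in d.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    words = sent.split()
    return -sum(math.log((counts[w] + 1) / (total + vocab))
                for w in words) / len(words)

def edit_distance(a, b):
    # word-level Levenshtein distance via one-row dynamic programming
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
            prev, dp[j] = dp[j], cur
    return dp[-1]

centroid = tfidf(" ".join(in_domain), in_domain)
for s in general:
    ir = cosine(tfidf(s, in_domain), centroid)     # higher = more in-domain
    ml = xent(s, in_domain) - xent(s, general)     # lower = more in-domain
    ed = min(edit_distance(s.split(), d.split()) for d in in_domain)
    print(f"tfidf={ir:.3f}  xent-diff={ml:+.3f}  edit={ed}  | {s}")
```

Ranking the general-domain sentences by each column already exposes the paper's point: the three criteria can disagree, which is what motivates combining them at the corpus and model levels.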

Author(s):  
Hoang Cuong ◽  
Khalil Sima’an ◽  
Ivan Titov

Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time, or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced subdomains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains, and observe consistent improvements over a baseline which does not explicitly reward domain invariance.
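One way to picture the specialization features in step (2) is to score each phrase pair by the normalized entropy of its usage distribution over the induced subdomains. This short sketch is only a guess at the flavor of such a feature, with invented counts; the paper's actual features may differ.

```python
import math

# hypothetical counts of phrase pairs across K=3 induced subdomains
phrase_counts = {
    ("house", "maison"): [40, 38, 42],   # used evenly -> domain-invariant
    ("sentence", "peine"): [95, 3, 2],   # concentrated -> domain-specific, risky
}

def invariance(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))  # 1.0 = fully domain-invariant

for phrase, counts in phrase_counts.items():
    print(phrase, round(invariance(counts), 3))
```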


2017 ◽  
Vol 108 (1) ◽  
pp. 283-294 ◽  
Author(s):  
Álvaro Peris ◽  
Mara Chinea-Ríos ◽  
Francisco Casacuberta

Abstract Corpora are precious resources, as they allow for a proper estimation of statistical machine translation models. Data selection is a variant of domain adaptation aimed at extracting those sentences from an out-of-domain corpus that are the most useful for translating a different target domain. We address the data selection problem in statistical machine translation as a classification task. We present a new method, based on neural networks, able to deal with monolingual and bilingual corpora. Empirical results show that our data selection method provides slightly better translation quality than a state-of-the-art method (cross-entropy), while requiring substantially less data. Moreover, the results are coherent across different language pairs, demonstrating the robustness of our proposal.
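A minimal sketch of the classification framing, assuming a bag-of-words encoder and a tiny feedforward network in PyTorch; the paper's actual architecture and sentence representations are richer, and the corpora here are invented.

```python
import torch
import torch.nn as nn

in_dom = ["the patient received a dose", "symptoms improved after treatment"]
out_dom = ["the stock market fell today", "he scored a late winning goal"]
vocab = {w: i for i, w in enumerate(sorted({w for s in in_dom + out_dom
                                            for w in s.split()}))}

def bow(sent):
    # bag-of-words vector over the joint vocabulary
    v = torch.zeros(len(vocab))
    for w in sent.split():
        if w in vocab:
            v[vocab[w]] = 1.0
    return v

X = torch.stack([bow(s) for s in in_dom + out_dom])
y = torch.tensor([1.0] * len(in_dom) + [0.0] * len(out_dom)).unsqueeze(1)

model = nn.Sequential(nn.Linear(len(vocab), 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):                      # tiny training loop
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# keep out-of-domain sentences the classifier scores as in-domain-like
for s in ["the patient improved today", "he bought stock yesterday"]:
    score = torch.sigmoid(model(bow(s))).item()
    print(f"{score:.2f}  {'select' if score > 0.5 else 'discard'}  {s}")
```

Sentences from the out-of-domain pool whose in-domain probability exceeds the threshold are kept for training the translation system.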


2021 ◽  
Vol 13 (7) ◽  
pp. 1270
Author(s):  
Chenhui Ma ◽  
Dexuan Sha ◽  
Xiaodong Mu

Unsupervised domain adaptation (UDA) based on adversarial learning for remote-sensing scene classification has become a research hotspot because of the need to alleviate the lack of annotated training data. Existing methods train classifiers according to their ability to distinguish features from the source or target domain. However, they suffer from two limitations: (1) the classifier is trained on source samples and forms a source-domain-specific boundary, which ignores features from the target domain; and (2) semantically meaningful features are built merely from the adversarial game between a generator and a discriminator, which neglects the selection of domain-invariant features. These issues limit how well the source and target distributions can be matched, since each domain has its own distinctive characteristics. To resolve them, we propose a framework with error-correcting boundaries and a feature adaptation metric. Specifically, we design an error-correcting boundaries mechanism that builds target-domain-specific classifier boundaries via multiple classifiers and an error-correcting discrepancy loss, which sharpens the separation of target samples and reduces their classification uncertainty. We then employ a feature adaptation metric structure that enhances the adaptation of ambiguous features via shallow layers of the backbone convolutional neural network and an alignment loss, automatically learning domain-invariant features. Experimental results on four public datasets show that our method outperforms other UDA methods for remote-sensing scene classification.
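To make the two losses more concrete, here is a minimal PyTorch sketch of (a) a discrepancy term between two classifiers' predictions on the same target batch, in the spirit of maximum-classifier-discrepancy methods, and (b) a first-moment alignment term between shallow source and target features. Both are simplified stand-ins; the paper's error-correcting discrepancy loss and alignment loss are more elaborate.

```python
import torch

def discrepancy(logits_a, logits_b):
    # disagreement of two classifiers on the same target-domain batch
    return (logits_a.softmax(dim=1) - logits_b.softmax(dim=1)).abs().mean()

def alignment(feat_src, feat_tgt):
    # match the mean of shallow backbone features across domains
    return (feat_src.mean(dim=0) - feat_tgt.mean(dim=0)).pow(2).sum()

# hypothetical batch: 8 samples, 10 scene classes, 64-dim shallow features
c1, c2 = torch.randn(8, 10), torch.randn(8, 10)
f_s, f_t = torch.randn(8, 64), torch.randn(8, 64)
print(discrepancy(c1, c2).item(), alignment(f_s, f_t).item())
```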


2020 ◽  
pp. 1-21
Author(s):  
Alberto Poncelas ◽  
Gideon Maillette de Buy Wenniger ◽  
Andy Way

Abstract In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so that they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a data selection technique that has demonstrated excellent performance in a number of tasks. The method maximizes the diversity of the n-grams in the training set by devaluing those that have already been included. We focus on this method to investigate in more depth how to select better training instances. We give an overview of FDA and propose improvements in terms of speed and quality. Using German-to-English parallel data, we first present a novel approach that decreases the execution time of FDA when multiple computation units are available. In addition, we improve translation quality by extending FDA with information from the parallel corpus that is generally ignored.
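The core of FDA can be written in a few lines. The sketch below greedily picks the candidate sentence whose test-set n-grams currently carry the most value, then halves the value of every n-gram just selected; the exponential decay rule, the length normalization and the toy data are assumptions for illustration.

```python
from collections import Counter

def ngrams(sent, max_n=3):
    toks = sent.split()
    return {" ".join(toks[i:i + n]) for n in range(1, max_n + 1)
            for i in range(len(toks) - n + 1)}

def fda_select(candidates, test_set, k):
    features = set().union(*(ngrams(s) for s in test_set))
    counts = Counter()              # how often each feature was selected
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        def score(s):
            fs = ngrams(s) & features
            # initial value 1, halved on each prior selection of the n-gram
            return sum(0.5 ** counts[f] for f in fs) / len(s.split())
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
        for f in ngrams(best) & features:
            counts[f] += 1
    return selected

test = ["the court dismissed the appeal"]
train = ["the appeal was dismissed by the court",
         "the court heard the appeal",
         "cats sleep most of the day"]
print(fda_select(train, test, 2))
```

Because an n-gram's value decays once it is covered, later picks are pushed toward sentences that contribute new n-grams, which is what maximizes diversity.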


2020 ◽  
Vol 44 (1) ◽  
pp. 33-50
Author(s):  
Ivan Dunđer

Machine translation is an increasingly active research topic in information and communication sciences, computer science and computational linguistics, as it enables communication and the transfer of meaning across different languages. Since Croatian can be considered low-resourced in terms of available services and technology, the development of new domain-specific machine translation systems is important, especially given the growing interest and needs of industry, academia and everyday users. Machine translation is not perfect, but it is crucial to ensure acceptable quality, which is purpose-dependent. In this research, several statistical machine translation systems were built, one of which employed domain adaptation with the intention of boosting machine translation output. Afterwards, an extensive evaluation was performed, combining several automatic quality metrics with human evaluation focused on various aspects, in order to assess the quality of the domain-specific machine-translated text.
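For the automatic part of such an evaluation, a typical step is scoring system output against references with corpus-level metrics. The sketch below uses the sacreBLEU library with invented sentences; the paper's exact metric set is not reproduced here.

```python
import sacrebleu

# hypothetical system output and references (one reference stream)
hypotheses = ["the parliament adopted the law yesterday"]
references = [["parliament passed the law yesterday"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")
```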


2019 ◽  
Author(s):  
Miaofeng Liu ◽  
Yan Song ◽  
Hongbin Zou ◽  
Tong Zhang

2020 ◽  
Vol 34 (05) ◽  
pp. 8245-8252
Author(s):  
Rumeng Li ◽  
Xun Wang ◽  
Hong Yu

Neural machine translation (NMT) models achieve state-of-the-art translation quality when large parallel corpora are available. However, their performance suffers significantly on domain-specific translation, where training data are usually scarce. In this paper, we present a novel NMT model with a new word embedding transition technique for fast domain adaptation. We propose to split the parameters of the model into two groups: model parameters and meta parameters. The former model the translation itself, while the latter adjust the representational space to generalize the model to different domains. We mimic the adaptation of the translation model to low-resource domains using multiple translation tasks on different domains. A new training strategy based on meta-learning is developed along with the proposed model to update the model parameters and meta parameters alternately. Experiments on datasets from different domains show substantial improvements in NMT performance with a limited amount of data.
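A rough sketch of the alternating update schedule, with a toy two-layer model standing in for an NMT system: one parameter group plays the role of the meta (representation) parameters and the other the model (translation) parameters, and they are updated in turn across several domain tasks. Everything below, from the toy tasks to the learning rates, is an illustrative assumption.

```python
import torch
import torch.nn as nn

embed = nn.Linear(4, 8)   # stands in for meta parameters (representation space)
head = nn.Linear(8, 4)    # stands in for model parameters (translation)
opt_model = torch.optim.SGD(head.parameters(), lr=0.1)
opt_meta = torch.optim.SGD(embed.parameters(), lr=0.01)

def domain_batch(seed):
    # toy "translation task": each domain maps inputs to a flipped copy
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(16, 4, generator=g)
    return x, x.flip(1)

for step in range(100):
    for domain in (0, 1, 2):             # cycle over domain tasks
        x, y = domain_batch(domain)
        loss = nn.functional.mse_loss(head(embed(x)), y)
        opt_model.zero_grad(); opt_meta.zero_grad()
        loss.backward()
        # alternate which parameter group is updated this step
        (opt_model if step % 2 == 0 else opt_meta).step()
```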


Author(s):  
Guanhua Chen ◽  
Yun Chen ◽  
Yong Wang ◽  
Victor O.K. Li

Leveraging lexical constraints is extremely important in domain-specific machine translation and interactive machine translation. Previous studies mainly focus on extending the beam search algorithm or augmenting the training corpus by replacing source phrases with their target translations. These methods either suffer from heavy computation costs during inference or depend on the quality of a bilingual dictionary that is pre-specified by the user or constructed with statistical machine translation. In response to these problems, we present a conceptually simple and empirically effective data augmentation approach for lexically constrained neural machine translation. Specifically, we construct constraint-aware training data by first randomly sampling phrases of the reference as constraints, and then packing them into the source sentence with a separation symbol. Extensive experiments on several language pairs demonstrate that our approach achieves superior translation results over existing systems, improving the translation of constrained sentences without hurting unconstrained ones.
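The augmentation itself is easy to sketch: sample a phrase from the reference and append it to the source behind a separator. The <sep> token name and the uniform single-phrase sampling below are assumptions; the paper may sample multiple constraints and use a different symbol.

```python
import random

def pack_constraints(src_tokens, ref_tokens, max_phrase_len=3, sep="<sep>"):
    # sample one reference phrase as a constraint and pack it into the source
    n = random.randint(1, max_phrase_len)
    if len(ref_tokens) < n:
        return src_tokens
    start = random.randrange(len(ref_tokens) - n + 1)
    constraint = ref_tokens[start:start + n]
    return src_tokens + [sep] + constraint

random.seed(0)
src = "wir haben das abkommen unterzeichnet".split()
ref = "we signed the agreement".split()
print(" ".join(pack_constraints(src, ref)))
```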


Terminology ◽  
2004 ◽  
Vol 10 (1) ◽  
pp. 131-153 ◽  
Author(s):  
Philippe Langlais ◽  
Michael Carl

The past decade has witnessed exciting work in the field of Statistical Machine Translation (SMT). However, accurate evaluation of its potential in real-life contexts is still an open question. In this study, we investigate the behavior of an SMT engine faced with a corpus far different from the one it was trained on. We show that terminological databases are obvious resources that should be used to boost the performance of a statistical engine. We propose and evaluate one way of integrating terminology into an SMT engine, which yields a significant reduction in word error rate.
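As a toy illustration of the kind of terminology injection evaluated here, the sketch below marks source spans found in a term base with their approved target terms, which a constrained decoder could then respect. The XML-style markup and the entries are invented for this example, not the paper's actual integration mechanism.

```python
# hypothetical English-to-French terminological database
term_base = {"machine translation": "traduction automatique",
             "word error rate": "taux d'erreur de mots"}

def mark_terms(sentence, terms):
    # annotate matched source spans with their approved target terms
    out = sentence
    for src, tgt in terms.items():
        out = out.replace(src, f'<term target="{tgt}">{src}</term>')
    return out

print(mark_terms("evaluating machine translation systems", term_base))
```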

