Adapting to All Domains at Once: Rewarding Domain Invariance in SMT

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00086 ◽

2016 ◽

Vol 4 ◽

pp. 99-112 ◽

Cited By ~ 5

Author(s):

Hoang Cuong ◽

Khalil Sima’an ◽

Ivan Titov

Keyword(s):

Machine Translation ◽

Domain Adaptation ◽

Statistical Machine Translation ◽

Small Sample ◽

Training Data ◽

User Needs ◽

Target Domain ◽

Training Time ◽

Feature Weights ◽

Domain Specific

Existing work on domain adaptation for statistical machine translation has consistently assumed access to a small sample from the test distribution (target domain) at training time. In practice, however, the target domain may not be known at training time or it may change to match user needs. In such situations, it is natural to push the system to make safer choices, giving higher preference to domain-invariant translations, which work well across domains, over risky domain-specific alternatives. We encode this intuition by (1) inducing latent subdomains from the training data only; (2) introducing features which measure how specialized phrases are to individual induced sub-domains; (3) estimating feature weights on out-of-domain data (rather than on the target domain). We conduct experiments on three language pairs and a number of different domains. We observe consistent improvements over a baseline which does not explicitly reward domain invariance.
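
As a rough illustration of step (2) in the abstract above, the sketch below scores how evenly a phrase pair is spread across induced subdomains. It is an assumption-laden stand-in, not the feature set defined in the paper: the function names, the use of normalized entropy, and the per-subdomain counts are all hypothetical.

```python
import math

def subdomain_posterior(counts):
    """Turn per-subdomain (expected) counts of a phrase pair into a distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def invariance_score(counts):
    """Normalized entropy in [0, 1] (hypothetical feature).

    1.0 means the phrase pair is spread evenly over all subdomains
    (domain-invariant); 0.0 means it occurs in a single subdomain
    (domain-specific).
    """
    if len(counts) < 2:
        return 0.0
    p = subdomain_posterior(counts)
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return entropy / math.log(len(counts))

# Hypothetical counts of two phrase pairs over four induced subdomains.
general_phrase = [5, 5, 5, 5]
specialized_phrase = [20, 0, 0, 0]
print(invariance_score(general_phrase), invariance_score(specialized_phrase))
```

A decoder feature of this kind would let the tuned weights favor evenly spread phrase pairs when the target domain is unknown.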


Neural Networks Classifier for Data Selection in Statistical Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0027 ◽

2017 ◽

Vol 108 (1) ◽

pp. 283-294 ◽

Cited By ~ 1

Author(s):

Álvaro Peris ◽

Mara Chinea-Ríos ◽

Francisco Casacuberta

Keyword(s):

Neural Networks ◽

Machine Translation ◽

Domain Adaptation ◽

Statistical Machine Translation ◽

Data Selection ◽

Target Domain ◽

Translation Quality ◽

Bilingual Corpora ◽

Proper Estimation ◽

Adaptation Field

Corpora are precious resources, as they allow for a proper estimation of statistical machine translation models. Data selection is a variant of domain adaptation that aims to extract those sentences from an out-of-domain corpus that are the most useful for translating a different target domain. We address the data selection problem in statistical machine translation as a classification task. We present a new method, based on neural networks, able to deal with monolingual and bilingual corpora. Empirical results show that our data selection method provides slightly better translation quality, compared to a state-of-the-art method (cross-entropy), requiring substantially less data. Moreover, the results obtained are coherent across different language pairs, demonstrating the robustness of our proposal.
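
For context on the cross-entropy baseline named in the abstract above, here is a minimal sketch of cross-entropy-difference selection. It assumes add-one-smoothed unigram language models in place of the n-gram models normally used, and the class and function names are hypothetical. It does not reproduce the neural classifier the paper proposes.

```python
import math
from collections import Counter

class UnigramLM:
    """Add-one-smoothed unigram language model (a simplifying assumption)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab)

    def cross_entropy(self, sentence):
        """Per-token cross-entropy of the sentence in bits."""
        toks = sentence.split()
        logp = sum(
            math.log2((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in toks
        )
        return -logp / max(len(toks), 1)

def select(in_domain, out_domain, k):
    """Keep the k out-of-domain sentences with the lowest H_in(s) - H_out(s)."""
    vocab = {t for s in in_domain + out_domain for t in s.split()}
    lm_in = UnigramLM(in_domain, vocab)
    lm_out = UnigramLM(out_domain, vocab)
    ranked = sorted(
        out_domain,
        key=lambda s: lm_in.cross_entropy(s) - lm_out.cross_entropy(s),
    )
    return ranked[:k]

in_domain = ["the patient received the drug", "the drug dose was reduced"]
out_domain = [
    "the patient was given a drug",
    "stock markets fell sharply today",
    "the dose of the drug",
]
print(select(in_domain, out_domain, k=2))
```

Sentences that look likely under the in-domain model but unlikely under the general model score lowest and are selected first.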


Unsupervised Adversarial Domain Adaptation with Error-Correcting Boundaries and Feature Adaption Metric for Remote-Sensing Scene Classification

Remote Sensing ◽

10.3390/rs13071270 ◽

2021 ◽

Vol 13 (7) ◽

pp. 1270

Author(s):

Chenhui Ma ◽

Dexuan Sha ◽

Xiaodong Mu

Keyword(s):

Remote Sensing ◽

Domain Adaptation ◽

Training Data ◽

Scene Classification ◽

Target Domain ◽

Domain Specific ◽

Invariant Features ◽

Distribution Matching ◽

Feature Adaptation ◽

Public Datasets

Unsupervised domain adaptation (UDA) based on adversarial learning for remote-sensing scene classification has become a research hotspot because of the need to alleviate the lack of annotated training data. Existing methods train classifiers according to their ability to distinguish features from source or target domains. However, they suffer from the following two limitations: (1) the classifier is trained on source samples and forms a source-domain-specific boundary, which ignores features from the target domain, and (2) semantically meaningful features are built merely from the adversarial interplay of a generator and a discriminator, which ignores the selection of domain-invariant features. These issues limit the distribution matching performance of source and target domains, since each domain has its distinctive characteristics. To resolve these issues, we propose a framework with error-correcting boundaries and a feature adaptation metric. Specifically, we design an error-correcting boundaries mechanism to build target-domain-specific classifier boundaries via multi-classifiers and an error-correcting discrepancy loss, which significantly distinguish target samples and reduce their distinguished uncertainty. Then, we employ a feature adaptation metric structure to enhance the adaptation of ambiguous features via shallow layers of the backbone convolutional neural network and an alignment loss, which automatically learns domain-invariant features. Experimental results on four public datasets show that the framework outperforms other UDA methods for remote-sensing scene classification.


A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation

The Scientific World JOURNAL ◽

10.1155/2014/745485 ◽

2014 ◽

Vol 2014 ◽

pp. 1-10 ◽

Cited By ~ 5

Author(s):

Longyue Wang ◽

Derek F. Wong ◽

Lidia S. Chao ◽

Yi Lu ◽

Junwen Xing

Keyword(s):

Domain Adaptation ◽

Statistical Machine Translation ◽

Real Life ◽

Training Data ◽

Data Selection ◽

Domain Specific ◽

Combination Methods ◽

Depth Analysis ◽

Similarity Measuring ◽

The Individual

Data selection has shown significant improvements in the effective use of training data by extracting sentences from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three different sentence selection techniques. The first one is cosine tf-idf, which comes from the realm of information retrieval (IR). The second is a perplexity-based approach, which can be found in the field of language modeling. These two data selection techniques applied to SMT have already been presented in the literature. However, edit distance for this task is proposed in this paper for the first time. After investigating the individual models, a combination of all three techniques is proposed at both corpus level and model level. Comparative experiments are conducted on a Hong Kong law Chinese-English corpus and the results indicate the following: (i) the constraint degree of similarity measuring is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; but (iii) bilingual resources and combination methods are helpful to balance out-of-vocabulary (OOV) and irrelevant data; (iv) finally, our method achieves the goal of consistently boosting the overall translation performance, which can ensure optimal quality of a real-life SMT system.
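
Two of the three selection criteria named in the abstract above, cosine tf-idf and edit distance, can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, a smoothed idf formula, token-level Levenshtein distance, and scoring each general-domain sentence against its closest in-domain sentence. The function names are hypothetical and the paper's exact formulation may differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse tf-idf vectors for tokenized documents (smoothed idf, an assumption)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def edit_distance(a, b):
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def rank_by_cosine(in_domain, general):
    """Rank general-domain sentences by best cosine match to any in-domain sentence."""
    vecs = tfidf_vectors([s.split() for s in in_domain + general])
    in_vecs, gen_vecs = vecs[: len(in_domain)], vecs[len(in_domain):]
    scored = [(max(cosine(g, q) for q in in_vecs), s) for g, s in zip(gen_vecs, general)]
    return [s for _, s in sorted(scored, key=lambda p: -p[0])]

def rank_by_edit_distance(in_domain, general):
    """Rank general-domain sentences by smallest edit distance to any in-domain sentence."""
    queries = [s.split() for s in in_domain]
    scored = [(min(edit_distance(s.split(), q) for q in queries), s) for s in general]
    return [s for _, s in sorted(scored, key=lambda p: p[0])]

in_domain = ["the court shall hear the appeal"]
general = [
    "the appeal was heard by the court",
    "i like green tea",
    "the court shall hear the case",
]
print(rank_by_cosine(in_domain, general))
print(rank_by_edit_distance(in_domain, general))
```

Edit distance is the stricter criterion of the two: it rewards near-identical word order, whereas cosine tf-idf only looks at shared vocabulary.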


Machine Translation System for the Industry Domain and Croatian Language

Journal of information and organizational sciences ◽

10.31341/jios.44.1.2 ◽

2020 ◽

Vol 44 (1) ◽

pp. 33-50

Author(s):

Ivan Dunđer

Keyword(s):

Machine Translation ◽

Computational Linguistics ◽

Technology Development ◽

Domain Adaptation ◽

Statistical Machine Translation ◽

Translation System ◽

Domain Specific ◽

Extensive Evaluation ◽

Machine Translation System ◽

Translation Systems

Machine translation is increasingly becoming a hot research topic in information and communication sciences, computer science and computational linguistics, because it enables communication and the transfer of meaning across different languages. As the Croatian language can be considered low-resourced in terms of available services and technology, the development of new domain-specific machine translation systems is important, especially given the rising interest and needs of industry, academia and everyday users. Machine translation is not perfect, but it is crucial to assure acceptable quality, which is purpose-dependent. In this research, different statistical machine translation systems were built, one of which used domain adaptation with the intention of boosting the output of machine translation. Afterwards, an extensive evaluation was performed by applying several automatic quality metrics and human evaluation with a focus on various aspects. The evaluation is done in order to assess the quality of specific machine-translated text.


MetaMT, a Meta Learning Method Leveraging Multiple Domain Data for Low Resource Machine Translation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6339 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8245-8252

Author(s):

Rumeng Li ◽

Xun Wang ◽

Hong Yu

Keyword(s):

Machine Translation ◽

Domain Adaptation ◽

Training Data ◽

Model Parameters ◽

Parallel Corpora ◽

Training Strategy ◽

Low Resource ◽

Domain Specific ◽

Representational Space ◽

Meta Learning

Neural machine translation (NMT) models have achieved state-of-the-art translation quality when a large quantity of parallel corpora is available. However, their performance suffers significantly when it comes to domain-specific translations, in which training data are usually scarce. In this paper, we present a novel NMT model with a new word embedding transition technique for fast domain adaptation. We propose to split the parameters in the model into two groups: model parameters and meta parameters. The former are used to model the translation while the latter are used to adjust the representational space to generalize the model to different domains. We mimic the domain adaptation of the machine translation model to low-resource domains using multiple translation tasks on different domains. A new training strategy based on meta-learning is developed along with the proposed model to update the model parameters and meta parameters alternately. Experiments on datasets of different domains showed substantial improvements in NMT performance on a limited amount of data.


Lexical-Constraint-Aware Neural Machine Translation via Data Augmentation

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/496 ◽

2020 ◽

Author(s):

Guanhua Chen ◽

Yun Chen ◽

Yong Wang ◽

Victor O.K. Li

Keyword(s):

Machine Translation ◽

Data Augmentation ◽

Search Algorithm ◽

Statistical Machine Translation ◽

Training Data ◽

Neural Machine Translation ◽

Training Corpus ◽

Domain Specific ◽

Source Sentence

Leveraging lexical constraints is extremely significant in domain-specific machine translation and interactive machine translation. Previous studies mainly focus on extending the beam search algorithm or augmenting the training corpus by replacing source phrases with the corresponding target translation. These methods either suffer from heavy computation cost during inference or depend on the quality of a bilingual dictionary pre-specified by the user or constructed with statistical machine translation. In response to these problems, we present a conceptually simple and empirically effective data augmentation approach for lexically constrained neural machine translation. Specifically, we make constraint-aware training data by first randomly sampling phrases of the reference as constraints, and then packing them together into the source sentence with a separation symbol. Extensive experiments on several language pairs demonstrate that our approach achieves superior translation results over the existing systems, improving translation of constrained sentences without hurting the unconstrained ones.
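
The augmentation step described in the abstract above (sample reference phrases as constraints, then pack them into the source with a separation symbol) might look roughly like this. The separator token `<sep>`, the span-length limits, and the function names are assumptions made for illustration; the abstract does not specify them.

```python
import random

SEP = "<sep>"  # hypothetical separation symbol; the abstract does not name one

def sample_constraints(reference_tokens, max_constraints=2, max_len=3, rng=None):
    """Randomly pick non-overlapping contiguous phrases from the reference."""
    rng = rng or random.Random()
    n = len(reference_tokens)
    taken = [False] * n
    picked = []
    for _ in range(max_constraints):
        if n == 0:
            break
        start = rng.randrange(n)
        end = min(n, start + rng.randint(1, max_len))
        if any(taken[start:end]):
            continue  # skip spans that overlap an earlier constraint
        for i in range(start, end):
            taken[i] = True
        picked.append((start, reference_tokens[start:end]))
    picked.sort(key=lambda p: p[0])  # keep constraints in reference order
    return [phrase for _, phrase in picked]

def augment(source_tokens, reference_tokens, rng=None, **kwargs):
    """Return the constraint-aware source and the sampled constraints."""
    constraints = sample_constraints(reference_tokens, rng=rng, **kwargs)
    packed = list(source_tokens)
    for phrase in constraints:
        packed.append(SEP)
        packed.extend(phrase)
    return packed, constraints

src = "das ist ein kleiner test".split()
ref = "this is a small test".split()
packed, constraints = augment(src, ref, rng=random.Random(0))
print(" ".join(packed))
```

Because the constraints are drawn from the reference itself, no bilingual dictionary is needed at training time, and the model can learn to copy whatever follows the separator into its output.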


A survey of domain adaptation for statistical machine translation

Machine Translation ◽

10.1007/s10590-018-9216-8 ◽

2017 ◽

Vol 31 (4) ◽

pp. 187-224

Author(s):

Hoang Cuong ◽

Khalil Sima’an

Keyword(s):

Machine Translation ◽

Domain Adaptation ◽

Statistical Machine Translation


An intelligent fault diagnosis method based on domain adaptation for rolling bearings under variable load conditions

Proceedings of the Institution of Mechanical Engineers Part C Journal of Mechanical Engineering Science ◽

10.1177/09544062211032995 ◽

2021 ◽

pp. 095440622110329

Author(s):

Jianqun Zhang ◽

Qing Zhang ◽

Xianrong Qin ◽

Yuantao Sun

Keyword(s):

Feature Extraction ◽

Fault Diagnosis ◽

Domain Adaptation ◽

Rolling Bearing ◽

Training Data ◽

Variable Load ◽

K Nearest Neighbor ◽

Target Domain ◽

Bearing Faults ◽

Load Conditions

To identify rolling bearing faults under variable load conditions, a method named DISA-KNN is proposed in this paper, which follows a feature extraction, domain adaptation, classification strategy. Specifically, time-domain and frequency-domain indicators are used for feature extraction. Discriminative and domain invariant subspace alignment (DISA) is used to minimize the discrepancy between the data distributions of the training data (source domain) and testing data (target domain). K-nearest neighbor (KNN) is applied to identify rolling bearing faults. DISA-KNN is validated on experimental signals collected under different load conditions. The identification accuracies obtained by the DISA-KNN method are more than 90% on four datasets, including one dataset with 99.5% accuracy. The strength of the proposed method is further highlighted by comparisons with 8 other methods. These results reveal that the proposed method is promising for rolling bearing fault diagnosis in real rotating machinery.
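
The first and last stages of the pipeline in the abstract above can be sketched as below. The choice of indicators (RMS, peak, kurtosis, crest factor) is an assumption, frequency-domain indicators are omitted, and the DISA alignment step is not reproduced; in the paper's pipeline it would sit between feature extraction and classification. All function names are hypothetical.

```python
import math
from collections import Counter

def time_domain_indicators(signal):
    """Return [rms, peak, kurtosis, crest factor] for one vibration segment."""
    n = len(signal)
    mean = sum(signal) / n
    rms = math.sqrt(sum(x * x for x in signal) / n)
    peak = max(abs(x) for x in signal)
    var = sum((x - mean) ** 2 for x in signal) / n
    kurtosis = (sum((x - mean) ** 4 for x in signal) / n) / (var ** 2) if var else 0.0
    crest = peak / rms if rms else 0.0
    return [rms, peak, kurtosis, crest]

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy feature vectors standing in for source-domain training data.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["normal", "normal", "normal", "fault", "fault", "fault"]
print(knn_predict(train_x, train_y, [0.2, 0.2]))
print(time_domain_indicators([1, -1, 1, -1]))
```

Without an alignment step, a KNN trained on one load condition is sensitive to the feature shift between loads, which is the gap DISA is meant to close.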


Paraphrasing Training Data for Statistical Machine Translation

Journal of Natural Language Processing ◽

10.5715/jnlp.17.3_101 ◽

2010 ◽

Vol 17 (3) ◽

pp. 101-122 ◽

Cited By ~ 2

Author(s):

Eric Nichols ◽

Francis Bond ◽

D. Scott Appling ◽

Yuji Matsumoto

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Training Data


An Iterative Multi-Source Mutual Knowledge Transfer Framework for Machine Reading Comprehension

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/525 ◽

2020 ◽

Cited By ~ 1

Author(s):

Xin Liu ◽

Kai Liu ◽

Xiang Li ◽

Jinsong Su ◽

Yubin Ge ◽

...

Keyword(s):

Reading Comprehension ◽

Knowledge Transfer ◽

Training Data ◽

Target Domain ◽

Domain Specific ◽

Mutual Knowledge ◽

Benchmark Datasets ◽

Knowledge Distillation ◽

The Many ◽

Machine Reading

The lack of sufficient training data in many domains poses a major challenge to the construction of domain-specific machine reading comprehension (MRC) models with satisfying performance. In this paper, we propose a novel iterative multi-source mutual knowledge transfer framework for MRC. As an extension of the conventional knowledge transfer with one-to-one correspondence, our framework focuses on the many-to-many mutual transfer, which involves synchronous executions of multiple many-to-one transfers in an iterative manner. Specifically, to update a target-domain MRC model, we first consider other domain-specific MRC models as individual teachers, and employ knowledge distillation to train a multi-domain MRC model, which is differentially required to fit the training data and match the outputs of these individual models according to their domain-level similarities to the target domain. After being initialized by the multi-domain MRC model, the target-domain MRC model is fine-tuned to match both its training data and the output of its previous best model simultaneously via knowledge distillation. Compared with previous approaches, our framework can continuously enhance all domain-specific MRC models by enabling each model to iteratively and differentially absorb the domain-shared knowledge from others. Experimental results and in-depth analyses on several benchmark datasets demonstrate the effectiveness of our framework.
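
The idea of fitting the training data while matching several teachers "according to their domain-level similarities" can be illustrated with a similarity-weighted distillation loss. This is a generic sketch, not the loss defined in the paper: the mixing weight `alpha`, the temperature, the normalization of similarities, and the function names are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multi_teacher_loss(student_logits, gold_index, teacher_logits,
                       similarities, alpha=0.5, temperature=2.0):
    """Mix the gold-label loss with similarity-weighted teacher matching."""
    hard = -math.log(softmax(student_logits)[gold_index])
    student_soft = softmax(student_logits, temperature)
    norm = sum(similarities)
    soft = sum(
        (w / norm) * kl_divergence(softmax(t, temperature), student_soft)
        for w, t in zip(similarities, teacher_logits)
    )
    return (1 - alpha) * hard + alpha * soft

student = [2.0, 0.5, -1.0]
teachers = [[1.5, 0.7, -0.5], [-1.0, 0.5, 2.0]]
similarities = [0.7, 0.3]  # hypothetical domain-level similarities to the target
print(multi_teacher_loss(student, 0, teachers, similarities))
```

Teachers from domains closer to the target contribute more to the soft term, so a dissimilar teacher cannot pull the student far from the gold labels.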
