Surveying word boundary factor in Chinese-Vietnamese statistical machine translation

2015 ◽  
Vol 18 (2) ◽  
pp. 70-78
Author(s):  
Phuoc Thanh Tran ◽  
Dien Dinh

In isolating languages such as Chinese and Vietnamese, word boundaries are not marked by spaces: a word may consist of one or more spelling units. Whether or not to segment words before training and translation is therefore a question that needs to be considered. In this paper, we survey the effect of the word boundary factor on the translation quality of Chinese-Vietnamese statistical machine translation (SMT). The experimental results of this paper provide a basis for improving word segmentation in future research and thereby increasing machine translation performance. We ran two experiments, word segmentation (WS) and word un-segmentation (WUS), on corpora of 8,000 and 12,000 sentence pairs. Based on the experimental results, we found that the WS and WUS corpora each have their own advantages and defects. We propose integrating the advantages of these two methods in SMT.

2018 ◽  
Vol 6 ◽  
pp. 421-435 ◽  
Author(s):  
Yan Shao ◽  
Christian Hardmeier ◽  
Joakim Nivre

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew, when compared to previous work.
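The framework described above casts segmentation as character-level sequence tagging. A minimal sketch (not the authors' implementation) of the decoding side: each character receives a boundary tag (B = begins a word, I = inside a word), and words are recovered by splitting before every B tag.

```python
def tags_to_words(chars, tags):
    """Reconstruct words from characters and their B/I boundary tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words

# The Chinese sentence "我喜欢你" tagged B B I B segments as 我 / 喜欢 / 你
print(tags_to_words(list("我喜欢你"), ["B", "B", "I", "B"]))
```

In a full system the tag sequence would come from a trained tagger; the point here is only that the tag inventory fully determines the segmentation.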


2016 ◽  
Vol 9 (3) ◽  
pp. 13 ◽  
Author(s):  
Hadis Ghasemi ◽  
Mahmood Hashemian

Both lack of time and the need to translate texts for numerous reasons have brought about an increase in the study of machine translation, a field with a history spanning over 65 years. During the last decades, Google Translate, as a statistical machine translation (SMT) system, has been at the center of attention for supporting 90 languages. Although there are many studies on Google Translate, few researchers have considered Persian-English translation pairs. This study used Keshavarz's (1999) model of error analysis to compare raw English-Persian and Persian-English translations from Google Translate. Based on the criteria presented in the model, 100 systematically selected sentences from an interpreter app called Motarjem Hamrah were translated by Google Translate, then evaluated and tabulated. Analyzing the frequencies of the errors and conducting a chi-square test showed no significant difference between the quality of Google Translate from English to Persian and from Persian to English. In addition, lexicosemantic and active/passive voice errors were the most and least frequent errors, respectively. Directions for future research on improving the system are identified in the paper.


2013 ◽  
Vol 791-793 ◽  
pp. 1622-1625
Author(s):  
Dan Han ◽  
Zhi Han Yu

In this article, we introduce some basic concepts of machine translation. Machine translation means translating a text in one natural language into another by software. Approaches can be divided into two categories: rule-based and corpus-based. IBM's statistical machine translation, Microsoft's multi-language machine translation project, AT&T's voice translation system, and CMU's PANGLOSS system are four typical machine translation systems. Because Chinese sentences are written as continuous strings of characters without word delimiters, Chinese word segmentation is essential. There are three methods of Chinese word segmentation: segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics.
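The string-matching approach mentioned above is commonly realized as forward maximum matching (FMM). A minimal sketch, with a purely illustrative toy dictionary: at each position, the longest dictionary entry that matches is taken as the next word, and unknown characters fall back to single-character words.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Segment text greedily, preferring the longest dictionary match."""
    words, i = [], 0
    while i < len(text):
        # Try candidate lengths from longest to shortest.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)   # accept match or 1-char fallback
                i += length
                break
    return words

toy_dict = {"机器", "翻译", "机器翻译", "系统"}
print(fmm_segment("机器翻译系统", toy_dict))  # ['机器翻译', '系统']
```

The greedy longest-match heuristic is simple and fast, but it cannot resolve genuine segmentation ambiguities, which is why the understanding-based and statistics-based methods exist alongside it.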


2012 ◽  
Vol 7 ◽  
Author(s):  
Annette Rios ◽  
Anne Göhring ◽  
Martin Volk

Parallel treebanking is greatly facilitated by automatic word alignment. We work on building a trilingual treebank for German, Spanish and Quechua. We ran different alignment experiments on parallel Spanish-Quechua texts, measured the alignment quality, and compared these results to the figures we obtained aligning a comparable corpus of Spanish-German texts. This preliminary work has shown us the best word segmentation to use for the agglutinative language Quechua with respect to alignment. We also acquired a first impression about how well Quechua can be aligned to Spanish, an important prerequisite for bilingual lexicon extraction, parallel treebanking or statistical machine translation.


2013 ◽  
Vol 427-429 ◽  
pp. 1841-1844
Author(s):  
Wen Xiong ◽  
Yao Hong Jin ◽  
Zhi Ying Liu

By studying the Chinese number and quantifier prefix (CNQP), a special language phenomenon in machine translation, this paper presents a rule-based CNQP recognition method that is independent of word segmentation. The method expresses the composition of CNQPs in Backus-Naur Form (BNF), taking the numeral as the active information and the quantifier as the boundary of the CNQP. To avoid word segmentation noise, a forward maximum matching method is used to obtain the compositions of the CNQPs, which can then be fed into a statistical parser for the analysis of Chinese sentences. The experimental results indicate that the proposed method, used as a pre-processing module, can effectively improve the output of the statistical parser without retraining on manually constructed experimental data, which can further enhance translation quality.
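A hedged sketch of the pattern the abstract describes, a numeral span followed by a quantifier that marks its right boundary, here approximated with a regular expression rather than the paper's BNF rules. The numeral and quantifier inventories below are illustrative assumptions, not the paper's rule set.

```python
import re

# Illustrative inventories (assumed, not from the paper).
NUMERALS = "0123456789一二三四五六七八九十百千万亿两"
QUANTIFIERS = ["个", "本", "只", "张", "条"]

CNQP = re.compile(
    "[" + NUMERALS + "]+"                  # numeral span: the active information
    + "(?:" + "|".join(QUANTIFIERS) + ")"  # quantifier: the right boundary
)

# "三本" (three [volumes]) and "25个" (25 [items]) are recognized as CNQPs.
print(CNQP.findall("我买了三本书和25个苹果"))  # ['三本', '25个']
```

The paper's BNF rules would cover far more constructions (compound numerals, approximations, ordinal markers), but the structure is the same: match the numeral, then stop at the quantifier.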


2012 ◽  
Vol 98 (1) ◽  
pp. 37-50
Author(s):  
Matthias Huck ◽  
Jan-Thorsten Peter ◽  
Markus Freitag ◽  
Stephan Peitz ◽  
Hermann Ney

Hierarchical Phrase-Based Translation with Jane 2

In this paper, we give a survey of several recent extensions to hierarchical phrase-based machine translation that have been implemented in version 2 of Jane, RWTH's open-source statistical machine translation toolkit. We focus on the following techniques: insertion and deletion models, lexical scoring variants, reordering extensions with non-lexicalized reordering rules and with a discriminative lexicalized reordering model, and soft string-to-dependency hierarchical machine translation. We describe the fundamentals of each of these techniques and present experimental results obtained with Jane 2 to confirm their usefulness in state-of-the-art hierarchical phrase-based translation (HPBT).

