Surveying word boundary factor in Chinese-Vietnamese statistical machine translation

2015 ◽  
Vol 18 (2) ◽  
pp. 70-78
Author(s):  
Phuoc Thanh Tran ◽  
Dien Dinh

In isolating languages such as Chinese and Vietnamese, word boundaries are not marked by spaces: a word may consist of one or more spelling units. Whether or not to segment words before training and translation is therefore a question that needs to be considered. In this paper, we survey the effect of the word boundary factor on the translation quality of Chinese-Vietnamese statistical machine translation (SMT). The experimental results of this paper provide a basis for improving word segmentation in future research and thereby increasing machine translation performance. We ran two experiments, word segmentation (WS) and word un-segmentation (WUS), on corpora of 8,000 and 12,000 sentence pairs. Based on the experimental results, we found that the WS and WUS corpora each have their own advantages and defects. We propose integrating the advantages of these two methods in SMT.

2018 ◽  
Vol 6 ◽  
pp. 421-435 ◽  
Author(s):  
Yan Shao ◽  
Christian Hardmeier ◽  
Joakim Nivre

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to word boundary markers and negatively to the number of unique non-segmental terms. Based on the analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. It performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew, when compared to previous work.
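The framework described above casts segmentation as character-level sequence tagging. A minimal sketch (not the authors' implementation) of the decoding side: each character receives a boundary tag (B = begins a word, I = inside a word), and words are recovered by splitting before every B tag.

```python
def tags_to_words(chars, tags):
    """Reconstruct words from characters and their B/I boundary tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words

# The Chinese sentence "我喜欢你" tagged B B I B segments as 我 / 喜欢 / 你
print(tags_to_words(list("我喜欢你"), ["B", "B", "I", "B"]))
```

In a full system the tag sequence would come from a trained tagger; the point here is only that the tag inventory fully determines the segmentation.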


2016 ◽  
Vol 9 (3) ◽  
pp. 13 ◽  
Author(s):  
Hadis Ghasemi ◽  
Mahmood Hashemian

Both lack of time and the need to translate texts for numerous reasons have brought about an increase in the study of machine translation, a field with a history spanning over 65 years. During the last decades, Google Translate, as a statistical machine translation (SMT) system, has been at the center of attention for supporting 90 languages. Although there are many studies on Google Translate, few researchers have considered Persian-English translation pairs. This study used Keshavarz's (1999) model of error analysis to compare raw English-Persian and Persian-English translations from Google Translate. Based on the criteria presented in the model, 100 systematically selected sentences from an interpreter app called Motarjem Hamrah were translated by Google Translate, then evaluated and tabulated. Analyzing the frequencies of the errors and conducting a chi-square test showed no significant difference between the quality of Google Translate from English to Persian and from Persian to English. In addition, lexicosemantic and active/passive voice errors were the most and least frequent errors, respectively. Directions for future research on improving the system are identified in the paper.


2013 ◽  
Vol 791-793 ◽  
pp. 1622-1625
Author(s):  
Dan Han ◽  
Zhi Han Yu

In this article, we introduce some basic concepts of machine translation. Machine translation means translating a text in one natural language into another by software. Approaches can be divided into two categories: rule-based and corpus-based. IBM's statistical machine translation, Microsoft's multi-language machine translation project, AT&T's voice translation system, and CMU's PANGLOSS system are four typical machine translation systems. Because Chinese sentences are written as continuous strings of characters without word delimiters, Chinese word segmentation is essential. There are three methods of Chinese word segmentation: segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics.
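The string-matching approach mentioned above is commonly realized as forward maximum matching (FMM). A minimal sketch, with a purely illustrative toy dictionary: at each position, the longest dictionary entry that matches is taken as the next word, and unknown characters fall back to single-character words.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Segment text greedily, preferring the longest dictionary match."""
    words, i = [], 0
    while i < len(text):
        # Try candidate lengths from longest to shortest.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)   # accept match or 1-char fallback
                i += length
                break
    return words

toy_dict = {"机器", "翻译", "机器翻译", "系统"}
print(fmm_segment("机器翻译系统", toy_dict))  # ['机器翻译', '系统']
```

The greedy longest-match heuristic is simple and fast, but it cannot resolve genuine segmentation ambiguities, which is why the understanding-based and statistics-based methods exist alongside it.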


2012 ◽  
Vol 7 ◽  
Author(s):  
Annette Rios ◽  
Anne Göhring ◽  
Martin Volk

Parallel treebanking is greatly facilitated by automatic word alignment. We work on building a trilingual treebank for German, Spanish and Quechua. We ran different alignment experiments on parallel Spanish-Quechua texts, measured the alignment quality, and compared these results to the figures we obtained aligning a comparable corpus of Spanish-German texts. This preliminary work has shown us the best word segmentation to use for the agglutinative language Quechua with respect to alignment. We also acquired a first impression about how well Quechua can be aligned to Spanish, an important prerequisite for bilingual lexicon extraction, parallel treebanking or statistical machine translation.


2013 ◽  
Vol 427-429 ◽  
pp. 1841-1844
Author(s):  
Wen Xiong ◽  
Yao Hong Jin ◽  
Zhi Ying Liu

By studying the Chinese number and quantifier prefix (CNQP), a special language phenomenon in machine translation, this paper presents a rule-based CNQP recognition method that is independent of word segmentation. The method expresses the composition of CNQPs in Backus-Naur Form (BNF), taking the numeral as the active information and the quantifier as the boundary of the CNQP. To avoid word segmentation noise, a forward maximum matching method is used to obtain the compositions of the CNQPs, which can then be fed into a statistical parser for the analysis of Chinese sentences. The experimental results indicate that the proposed method, used as a pre-processing module, can effectively improve the output of the statistical parser without retraining on manually constructed experimental data, which can further enhance translation quality.
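A hedged sketch of the pattern the abstract describes, a numeral span followed by a quantifier that marks its right boundary, here approximated with a regular expression rather than the paper's BNF rules. The numeral and quantifier inventories below are illustrative assumptions, not the paper's rule set.

```python
import re

# Illustrative inventories (assumed, not from the paper).
NUMERALS = "0123456789一二三四五六七八九十百千万亿两"
QUANTIFIERS = ["个", "本", "只", "张", "条"]

CNQP = re.compile(
    "[" + NUMERALS + "]+"                  # numeral span: the active information
    + "(?:" + "|".join(QUANTIFIERS) + ")"  # quantifier: the right boundary
)

# "三本" (three [volumes]) and "25个" (25 [items]) are recognized as CNQPs.
print(CNQP.findall("我买了三本书和25个苹果"))  # ['三本', '25个']
```

The paper's BNF rules would cover far more constructions (compound numerals, approximations, ordinal markers), but the structure is the same: match the numeral, then stop at the quantifier.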


2012 ◽  
Vol 98 (1) ◽  
pp. 37-50
Author(s):  
Matthias Huck ◽  
Jan-Thorsten Peter ◽  
Markus Freitag ◽  
Stephan Peitz ◽  
Hermann Ney

Hierarchical Phrase-Based Translation with Jane 2

In this paper, we give a survey of several recent extensions to hierarchical phrase-based machine translation that have been implemented in version 2 of Jane, RWTH's open-source statistical machine translation toolkit. We focus on the following techniques: insertion and deletion models, lexical scoring variants, reordering extensions with non-lexicalized reordering rules and with a discriminative lexicalized reordering model, and soft string-to-dependency hierarchical machine translation. We describe the fundamentals of each of these techniques and present experimental results obtained with Jane 2 to confirm their usefulness in state-of-the-art hierarchical phrase-based translation (HPBT).

