Parse tree-based machine translation for less-used languages

Jernej Vičič; Andrej Brodnik

doi:10.51936/tulf1279

Word Sense Based Hindi-Tamil Statistical Machine Translation

International Journal of Intelligent Information Technologies ◽

10.4018/ijiit.2018010102 ◽

2018 ◽

Vol 14 (1) ◽

pp. 17-27

Author(s):

Vimal Kumar K. ◽

Divakar Yadav

Keyword(s):

Machine Translation ◽

Language Processing ◽

Semantic Analysis ◽

Target Language ◽

Translation System ◽

Great Success ◽

Source Language ◽

Additional Information ◽

Part Of Speech ◽

Preceding Word

Corpus based natural language processing has emerged with great success in recent years. It is not only used for languages like English, French, Spanish, and Hindi but also is widely used for languages like Tamil, Telugu etc. This paper focuses to increase the accuracy of machine translation from Hindi to Tamil by considering the word's sense as well as its part-of-speech. This system works on word by word translation from Hindi to Tamil language which makes use of additional information such as the preceding words, the current word's part of speech and the word's sense itself. For such a translation system, the frequency of words occurring in the corpus, the tagging of the input words and the probability of the preceding word of the tagged words are required. Wordnet is used to identify various synonym for the words specified in the source language. Among these words, the one which is more relevant to the word specified in source language is considered for the translation to target language. The introduction of the additional information such as part-of-speech tag, preceding word information and semantic analysis has greatly improved the accuracy of the system.

Download Full-text

Machine and web translator for English to Bangla using natural language processes

Daffodil International University Journal of Science and Technology ◽

10.3329/diujst.v5i1.4382 ◽

1970 ◽

Vol 5 (1) ◽

pp. 53-61

Author(s):

Mohammad Hasibul Haque ◽

Md Fokhray Hossain ◽

ANM Fauzul Hossain

Keyword(s):

Natural Language ◽

Language Processing ◽

Parse Tree ◽

Translation System ◽

Web Pages ◽

Natural Languages ◽

Pos Tagging ◽

Web Contents ◽

Massive Number ◽

Translation Methods

The modern web contents are mostly written in English and developing a system with the facility of translating web pages from English to Bangla that can aid the massive number of people of Bangladesh. It is very important to introduce Natural Language Processing (NLP) and is required to developing a solution of web translator. It is a technique that deals with understanding natural languages and natural language generation. It is really a challenging job to building a Web Translator with 100% efficiency and our proposed Web Translator basically uses Machine Translator as its mother concern. This paper represents an optimal way for English to Bangla machine and the Web translation & translation methods are used by translator. Naturally there are three stages for MT but here we propose a translation system which includes 4 stages, such as, POS tagging, Generating parse tree, Transfer English parse tree to Bengali parse tree and Translate English to Bangla and apply AI. An innovation initiative has scope of being upgraded in future and hopefully this work will assist to develop more improved English to Bangla Web Translator. Keywords: Machine Translator, Web Translator, POS Tagging, Parsing, HTML Parsing, Verb Mapping DOI: 10.3329/diujst.v5i1.4382 Daffodil International University Journal of Science and Technology Vol.5(1) 2010 pp.53-61

Download Full-text

A Knowledge-Based Machine Translation Using AI Technique

International Journal of Software Innovation ◽

10.4018/ijsi.2018070106 ◽

2018 ◽

Vol 6 (3) ◽

pp. 79-92

Author(s):

Sahar A. El-Rahman ◽

Tarek A. El-Shishtawy ◽

Raafat A. El-Kammar

Keyword(s):

Machine Translation ◽

Target Language ◽

Translation System ◽

Module Structure ◽

Source Language ◽

Word Category ◽

Knowledge Based ◽

Language Analysis ◽

Transfer Rules

This article presents a realistic technique for the machine aided translation system. In this technique, the system dictionary is partitioned into a multi-module structure for fast retrieval of Arabic features of English words. Each module is accessed through an interface that includes the necessary morphological rules, which directs the search toward the proper sub-dictionary. Another factor that aids fast retrieval of Arabic features of words is the prediction of the word category, and accesses its sub-dictionary to retrieve the corresponding attributes. The system consists of three main parts, which are the source language analysis, the transfer rules between source language (English) and target language (Arabic), and the generation of the target language. The proposed system is able to translate, some negative forms, demonstrations, and conjunctions, and also adjust nouns, verbs, and adjectives according their attributes. Then, it adds the symptom of Arabic words to generate a correct sentence.

Download Full-text

Two approaches to compilation of bilingual multi-word terminology lists from lexical resources

Natural Language Engineering ◽

10.1017/s1351324919000615 ◽

2020 ◽

Vol 26 (4) ◽

pp. 455-479

Author(s):

Branislava Šandrih ◽

Cvetana Krstev ◽

Ranka Stanković

Keyword(s):

Information Science ◽

Target Language ◽

Support Vector ◽

Word Alignment ◽

Lexical Resources ◽

Terminology Extraction ◽

Source Language ◽

Term Extraction ◽

Two Parameters ◽

Shallow Parser

AbstractIn this paper, we present two approaches and the implemented system for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second one uses a term extraction tool. For both approaches, four experiments were performed with two parameters being varied. In the experiments presented in this paper, the source language was English, and the target language Serbian, and a selected domain was Library and Information Science, for which an aligned corpus exists, as well as a bilingual terminological dictionary. For term extraction, we used the FlexiTerm tool for the source language and a shallow parser for the target language, while for word alignment we used GIZA++. The evaluation results show that for the first approach the F1 score varies from 29.43% to 51.15%, while for the second it varies from 61.03% to 71.03%. On the basis of the evaluation results, we developed a binary classifier that decides whether a candidate pair, composed of aligned source and target terms, is valid. We trained and evaluated different classifiers on a list of manually labeled candidate pairs obtained after the implementation of our extraction system. The best results in a fivefold cross-validation setting were achieved with the Radial Basis Function Support Vector Machine classifier, giving a F1 score of 82.09% and accuracy of 78.49%.

Download Full-text

Large-scale Word Alignment Using Soft Dependency Cohesion Constraints

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00228 ◽

2013 ◽

Vol 1 ◽

pp. 291-300 ◽

Cited By ~ 1

Author(s):

Zhiguo Wang ◽

Chengqing Zong

Keyword(s):

Large Scale ◽

Target Language ◽

Model Parameters ◽

Word Alignment ◽

Soft Constraint ◽

Alignment Quality ◽

Source Language ◽

Discriminative Models ◽

Translation Quality ◽

Gibbs Sampling Algorithm

Dependency cohesion refers to the observation that phrases dominated by disjoint dependency subtrees in the source language generally do not overlap in the target language. It has been verified to be a useful constraint for word alignment. However, previous work either treats this as a hard constraint or uses it as a feature in discriminative models, which is ineffective for large-scale tasks. In this paper, we take dependency cohesion as a soft constraint, and integrate it into a generative model for large-scale word alignment experiments. We also propose an approximate EM algorithm and a Gibbs sampling algorithm to estimate model parameters in an unsupervised manner. Experiments on large-scale Chinese-English translation tasks demonstrate that our model achieves improvements in both alignment quality and translation quality.

Download Full-text

Maintaining Sentiment Polarity in Translation of User-Generated Content

Prague Bulletin of Mathematical Linguistics ◽

10.1515/pralin-2017-0010 ◽

2017 ◽

Vol 108 (1) ◽

pp. 73-84 ◽

Cited By ~ 2

Author(s):

Pintu Lohar ◽

Haithem Afli ◽

Andy Way

Keyword(s):

Empirical Evaluation ◽

Sentiment Classification ◽

Target Language ◽

User Generated Content ◽

Translation Process ◽

Source Language ◽

Translation Quality ◽

Per Se ◽

Positive Sentiment ◽

Share Information

Abstract The advent of social media has shaken the very foundations of how we share information, with Twitter, Facebook, and Linkedin among many well-known social networking platforms that facilitate information generation and distribution. However, the maximum 140-character restriction in Twitter encourages users to (sometimes deliberately) write somewhat informally in most cases. As a result, machine translation (MT) of user-generated content (UGC) becomes much more difficult for such noisy texts. In addition to translation quality being affected, this phenomenon may also negatively impact sentiment preservation in the translation process. That is, a sentence with positive sentiment in the source language may be translated into a sentence with negative or neutral sentiment in the target language. In this paper, we analyse both sentiment preservation and MT quality per se in the context of UGC, focusing especially on whether sentiment classification helps improve sentiment preservation in MT of UGC. We build four different experimental setups for tweet translation (i) using a single MT model trained on the whole Twitter parallel corpus, (ii) using multiple MT models based on sentiment classification, (iii) using MT models including additional out-of-domain data, and (iv) adding MT models based on the phrase-table fill-up method to accompany the sentiment translation models with an aim of improving MT quality and at the same time maintaining sentiment polarity preservation. Our empirical evaluation shows that despite a slight deterioration in MT quality, our system significantly outperforms the Baseline MT system (without using sentiment classification) in terms of sentiment preservation. We also demonstrate that using an MT engine that conveys a sentiment different from that of the UGC can even worsen both the translation quality and sentiment preservation.

Download Full-text

Fachkommunikation, Popularisierung, Übersetzung: Empirische Vergleiche am Beispiel der Nominalphrase im Englischen und Deutschen

Linguistik Online ◽

10.13092/lo.39.480 ◽

2009 ◽

Vol 39 (3) ◽

Author(s):

Silvia Hansen-Schirra ◽

Sandra Hansen ◽

Sascha Wolfer ◽

Lars Konieczny

Keyword(s):

Language Contact ◽

Noun Phrases ◽

Target Language ◽

Structure Information ◽

Comparable Corpora ◽

Source Language ◽

Part Of Speech ◽

The One ◽

Insight Into ◽

Empirical Insight

This article examines the contrasts and commonalities between languages for specific purposes (LSP) and their popularizations on the one hand and the frequency patterns of LSP register features in English and German on the other. For this purpose corpora of expert-expert and expert-lay communication are annotated for part-of-speech and phrase structure information. On this basis, the frequencies of pre- and post-modifications in complex noun phrases are statistically investigated and compared for English and German. Moreover, using parallel and comparable corpora it is tested whether English-German translations obey the register norms of the target language or whether the LSP frequency patterns of the source language Ñshine throughì. The results provide an empirical insight into language contact phenomena involving specialized communication.

Download Full-text

A Bicolano-to-Tagalog Transfer-Based Machine Translation System

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1062.0882s819 ◽

2019 ◽

Vol 8 (2S8) ◽

pp. 1324-1330

Keyword(s):

Machine Translation ◽

Syntactic Structure ◽

Internal Representation ◽

Target Language ◽

Translation System ◽

Source Language ◽

Three Phase ◽

Transfer Rules ◽

Machine Translation System ◽

Overall Performance

The Bicolano-Tagalog Transfer-based Machine Translation System is a unidirectional machine translator for languages Bicolano and Tagalog. The transfer-based approach is divided into three phase: Pre-Processing Analysis, Morphological Transfer, and Sentence Generation. The system analyze first the source language (Bicolano) input to create some internal representation. This includes the tokenizer, stemmer, POS tag and parser. Through transfer rules, it then typically manipulates this internal representation to transfer parsed source language syntactic structure into target language syntactic structure. Finally, the system generates Tagalog sentence from own morphological and syntactic information. Each phase will undergo training and evaluation test for the competence of end-results. Overall performance shows a 71.71% accuracy rate.

Download Full-text

Word Sense Based Hindi-Tamil Statistical Machine Translation

Natural Language Processing ◽

10.4018/978-1-7998-0951-7.ch021 ◽

2020 ◽

pp. 410-421

Author(s):

Vimal Kumar K. ◽

Divakar Yadav

Keyword(s):

Machine Translation ◽

Language Processing ◽

Semantic Analysis ◽

Translation System ◽

Great Success ◽

Source Language ◽

Additional Information ◽

Part Of Speech ◽

Preceding Word ◽

The One

Corpus based natural language processing has emerged with great success in recent years. It is not only used for languages like English, French, Spanish, and Hindi but also is widely used for languages like Tamil, Telugu etc. This paper focuses to increase the accuracy of machine translation from Hindi to Tamil by considering the word's sense as well as its part-of-speech. This system works on word by word translation from Hindi to Tamil language which makes use of additional information such as the preceding words, the current word's part of speech and the word's sense itself. For such a translation system, the frequency of words occurring in the corpus, the tagging of the input words and the probability of the preceding word of the tagged words are required. Wordnet is used to identify various synonym for the words specified in the source language. Among these words, the one which is more relevant to the word specified in source language is considered for the translation to target language. The introduction of the additional information such as part-of-speech tag, preceding word information and semantic analysis has greatly improved the accuracy of the system.

Download Full-text

Source Language Adaptation Approaches for Resource-Poor Machine Translation

Computational Linguistics ◽

10.1162/coli_a_00248 ◽

2016 ◽

Vol 42 (2) ◽

pp. 277-306 ◽

Cited By ~ 8

Author(s):

Pidong Wang ◽

Preslav Nakov ◽

Hwee Tou Ng

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Target Language ◽

Source Language ◽

World Languages ◽

Word Level ◽

Resource Poor ◽

Morphological Variants ◽

Cross Lingual ◽

Translation Systems

Most of the world languages are resource-poor for statistical machine translation; still, many of them are actually related to some resource-rich language. Thus, we propose three novel, language-independent approaches to source language adaptation for resource-poor statistical machine translation. Specifically, we build improved statistical machine translation models from a resource-poor language POOR into a target language TGT by adapting and using a large bitext for a related resource-rich language RICH and the same target language TGT. We assume a small POOR–TGT bitext from which we learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and the resource-poor language. Our work is of importance for resource-poor machine translation because it can provide a useful guideline for people building machine translation systems for resource-poor languages. Our experiments for Indonesian/Malay–English translation show that using the large adapted resource-rich bitext yields 7.26 BLEU points of improvement over the unadapted one and 3.09 BLEU points over the original small bitext. Moreover, combining the small POOR–TGT bitext with the adapted bitext outperforms the corresponding combinations with the unadapted bitext by 1.93–3.25 BLEU points. We also demonstrate the applicability of our approaches to other languages and domains.

Download Full-text