Dutch Parallel Corpus: A Balanced Copyright-Cleared Parallel Corpus

Meta Journal des traducteurs ◽

10.7202/1006182ar ◽

2011 ◽

Vol 56 (2) ◽

pp. 374-390 ◽

Cited By ~ 23

Author(s):

Lieve Macken ◽

Orphée De Clercq ◽

Hans Paulussen

Keyword(s):

Research Community ◽

Web Interface ◽

Parallel Corpora ◽

Parallel Corpus ◽

Text Type ◽

Language Technology ◽

Text Types ◽

French And English ◽

Class Information ◽

Research Domains

This paper presents the Dutch Parallel Corpus, a high-quality parallel corpus for Dutch, French and English consisting of more than ten million words. The corpus contains five different text types and is balanced with respect to text type and translation direction. All texts included in the corpus have been cleared from copyright. We discuss the importance of parallel corpora in various research domains and contrast the Dutch Parallel Corpus with existing parallel corpora. The Dutch Parallel Corpus distinguishes itself from other parallel corpora by having a balanced composition and by its availability to the wide research community, thanks to its copyright clearance. All texts in the corpus are sentence-aligned and further enriched with basic linguistic annotations (lemmas and word class information). Approximately 25,000 words of the Dutch-English part have been manually aligned at the sub-sentential level. Rich metadata facilitates the navigability of the corpus and enables users to select the texts that satisfy their needs. The entire corpus is released as full texts in XML format and is also available via a web interface, which supports basic and complex search queries and presents the results as parallel concordances. The corpus will be distributed by the Flemish-Dutch Human Language Technology Agency (TST-Centrale).

Domains, text types, aspect marking and English-Chinese translation

Languages in Contrast ◽

10.1075/lic.2.2.05mce ◽

1999 ◽

Vol 2 (2) ◽

pp. 211-229 ◽

Cited By ~ 8

Author(s):

Tony McEnery ◽

Richard Xiao

Keyword(s):

Chinese Translation ◽

Parallel Corpus ◽

Text Type ◽

Text Types ◽

Reference Corpus ◽

Aspect Markers ◽

Chinese Texts

This paper uses an English-Chinese parallel corpus, an L1 Chinese comparable corpus, and an L1 Chinese reference corpus to examine how aspectual meanings in English are translated into Chinese and explore the effects of domains, text types and translation on aspect marking. We will show that while English and Chinese both mark aspect grammatically, the aspect system in the two languages differs considerably. Even though Chinese, as an aspect language, is rich in aspect markers, covert marking (LVM) is a frequent and important strategy in Chinese discourse. The distribution of aspect markers varies significantly across domain and text type. The study also sheds new light on the translation effect by contrasting aspect marking in translated Chinese texts and L1 Chinese texts.

Arabic-English Parallel Corpus: A New Resource for Translation Training and Language Teaching

Arab World English Journal ◽

10.31235/osf.io/rek3w ◽

2017

Author(s):

Hind M. Alotaibi

Keyword(s):

Language Teaching ◽

Data Driven ◽

Text Segmentation ◽

Web Interface ◽

King Saud University ◽

Parallel Corpora ◽

Parallel Corpus ◽

Source Language ◽

User Friendly ◽

Ongoing Project

Parallel corpora can be defined as collections of aligned, translated texts of two or more languages. They play a major role in translation and contrastive studies, and are also becoming popular in translation training and language teaching, with the advent of the data-driven learning (DDL) approach. Despite their significance, however, Arabic seems to lack a satisfactory general-use parallel corpus resource. The literature describes few Arabic–English parallel corpora, and these few are usually inaccurate and/or expensive. Some are small in size, while others are restricted in terms of genre, failing to meet the requirements of many academics and researchers. This paper describes an ongoing project at the College of Languages and Translation, King Saud University, to compile a 10-million-word Arabic–English parallel corpus to be used as a resource for translation training and language teaching. The bidirectional corpus can be used to compare translated and source language and identify differences. The corpus has been manually verified at different stages, including translation, text segmentation, alignment, and file preparation; it is available as full-text in XML format and through a user-friendly web interface that provides a concordancer to support bilingual search queries and several filtering options.

The case of InterCorp, a multilingual parallel corpus

International Journal of Corpus Linguistics ◽

10.1075/ijcl.17.3.05cer ◽

2012 ◽

Vol 17 (3) ◽

pp. 411-427 ◽

Cited By ~ 18

Author(s):

František Čermák ◽

Alexandr Rosen

Keyword(s):

Comparative Research ◽

Native Speakers ◽

Web Interface ◽

Search Interface ◽

Research Opportunities ◽

Parallel Corpus ◽

Linguistic Annotation ◽

Text Types ◽

Online Searches

This paper introduces InterCorp, a parallel corpus including texts in Czech and 27 other languages, available for online searches via a web interface. After discussing some issues and merits of a multilingual resource we argue that it has an important role especially for languages with fewer native speakers, supporting both comparative research and studies of the language from the perspective of other languages. We proceed with an overview of the corpus — the strategy and criteria for including new texts, the representation of available languages and text types, linguistic annotation, and a sketch of pre-processing issues. Finally, we present the search interface and suggest some research opportunities.

Use of monolingual and comparable corpora in the classroom to translate adverbial connectors

Cadernos de Tradução ◽

10.5007/2175-7968.2016v36nesp1p147 ◽

2016 ◽

Vol 36 (1) ◽

pp. 147

Author(s):

Beatriz Sánchez Cárdenas ◽

Pamela Faber

Keyword(s):

Lexical Selection ◽

Parallel Corpora ◽

Text Type ◽

Comparable Corpora ◽

Bilingual Dictionaries ◽

Text Production ◽

Before And After ◽

Semantic Prosody ◽

Semantic Values

Research in terminology has traditionally focused on nouns. Considerably less attention has been paid to other grammatical categories such as adverbs. However, these words can also be problematic for the novice translator, who tends to use the translation correspondences in bilingual dictionaries without realizing that formal equivalence is not necessarily the same as textual equivalence. Semantic values, acquired in context, go far beyond dictionary meaning and are related to phenomena such as semantic prosody and preferences of lexical selection that can vary, depending on text type and specialized domain. This research explored the reasons why certain adverbial discourse connectors, apparently easy to translate, are a source of translation problems that cannot be easily resolved with a bilingual dictionary. Moreover, this study analyzed the use of parallel corpora in the translation classroom and how it can increase the quality of text production. For this purpose, we compared student translations before and after receiving training on the use of corpus analysis tools.

The role of text type and strategy use in L2 lexical inferencing

IRAL - International Review of Applied Linguistics in Language Teaching ◽

10.1515/iral-2015-0054 ◽

2018 ◽

Vol 56 (2) ◽

pp. 231-252

Author(s):

Ming-yueh Shen

Keyword(s):

Strategy Use ◽

Text Structure ◽

Expository Texts ◽

First Year ◽

Efl Learners ◽

Text Type ◽

Lexical Inferencing ◽

Text Types ◽

Quantitative Analyses

This study aimed to determine whether text type and strategy use affect EFL learners’ lexical inferencing performance. The participants were 87 first-year English majors at a technical university. Data were collected from (1) a lexical inferencing test with excerpts of narrative and expository texts, for which both multiple-choice and definition tasks were designed, and (2) the learners’ self-reported strategy use. The quantitative analyses demonstrated that text type significantly affected the EFL learners’ lexical inferencing performance: the learners performed better on the narrative excerpt than on the expository texts. However, no significant coefficients between strategy use and lexical inferencing performance were found in this study. The results further implied that text structure and lexical inferencing strategies should be explicitly taught to EFL learners.

Semantics, contrastive linguistics and parallel corpora

Cognitive Studies | Études cognitives ◽

10.11649/cs.2014.009 ◽

2014 ◽

pp. 85-100

Author(s):

Violetta Koseska

Keyword(s):

Lexical Semantics ◽

Semantic Annotation ◽

Semantic Structure ◽

Automatic Annotation ◽

Parallel Corpora ◽

Parallel Corpus ◽

Linguistic Form ◽

Semantic Categories ◽

Contrastive Linguistics

In view of the ambiguity of the term “semantics”, the author shows the differences between the traditional lexical semantics and the contemporary semantics in the light of various semantic schools. She examines semantics differently in connection with contrastive studies where the description must necessarily go from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement regarding theoretical contrastive studies necessitates construction of a semantic interlanguage, rather than only singling out universal semantic categories expressed with various language means. Such studies can be strongly supported by parallel corpora. However, in order to make them useful for linguists in manual and computer translations, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation, which is unfortunately manual. In the article we focus on semantic annotation concerning time, aspect and quantification of names and predicates in the whole semantic structure of the sentence on the example of the “Polish-Bulgarian-Russian parallel corpus”.

Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation

Computational Intelligence and Neuroscience ◽

10.1155/2021/6682385 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Michael Adjeisah ◽

Guohua Liu ◽

Douglas Omwenga Nyabuga ◽

Richard Nuetey Nortey ◽

Jinling Song

Keyword(s):

Machine Translation ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Similarity Metrics ◽

Mahalanobis Distances ◽

Parallel Corpora ◽

Parallel Corpus ◽

Low Resource ◽

Sentence Level

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain on a low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve the MT performance in such low-resource language pairs, we propose to expand the training data by injecting synthetic-parallel corpus obtained by translating a monolingual corpus from the target language based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair engaging squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse amount of available parallel corpus demonstrate that injecting pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits tremendous developments in BLEU and TER scores.

Using ParaConc to extract bilingual terminology from parallel corpora: A case of English and Ndebele

Literator ◽

10.4102/lit.v37i2.1278 ◽

2016 ◽

Vol 37 (1)

Author(s):

Ketiwe Ndhlovu

Keyword(s):

Media Law ◽

Parallel Corpora ◽

Parallel Corpus ◽

African Languages ◽

Bilingual Dictionary ◽

Bilingual Dictionaries ◽

Key Word ◽

Science And Education ◽

Frequency Feature

The development of African languages into languages of science and technology is dependent on action being taken to promote the use of these languages in specialised fields such as technology, commerce, administration, media, law, science and education among others. One possible way of developing African languages is the compilation of specialised dictionaries (Chabata 2013). This article explores how parallel corpora can be interrogated using a bilingual concordancer (ParaConc) to extract bilingual terminology that can be used to create specialised bilingual dictionaries. An English–Ndebele Parallel Corpus was used as a resource and through ParaConc, an alphabetic list was compiled from which headwords and possible translations were sought. These translations provided possible terms for entry in a bilingual dictionary. The frequency feature and ‘hot words’ tool in ParaConc were used to determine the suitability of terms for inclusion in the dictionary and for identifying possible synonyms, respectively. Since parallel corpora are aligned and data are presented in context (Key Word in Context), it was possible to draw examples showing how headwords are used. Using this approach produced results quickly and accurately, whilst minimising the process of translating terms manually. It was noted that the quality of the dictionary is dependent on the quality of the corpus, hence the need for creating a representative and clean corpus needs to be emphasised. Although technology has multiple benefits in dictionary making, the research underscores the importance of collaboration between lexicographers, translators, subject experts and target communities so that representative dictionaries are created.

The Web as a Parallel Corpus

Computational Linguistics ◽

10.1162/089120103322711578 ◽

2003 ◽

Vol 29 (3) ◽

pp. 349-380 ◽

Cited By ~ 178

Author(s):

Philip Resnik ◽

Noah A. Smith

Keyword(s):

Language Processing ◽

Large Scale ◽

Structural Features ◽

Classification Performance ◽

Internet Archive ◽

Parallel Corpora ◽

Parallel Corpus ◽

Original Algorithm ◽

Parallel Text ◽

The Web

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.

Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics ◽

10.1162/089120105775299168 ◽

2005 ◽

Vol 31 (4) ◽

pp. 477-504 ◽

Cited By ~ 104

Author(s):

Dragos Stefan Munteanu ◽

Daniel Marcu

Keyword(s):

Machine Translation ◽

Statistical Machine Translation ◽

Translation System ◽

Parallel Corpora ◽

Parallel Corpus ◽

Scarce Resources ◽

Parallel Data ◽

Machine Translation System ◽

Novel Method ◽

Arabic And English

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.