A scalable approach to building a parallel corpus from the web

Building an Italian-Chinese Parallel Corpus for Machine Translation from the Web

Proceedings of the 6th EAI International Conference on Smart Objects and Technologies for Social Good ◽

10.1145/3411170.3411258 ◽

2020 ◽

Author(s):

Rita Tse ◽

Silvia Mirri ◽

Su-Kit Tang ◽

Giovanni Pau ◽

Paola Salomoni

Keyword(s):

Machine Translation ◽

Parallel Corpus ◽

The Web

Constructing a Large-Scale English-Persian Parallel Corpus

Meta: Journal des traducteurs ◽

10.7202/029804ar ◽

2009 ◽

Vol 54 (1) ◽

pp. 181-188 ◽

Cited By ~ 10

Author(s):

Tayebeh Mosavi Miangah

Keyword(s):

Large Scale ◽

Target Language ◽

Translation Memory ◽

Web Documents ◽

Parallel Corpus ◽

Translation Quality ◽

Text Corpora ◽

Develop Software ◽

General Translation ◽

The Web

In recent years, the exploitation of large text corpora to solve various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to build an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user enters a search string in one language and sees all citations of that string in that language together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
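
The parallel concordancing described in this abstract can be illustrated with a minimal sketch. It assumes the corpus is already sentence-aligned and held in memory as (source, target) pairs; the toy sentences, the `PAIRS` list, and the `concordance` function are hypothetical names invented for illustration and are not taken from the paper or its software.

```python
from typing import List, Tuple

SentencePair = Tuple[str, str]

# Toy, hand-made aligned pairs (assumption: not data from the corpus described above).
PAIRS: List[SentencePair] = [
    ("This is a book.", "این یک کتاب است."),
    ("The book is on the table.", "کتاب روی میز است."),
    ("She is in Tehran.", "او در تهران است."),
]


def concordance(query: str, pairs: List[SentencePair]) -> List[SentencePair]:
    """Return every aligned pair whose source sentence contains the query.

    Matching is a plain case-insensitive substring test; a real concordancer
    would typically tokenize and index the corpus instead of scanning it.
    """
    needle = query.casefold()
    return [(src, tgt) for src, tgt in pairs if needle in src.casefold()]


if __name__ == "__main__":
    # Show each source-language hit next to its aligned target sentence.
    for src, tgt in concordance("book", PAIRS):
        print(f"{src}\t{tgt}")
```

A linear scan is enough to show the idea; for a large corpus, an inverted index over source-language tokens would likely be the more practical choice.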

Automatic Acquisition of Chinese–English Parallel Corpus from the Web

Lecture Notes in Computer Science - Advances in Information Retrieval ◽

10.1007/11735106_37 ◽

2006 ◽

pp. 420-431 ◽

Cited By ~ 26

Author(s):

Ying Zhang ◽

Ke Wu ◽

Jianfeng Gao ◽

Phil Vines

Keyword(s):

Parallel Corpus ◽

Automatic Acquisition ◽

The Web

The Web as a Parallel Corpus

Computational Linguistics ◽

10.1162/089120103322711578 ◽

2003 ◽

Vol 29 (3) ◽

pp. 349-380 ◽

Cited By ~ 178

Author(s):

Philip Resnik ◽

Noah A. Smith

Keyword(s):

Language Processing ◽

Large Scale ◽

Structural Features ◽

Classification Performance ◽

Internet Archive ◽

Parallel Corpora ◽

Parallel Corpus ◽

Original Algorithm ◽

Parallel Text ◽

The Web

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
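
The idea of using structural features of documents to judge whether two pages are translations of each other can be sketched as follows. This is not the STRAND algorithm itself, whose details are not given in the abstract. It is a simplified stand-in that reduces each page to its sequence of HTML tags and compares the two sequences with Python's `difflib`. The names `TagCollector`, `tag_sequence`, and `structural_similarity` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from typing import List


class TagCollector(HTMLParser):
    """Record opening and closing tags in document order, ignoring text."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def tag_sequence(html: str) -> List[str]:
    """Linearize a page into its list of tags."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(html_a: str, html_b: str) -> float:
    """Score in [0, 1]: how closely the two pages' tag sequences match."""
    matcher = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b),
                              autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    english = "<html><body><h1>Welcome</h1><p>Hello world.</p></body></html>"
    italian = "<html><body><h1>Benvenuti</h1><p>Ciao mondo.</p></body></html>"
    other = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
    print(structural_similarity(english, italian))
    print(structural_similarity(english, other))
```

A score like this could serve as one input feature to a supervised classifier of candidate page pairs, alongside content-based measures; the choice of features and thresholds here is an assumption, not something reported in the article.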

The IJS-ELAN Slovene-English Parallel Corpus

International Journal of Corpus Linguistics ◽

10.1075/ijcl.7.1.01erj ◽

2002 ◽

Vol 7 (1) ◽

pp. 1-20 ◽

Cited By ~ 4

Author(s):

Tomaž Erjavec

Keyword(s):

Text Encoding ◽

Language Engineering ◽

Parallel Corpus ◽

Feature Structures ◽

The Web ◽

The EU

The paper presents an annotated parallel Slovene-English corpus developed in the scope of the EU ELAN project. The IJS-ELAN corpus was compiled to be a widely distributable dataset for language engineering and for translation and terminology studies. The corpus contains 1 million words from fifteen recent terminology-rich texts. The corpus is sentence aligned and word-tagged with context disambiguated morphosyntactic descriptions and lemmas. These descriptions model simple feature structures, the structure of which is shared between Slovene and English. The corpus is encoded according to the Guidelines for Text Encoding and Interchange and is freely available on the Web for downloading. Additionally, access to IJS-ELAN is available via a powerful Web concordancer.

Automatic Acquisition of Large-Scale Academic Bilingual Parallel Corpus from the Web

2009 International Conference on Asian Language Processing ◽

10.1109/ialp.2009.75 ◽

2009 ◽

Author(s):

Han Yong ◽

Li Yu ◽

He Xiaoning ◽

Yang Muyun ◽

Lei Guohua

Keyword(s):

Large Scale ◽

Parallel Corpus ◽

Automatic Acquisition ◽

The Web

Conceptual analysis of parallel corpus collected from the Web

Journal of the American Society for Information Science and Technology ◽

10.1002/asi.20326 ◽

2006 ◽

Vol 57 (5) ◽

pp. 632-644 ◽

Cited By ~ 7

Author(s):

Kar Wing Li ◽

Christopher C. Yang

Keyword(s):

Conceptual Analysis ◽

Parallel Corpus ◽

The Web

The corpus, its users and their needs

International Journal of Corpus Linguistics ◽

10.1075/ijcl.12.3.03san ◽

2007 ◽

Vol 12 (3) ◽

pp. 335-374 ◽

Cited By ~ 5

Author(s):

Diana Santos ◽

Ana Frankenberg-Garcia

Keyword(s):

User Studies ◽

Log Analysis ◽

Web Interface ◽

Language Resources ◽

Parallel Corpus ◽

Null Results ◽

Search Modes ◽

The Web

COMPARA is a bidirectional parallel corpus of English and Portuguese, currently with 3 million words. The corpus was launched in 2000 and at present it is possibly the largest edited parallel corpus publicly available on the Web, with roughly 6,000 corpus queries per month. This paper summarizes an analysis of six years of corpus use. We begin by looking at user studies for language resources, especially corpora, and then we provide a snapshot of COMPARA’s users and their behaviour based on log analysis. Particular emphasis is given to the language interface preferred by users (Portuguese and English are possible), the choice between the Simple and Complex Search modes, the reasons underlying null-results and behaviour after restricted output. The data has pointed us to cases where COMPARA’s Web interface can be improved, and provided insights about our users and the problems they face, although further studies that distinguish between different kinds of users remain necessary.
