Conceptual analysis of parallel corpus collected from the Web

Kar Wing Li; Christopher C. Yang

doi:10.1002/asi.20326

Building an Italian-Chinese Parallel Corpus for Machine Translation from the Web

Proceedings of the 6th EAI International Conference on Smart Objects and Technologies for Social Good, 2020

doi:10.1145/3411170.3411258

Author(s): Rita Tse; Silvia Mirri; Su-Kit Tang; Giovanni Pau; Paola Salomoni

Keyword(s): Machine Translation; Parallel Corpus; The Web

Constructing a Large-Scale English-Persian Parallel Corpus

Meta: Journal des traducteurs, 2009, Vol 54 (1), pp. 181-188

doi:10.7202/029804ar

Cited By ~ 10

Author(s): Tayebeh Mosavi Miangah

Keyword(s): Large Scale; Target Language; Translation Memory; Web Documents; Parallel Corpus; Translation Quality; Text Corpora; Develop Software; General Translation; The Web

In recent years the exploitation of large text corpora to solve various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to build an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is developing software for parallel concordancing, in which a user enters a search string in one language and sees all the citations for that string in that language together with the corresponding sentences in the target language. Our intention is to construct general translation memory software using the present English-Persian parallel corpus.
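The parallel concordancing described in this abstract can be illustrated with a minimal sketch. It assumes the corpus is already sentence-aligned and held in memory; the names `AlignedPair`, `concordance`, and `toy_corpus` are hypothetical and do not come from the paper, and the target sentences are placeholders rather than real Persian text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One sentence-aligned unit of a parallel corpus (hypothetical type)."""
    source: str  # sentence in the language the user searches in
    target: str  # its aligned sentence in the target language


def concordance(corpus, query):
    """Return every aligned pair whose source sentence contains the query.

    Matching is a case-insensitive substring test. This is a deliberate
    simplification (an assumption); a real concordancer would likely
    tokenize and normalize the text first.
    """
    needle = query.casefold()
    return [pair for pair in corpus if needle in pair.source.casefold()]


# Hypothetical toy corpus; target strings are placeholders, not real data.
toy_corpus = [
    AlignedPair("The corpus is open.", "<target sentence 1>"),
    AlignedPair("More material can be added.", "<target sentence 2>"),
    AlignedPair("An open corpus grows over time.", "<target sentence 3>"),
]

hits = concordance(toy_corpus, "open")
for pair in hits:
    # Show each citation next to its aligned target-language sentence.
    print(pair.source, "|", pair.target)
```

Keeping each source sentence and its translation in one record means a hit on the source side yields the target side with no second lookup, which is the behaviour the abstract describes.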

A scalable approach to building a parallel corpus from the web

2011

doi:10.21437/interspeech.2011-554

Author(s): Vivek Kumar Rangarajan Sridhar; Luciano Barbosa; Srinivas Bangalore

Keyword(s): Parallel Corpus; The Web

Automatic Acquisition of Chinese–English Parallel Corpus from the Web

Lecture Notes in Computer Science - Advances in Information Retrieval, 2006, pp. 420-431

doi:10.1007/11735106_37

Cited By ~ 26

Author(s): Ying Zhang; Ke Wu; Jianfeng Gao; Phil Vines

Keyword(s): Parallel Corpus; Automatic Acquisition; The Web

The Web as a Parallel Corpus

Computational Linguistics, 2003, Vol 29 (3), pp. 349-380

doi:10.1162/089120103322711578

Cited By ~ 178

Author(s): Philip Resnik; Noah A. Smith

Keyword(s): Language Processing; Large Scale; Structural Features; Classification Performance; Internet Archive; Parallel Corpora; Parallel Corpus; Original Algorithm; Parallel Text; The Web

Parallel corpora have become an essential resource for work in multilingual natural language processing. In this article, we report on our work using the STRAND system for mining parallel text on the World Wide Web, first reviewing the original algorithm and results and then presenting a set of significant enhancements. These enhancements include the use of supervised learning based on structural features of documents to improve classification performance, a new content-based measure of translational equivalence, and adaptation of the system to take advantage of the Internet Archive for mining parallel text from the Web on a large scale. Finally, the value of these techniques is demonstrated in the construction of a significant parallel corpus for a low-density language pair.
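To make the idea of structural features concrete, the sketch below compares two pages by the order of their HTML tags alone, ignoring the text. It is an illustrative assumption, not the STRAND implementation: the function names and sample pages are hypothetical, and the similarity score is Python's generic `difflib` ratio rather than any feature defined in the article.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the sequence of opening and closing tags, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tags.append("</" + tag + ">")


def tag_sequence(html):
    """Return the page's markup skeleton as a list of tag tokens."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structural_similarity(html_a, html_b):
    """Score in [0, 1]: how closely the two tag sequences line up."""
    matcher = SequenceMatcher(
        None, tag_sequence(html_a), tag_sequence(html_b), autojunk=False
    )
    return matcher.ratio()


# Hypothetical sample pages: two share a layout, the third does not.
page_a = "<html><body><h1>Title</h1><p>Hello</p></body></html>"
page_b = "<html><body><h1>Titre</h1><p>Bonjour</p></body></html>"
page_c = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"

same_layout = structural_similarity(page_a, page_b)
different_layout = structural_similarity(page_a, page_c)
print(same_layout, different_layout)
```

A score like this could serve as one input feature to a classifier that decides whether two pages are translations of each other; the choice of classifier and of any additional features is left open here.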

The IJS-ELAN Slovene-English Parallel Corpus

International Journal of Corpus Linguistics, 2002, Vol 7 (1), pp. 1-20

doi:10.1075/ijcl.7.1.01erj

Cited By ~ 4

Author(s): Tomaž Erjavec

Keyword(s): Text Encoding; Language Engineering; Parallel Corpus; Feature Structures; The Web; The EU

The paper presents an annotated parallel Slovene-English corpus developed in the scope of the EU ELAN project. The IJS-ELAN corpus was compiled to be a widely distributable dataset for language engineering and for translation and terminology studies. The corpus contains 1 million words from fifteen recent terminology-rich texts. The corpus is sentence aligned and word-tagged with context disambiguated morphosyntactic descriptions and lemmas. These descriptions model simple feature structures, the structure of which is shared between Slovene and English. The corpus is encoded according to the Guidelines for Text Encoding and Interchange and is freely available on the Web for downloading. Additionally, access to IJS-ELAN is available via a powerful Web concordancer.

Automatic Acquisition of Large-Scale Academic Bilingual Parallel Corpus from the Web

2009 International Conference on Asian Language Processing, 2009

doi:10.1109/ialp.2009.75

Author(s): Han Yong; Li Yu; He Xiaoning; Yang Muyun; Lei Guohua

Keyword(s): Large Scale; Parallel Corpus; Automatic Acquisition; The Web

Sharing-collaborative economy in tourism: A bibliometric analysis and perspectives for the post-pandemic era

Tourism Economics, 2021, pp. 135481662110357

doi:10.1177/13548166211035712

Author(s): Natalia Vila-Lopez; Inés Küster-Boluda

Keyword(s): Conceptual Analysis; Web Of Science; Citation Index; Sharing Economy; Science Mapping; Future Directions; H Index; Word Analysis; Index Analysis; The Web

Sharing economy research has risen exponentially during the last 4 years. Although several theoretical reviews of this topic have been developed, a conceptual analysis based on bibliometric techniques and science mapping tools is lacking. Within this framework, this article has two aims: (i) to carry out a performance analysis to identify the outstanding themes and (ii) to visually present the scientific structure by topics of research in the sharing-collaborative economy, as well as its evolution, to identify future directions. The resources in the Web of Science Citation Index were used. Intelligent techniques and, more specifically, the SciMAT tool (based on co-word analysis and h-index analysis) were applied using a sample of 940 indexed papers from 2010 to 2020 (with 10,652 global citations). Our results show that the new post-pandemic era requires the sharing economy industry to investigate alternative ways: to improve trust, to innovate, to search for authenticity and experiences, to attend to tourist motivations based on sustainability, and to use big data and manage overtourism.
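The h-index analysis mentioned in this abstract rests on a simple computation, sketched below using the standard definition: the largest h such that h papers each have at least h citations. The function name and the sample citation counts are hypothetical, and nothing here reflects how SciMAT itself is implemented.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break  # counts only decrease from here, so h cannot grow
    return h


# Hypothetical citation counts for the papers grouped under one theme.
theme_citations = [10, 8, 5, 4, 3]
theme_h = h_index(theme_citations)
print(theme_h)
```

Applied per theme, a value like this gives one way to rank research topics by impact rather than by paper count alone.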

Augmented conceptual analysis of the web

CHI '97 Extended Abstracts on Human Factors in Computing Systems: Looking to the Future, 1997

doi:10.1145/1120212.1120359

Author(s): Wendy A. Kellogg; Jakob Nielsen

Keyword(s): Conceptual Analysis; The Web

TRANSCODING BETWEEN HYPER-ANTAGONISTIC MILIEUS: STUDIES ON THE CROSS-PLATFORM RELATIONS BETWEEN RADICAL POLITICAL WEB SUBCULTURES

AoIR Selected Papers of Internet Research, 2020

doi:10.5210/spir.v2020i0.11123

Author(s): Sal Hagen; Marc Tuters; Stijn Peeters; Emillie De Keulenaar; Jack Wilson; ...

Keyword(s): Social Media; Conceptual Analysis; Data Driven; Far Right; Cross Platform; The Cross; Anglo American; Empirical Approaches; The Web; Half Decade

This panel brings together research into the cross-platform relations between radical Web subcultures and how they are constitutive of “hyper-antagonistic” politics in broader Web discourses. The papers share a concern with vernacular practices of “fringe” platforms favoured by an insurgent far-right movement and their relations to more “mainstream” social media. They engage with the concept of “transcoding between milieus” (Deleuze & Guattari 1987, 322) as a means to empirically describe multiple transversal processes across different strata of the Web in which “one milieu serves as the basis for another” (313). All papers ground their conceptual analysis in data-driven empirical approaches using historical datasets ranging from “mainstream” platforms like YouTube, to more “fringe” spaces like 4chan. The papers furthermore all use 4chan’s far-right /pol/ board as a reference point for a vernacular “hyper-antagonistic” style that emerged out of this period – a style that has often been related to the “alt-right”. Together, the four papers in this panel offer insights into the apparent insurgency of far-right subcultures within broader online discourse in the Anglo-American context over the course of the last half decade. Each does so with a particular focus, ranging from subcultural conflict between Tumblr and 4chan, the transcoding of the “Kekistan” meme between 4chan and YouTube, the emergence of far-right vernacular in the comments of Breitbart News, and the robustness of hyper-antagonistic discourse after deplatforming measures.
