Development and Research in Lithuanian Language Technologies (2016–2020)

Frontiers in Artificial Intelligence and Applications - Human Language Technologies – The Baltic Perspective ◽

10.3233/faia200625 ◽

2020 ◽

Author(s):

Andrius Utka ◽

Jurgita Vaičenonienė ◽

Monika Briedienė ◽

Tomas Krilavičius

Keyword(s):

Machine Translation ◽

Language Resources ◽

Research Topics ◽

Research Production ◽

Language Technology ◽

Research Infrastructures ◽

Scientific Papers ◽

Language Technologies

The paper presents an overview of the development and research in Lithuanian language technologies for the period 2016–2020. The most significant national and international LT related initiatives, projects, research infrastructures, language resources and tools are discussed. The paper also surveys research production in the field of language technology for the Lithuanian language. The provided analysis of scientific papers shows that machine translation and speech technologies were the most trending research topics in 2016–2019.

Download Full-text

UPC: An Open Word-Sense Annotated Parallel Corpora for Machine Translation Study

Applied Sciences ◽

10.3390/app10113904 ◽

2020 ◽

Vol 10 (11) ◽

pp. 3904

Author(s):

Van-Hai Vu ◽

Quang-Phuoc Nguyen ◽

Joon-Choul Shin ◽

Cheol-Young Ock

Keyword(s):

Deep Learning ◽

Machine Translation ◽

Ambiguous Word ◽

High Rate ◽

Word Sense ◽

Language Resources ◽

Parallel Corpora ◽

Knowledge Based ◽

Translation Error ◽

Translation Study

Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical-based and deep learning-based) and achieved great results for popular languages. However, the research on it involving low-resource languages such as Korean often suffer from the lack of openly available bilingual language resources. In this research, we built the open extensive parallel corpora for training MT models, named Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora consisting of Korean-English and Korean-Vietnamese datasets. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese parallel corpus consists of over 412 thousand sentence pairs. Furthermore, the high rate of homographs of Korean causes an ambiguous word issue in MT. To address this problem, we developed a powerful word-sense annotation system based on a combination of sub-word conditional probability and knowledge-based methods, named UTagger. We applied UTagger to UPC and used these corpora to train both statistical-based and deep learning-based neural MT systems. The experimental results demonstrated that using UPC, high-quality MT systems (in terms of the Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) score) can be built. Both UPC and UTagger are available for free download and usage.

Download Full-text

Observatory for Language Resources and Machine Translation in Europe – LT_Observatory

Lecture Notes in Computer Science - Future and Emerging Trends in Language Technology. Machine Learning and Big Data ◽

10.1007/978-3-319-69365-1_2 ◽

2017 ◽

pp. 20-37

Author(s):

Bente Maegaard ◽

Claus Povlsen ◽

Sussi Olsen ◽

Lina Henriksen ◽

Margaretha Mazura ◽

...

Keyword(s):

Machine Translation ◽

Language Resources

Download Full-text

Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages

Machine Translation ◽

10.1007/s10590-021-09260-6 ◽

2021 ◽

Author(s):

Tanmai Khanna ◽

Jonathan N. Washington ◽

Francis M. Tyers ◽

Sevilay Bayatlı ◽

Daniel G. Swanson ◽

...

Keyword(s):

Open Source ◽

Machine Translation ◽

Lexical Selection ◽

Rule Based ◽

Low Resource ◽

Language Technology ◽

Language Data ◽

Recursive Structures ◽

Platform Translation ◽

Free Open Source

AbstractThis paper presents an overview of Apertium, a free and open-source rule-based machine translation platform. Translation in Apertium happens through a pipeline of modular tools, and the platform continues to be improved as more language pairs are added. Several advances have been implemented since the last publication, including some new optional modules: a module that allows rules to process recursive structures at the structural transfer stage, a module that deals with contiguous and discontiguous multi-word expressions, and a module that resolves anaphora to aid translation. Also highlighted is the hybridisation of Apertium through statistical modules that augment the pipeline, and statistical methods that augment existing modules. This includes morphological disambiguation, weighted structural transfer, and lexical selection modules that learn from limited data. The paper also discusses how a platform like Apertium can be a critical part of access to language technology for so-called low-resource languages, which might be ignored or deemed unapproachable by popular corpus-based translation technologies. Finally, the paper presents some of the released and unreleased language pairs, concluding with a brief look at some supplementary Apertium tools that prove valuable to users as well as language developers. All Apertium-related code, including language data, is free/open-source and available at https://github.com/apertium.

Download Full-text

Digitizing Humanities in South Africa: Computational linguistic resources, training, and community building

Pop! Public. Open. Participatory ◽

10.54590/pop.2020.007 ◽

2020 ◽

Vol 02 ◽

Author(s):

Rooweither Mabuya ◽

Dimakatso Mathe ◽

Mmasibidi Setaka ◽

Menno van Zaanen

Keyword(s):

Social Sciences ◽

South Africa ◽

Computational Linguistic ◽

South African ◽

Digital Humanities ◽

Community Building ◽

Language Resources ◽

Language Technology ◽

Humanities And Social Sciences ◽

Digital Language

South Africa has eleven official languages. However, not all have received similar amounts of attention. In particular, for many of the languages, only a limited number of digital language resources (data sets and computational tools) exist. This scarcity hinders (computational) research in the fields of humanities and social sciences for these languages. Additionally, using existing computational linguistics tools in a practical setting requires expert knowledge on the usage of these tools. In South Africa, only a small number of people currently have this expertise, further limiting the type of research that relies on computational linguistic tools. The South African Centre for Digital Language Resources (SADiLaR) aims to enable and enhance research in the area of language technology by focusing on the development, management, and distribution of digital language resources for all South African languages. Additionally, it aims to build research capacity, specifically in the field of digital humanities. This requires several challenges to be resolved that we cluster under resources, training, and community building. SADiLaR hosts a repository of existing digital language resources and supports the development of new resources. Additionally, it provides training on the use of these resources, specifically for (but not limited to) researchers in the fields of humanities and social sciences. Through this training, SADiLaR tries to build a community of practice to boost information sharing in the area of digital humanities.

Download Full-text

GlobalMind

Selected Readings on Global Information Technology ◽

10.4018/978-1-60566-116-2.ch014 ◽

2011 ◽

pp. 205-229

Author(s):

Hyemin Chung ◽

Henry Lieberman

Keyword(s):

Cultural Differences ◽

Machine Translation ◽

Common Sense ◽

Cross Cultural ◽

Cultural Communication ◽

Language Resources ◽

Language Differences ◽

Cross Cultural Communication ◽

Similarities And Differences ◽

Common Sense Knowledge

The need for more effective communication between people of different countries has increased as travel and communications bring more of the world’s people together. Communication is often difficult because of both language differences and cultural differences. Attempts to bridge these differences include many attempts to perform machine translation or provide language resources such as dictionaries or phrase books; however, many problems related to cultural and conceptual differences still remain. Automated mechanisms to analyze cultural similarities and differences might be used to improve traditional machine translators and as aids to cross-cultural communication. This article presents an approach to automatically compute cultural differences by comparing databases of common-sense knowledge in different languages and cultures. Global- Mind provides an interface for acquiring databases of common-sense knowledge from users who speak different languages. It implements inference modules to compute the cultural similarities and differences between these databases. In this article, the design of the GlobalMind databases, the implementation of its inference modules, as well as an evaluation of GlobalMind are described.

Download Full-text

Physical and remote access to the European Volcano Research Infrastructures as a strategy to promote the community building: efforts, challenges, and results.

10.5194/egusphere-egu2020-13157 ◽

2020 ◽

Author(s):

Letizia Spampinato ◽

Giuseppe Puglisi

Keyword(s):

Community Building ◽

Volcanic Rocks ◽

Remote Access ◽

Research Topics ◽

Scientific Communities ◽

Research Infrastructures ◽

Physical Access ◽

Selection Of ◽

Volcano Observatories ◽

The Ideal

Indeed, nowadays data sharing via internet is one of the most used approaches to networking scientific communities. However, the opportunity to physically access Research Infrastructures (RIs) and their installations and facilities is potentially the most powerful mean to build up a community. Physically access, in fact, makes the ideal conditions for the RI&#8217;s providers and users to work side by side on specific research topics. This is recently the case of the European trans-national access activities promoted in order to allow and push the volcanology community to use either the volcano observatories, to carry out experiments or fieldworks, or laboratories, for exploiting analytical and computational facilities, belonging to the main European volcano research institutions.The EUROVOLC project has granted the access to 11 RIs for an overall of 45 installations, including single facilities of pools of mobile instrumentation and of laboratories, and remote access to collections of volcanic rocks, of 5 European countries (France, Iceland, Italy, Portugal, and Spain). In the frame of the project, the trans-national access offer has come from 7 partners (IMO, UI, INGV&CNR, CIVISA, IPGP, and CSIC) acting in 7 WPs (13, 14, 16, 16, 17, 18, and 19).The EUROVOLC work-plan has foreseen two calls, one in 2018 and the other in 2019, allowing users to apply for access the RIs, and the effective physical access in 2019 and 2020, respectively. Each call has been managed according to a stepwise process based on an excellence-driven criterion, in which the roles of the various actors and the schedule have been previously defined.This contribution aims at presenting the management and coordination efforts related to the trans-national access activities in the frame of EUROVOLC including the preparation and the launch of the 1st call, the process of the selection of the proposals, the feedback from the management of the 1st call, the preparation of the 2nd call, and a critical analysis for improving the management of the 2nd call.

Download Full-text

Les publications japonaises sur la traduction : un aperçu

Meta Journal des traducteurs ◽

10.7202/002917ar ◽

2002 ◽

Vol 33 (1) ◽

pp. 115-126 ◽

Cited By ~ 3

Author(s):

Daniel Gile

Keyword(s):

Machine Translation ◽

Language Learners ◽

General Public ◽

Public Speaking ◽

Japanese Language ◽

Translation Theory ◽

The World ◽

Scientific Papers ◽

Didactic Texts ◽

Japanese Translation

Summary Japanese publications on translation are markedly more numerous than Western publications. They are aimed at the general public rather than at professionals or academics, and few are truly scientific or academic. They deal with the Japanese context, with hardly any reference to foreign publications, authors, ideas or translation activities. They are also short-lived and disappear from bookstores and publishers' stocks within a few years. Theoretical translation texts are "philosophical" rather than scientific. Didactic texts are often aimed at language learners rather than at would-be translators. Linguistic translation texts are more interesting for the insight they give into the Japanese language and its use than for their contribution to translation theory. Texts that criticize published translations are numerous and very popular, something which is rather unique in the world. Many translation books are highly personal and contain numerous anecdotes from their authors' lives. Interpretation books are interesting, as they are more pragmatic than Western texts on the same subject, and address questions that Western publications seldom or never refer to. Machine translation articles are becoming increasingly popular. They tend to be confined to superficial explanations of the operation of systems and to descriptions of commercial products. Truly scientific papers on MT also exist, but their circulation is limited to academic and technical circles. There are a few periodicals dealing with translation. Most of the articles they carry are written by the same authors and have the same characteristics as the texts described above. On the whole, they are more interesting than translation books, as they are shorter and therefore denser. Articles on translation can also be found in countless books and periodicals on the Japanese language, on linguistics, sociology, public speaking, etc., as well as in weekly and monthly magazines and in other publications. This paper is followed by a list of Japanese texts on translation and by a list of Western language texts on translation of Japanese or on subjects relevant to the understanding of Japanese translation problems.

Download Full-text

The Role of CLARIN in Digital Transformations in the Humanities

International Journal of Humanities and Arts Computing ◽

10.3366/ijhac.2013.0083 ◽

2013 ◽

Vol 7 (1-2) ◽

pp. 89-104 ◽

Cited By ~ 2

Author(s):

Martin Wynne

Keyword(s):

Case Studies ◽

Digital Technologies ◽

Research Infrastructure ◽

Language Resources ◽

Unique Character ◽

Digital Scholarship ◽

New Directions ◽

Research Infrastructures ◽

Distributed Infrastructure

CLARIN is a recently-established research infrastructure which aims to build and sustain services based on language resources and tools. CLARIN aims to support and foster the next generation of research in the humanities, which will make use of advanced digital technologies. A distributed infrastructure is necessary in order to overcome the problems of the current fragmented environment, to create an ecosystem in which data and tools can be connected, and in which innovation will be encouraged. Case studies of early CLARIN demonstrators give a flavour of the possibilities of digital transformations in a number of humanities disciplines, and there is huge potential for important future new directions in literary and linguistic computing. For more widespread, thoroughgoing and effective transformations to take place, builders of infrastructure and researchers will need to negotiate and avoid potential pitfalls, and agree to achieve a certain measure of consensus in terms of priorities, categories and concepts. In the context of current debates about the nature of the humanities and their role in society, it will be necessary for digital humanists to be careful to preserve the unique character and importance of research in the humanities, while moving towards research infrastructures which will facilitate digital scholarship.

Download Full-text

Multilingual Dependency Parsing: Using Machine Translated Texts Instead of Parallel Corpora

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0017 ◽

2014 ◽

Vol 102 (1) ◽

pp. 93-104

Author(s):

Ramasamy Loganathan ◽

Mareček David ◽

Žabokrtský Zdenčk

Keyword(s):

Machine Translation ◽

The Other ◽

Target Language ◽

Grammar Induction ◽

Language Resources ◽

Parallel Corpora ◽

Similar Performance ◽

Part Of Speech ◽

Target Languages ◽

Cross Lingual

Abstract This paper revisits the projection-based approach to dependency grammar induction task. Traditional cross-lingual dependency induction tasks one way or the other, depend on the existence of bitexts or target language tools such as part-of-speech (POS) taggers to obtain reasonable parsing accuracy. In this paper, we transfer dependency parsers using only approximate resources, i.e., machine translated bitexts instead of manually created bitexts. We do this by obtaining the the source side of the text from a machine translation (MT) system and then apply transfer approaches to induce parser for the target languages. We further reduce the need for the availability of labeled target language resources by using unsupervised target tagger. We show that our approach consistently outperforms unsupervised parsers by a bigger margin (8.2% absolute), and results in similar performance when compared with delexicalized transfer parsers.

Download Full-text

Managing mining project documentation using human language technology

The Electronic Library ◽

10.1108/el-11-2017-0239 ◽

2018 ◽

Vol 36 (6) ◽

pp. 993-1009

Author(s):

Aleksandra Tomašević ◽

Ranka Stanković ◽

Miloš Utvić ◽

Ivan Obradović ◽

Božo Kolonja

Keyword(s):

Language Processing ◽

Keyword Search ◽

Semantic Networks ◽

Human Language ◽

Language Resources ◽

Content Type ◽

Efficient Management ◽

Language Technology ◽

Mining Projects ◽

Human Language Technology

Purpose This paper aims to develop a system, which would enable efficient management and exploitation of documentation in electronic form, related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing. Design/methodology/approach The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications, for different user profiles and use-cases. Findings The use of the system is illustrated by examples demonstrating keyword search supported by Web query expansion services, search based on regular expressions, corpus search based on local grammars, followed by extraction of information based on this search and finally, search with lexical masks using domain and semantic markers. Originality/value The presented system is the first software solution for implementation of human language technology in management of documentation from the mining engineering domain, but it is also applicable to other engineering and non-engineering domains. The system is independent of the type of alphabet (Cyrillic and Latin), which makes it applicable to other languages of the Balkan region related to Serbian, and its support for morphological dictionaries can be applied in most morphologically complex languages, such as Slavic languages. Significant search improvements and the efficiency of IE are based on semantic networks and terminology dictionaries, with the support of local grammars.

Download Full-text