The Bulgarian-Polish-Russian parallel corpus

Cognitive Studies | Études cognitives, 2015, pp. 241-254. DOI: 10.11649/cs.2011.015

Author(s): Maksim Duškin, Joanna Satoła-Staśkowiak

Keyword(s): Parallel Corpora, Parallel Corpus, Polish Literature, Slavic Languages, Eastern Group, Language Studies, Western Group, Characteristic Features, Academy Of Sciences, Linguistic Material

The Semantics Laboratory Team of the Institute of Slavic Studies of the Polish Academy of Sciences is planning to begin work on a Bulgarian-Polish-Russian parallel corpus. The three selected languages represent the main groups of Slavic languages: Bulgarian the southern group, Polish the western group, and Russian the eastern group. Our project will be the first parallel corpus of these three languages. The planned corpus will be based on material dating from a single period (the 20th century) and will be synchronic in nature. It will not simply be the sum of separate corpora of the selected languages. One of the problems in creating multilingual parallel corpora is the uneven proportion of translated texts between the selected languages; for example, Polish literature is often translated into Bulgarian, but not vice versa. Bulgarian, Russian and Polish also differ typologically: Bulgarian is an analytic language, whereas Polish and Russian are synthetic. The parallel corpus should have compatible annotation while taking into account the characteristic features of the selected languages. We hope that the Bulgarian-Polish-Russian parallel corpus will serve as a source of linguistic material for contrastive language studies and will prove helpful to linguists, translators, terminologists and students of linguistics. The results of our work will be available on the Internet.

Języki słowiańskie i litewski w korpusach równoległych Clarin-PL (Slavic languages and the Lithuanian language in the Clarin-PL parallel corpora)

Studia z Filologii Polskiej i Słowiańskiej, 2016, Vol 51, pp. 191-217. DOI: 10.11649/sfps.2016.011

Author(s): Violetta Koseska-Toszewa, Roman Roszko

Keyword(s): Corpus Linguistics, Semantic Annotation, Original System, Language Resources, Natural Languages, Parallel Corpora, Parallel Corpus, Slavic Languages, European Languages, Polish Language

The strategic scientific purpose of Clarin ERIC and Clarin-PL is to support humanities research in a multicultural and multilingual Europe. Polish researchers put the emphasis on building a bridge between the Polish language and Polish language technologies on the one hand, and other European languages and their language technologies on the other. So far, the Polish scientific community has focused mainly on Polish-English connections. Clarin-PL has therefore been developing the first multilingual corpora that pair Polish with other Slavic languages and with Lithuanian: the Polish-Bulgarian-Russian Parallel Corpus and the Polish-Lithuanian Parallel Corpus. The parallel corpora created by the ISS PAS Corpus Linguistics and Semantics Team break with the existing “canons” and give scientists access to interlinked multilingual language resources, limited in the first phase to languages of the three Slavic groups and Lithuanian. In the article, the authors present detailed information on their original system for the semantic annotation of scope quantification in multilingual parallel corpora, which has not previously been used in the subject literature. Because of the system’s broad scope and originality, the semantic annotation is carried out manually. The identification of particular values of scope quantification in a sentence, and the attempts presented here to record them, are supported by long-term research conducted by an international team of linguists and computer scientists/mathematicians working on the quantification of names, time and aspect in natural languages.

STRATEGIES OF MODERN LINGUISTIC RESEARCH AND TARGETS OF UKRAINIAN LANGUAGE STUDIES PERFORMED IN THE NATIONAL ACADEMY OF SCIENCES OF UKRAINE

Visnik Nacional'noi academii nauk Ukraini, 2018, Vol 06, pp. 64-74. DOI: 10.15407/visn2018.06.064

Author(s): P.Yu. Hrytsenko

Keyword(s): National Academy Of Sciences, Linguistic Research, Language Studies, Academy Of Sciences

Young Scholars Conference “Slavic World: Commonality and Diversity”. 21–22 May 2019. Session “Linguistics”

Slavic World in the Third Millennium, 2019, Vol 14 (1-2), pp. 295-297. DOI: 10.31168/2412-6446.2019.14.1-2.21

Author(s): Sergej A. Borisov

Keyword(s): Czech Republic, Middle Ages, The Czech Republic, International Context, Slavic Languages, Russian Academy Of Sciences, Current State, Wide Range, History Of, Academy Of Sciences

For more than twenty years, the Institute of Slavic Studies of the Russian Academy of Sciences has celebrated the Day of Slavic Writing and Culture with a traditional scholarly conference. Since 2014, it has been held in a young scholars’ format. In 2019, participants from Moscow, St. Petersburg, Kazan, Togliatti, Tyumen, Yekaterinburg, and Rostov-on-Don, as well as from Slovakia, the Czech Republic, Hungary, and Romania, continued this tradition. A wide range of problems related to the history of the Slavic peoples from the Middle Ages to the present, in national, regional and international contexts, was discussed once again. Participants talked about the typology of Slavic languages and dialects, linguo-geography, and socio- and ethnolinguistics, and analyzed the formation, development, current state, and prospects of Slavic literatures, among other topics.

Rutile Mineral Chemistry and Zr-in-Rutile Thermometry in Provenance Study of Albian (Uppermost Lower Cretaceous) Terrigenous Quartz Sands and Sandstones in Southern Extra-Carpathian Poland

Minerals, 2021, Vol 11 (6), pp. 553. DOI: 10.3390/min11060553

Author(s): Jakub Kotowski, Krzysztof Nejbert, Danuta Olszewska-Nejbert

Keyword(s): High Temperature, Source Area, Bimodal Distribution, Provenance Study, Eastern Group, Western Group, Quartz Sands, Probable Source, Eclogite Facies, Zr In Rutile

The geochemistry of detrital rutile grains, which are extremely resistant to weathering, was used in a provenance study of the transgressive Albian quartz sands in the southern part of extra-Carpathian Poland. Rutile grains were sampled from eight outcrops and four boreholes located on the Miechów, Szydłowiec, and Puławy Segments. The crystallization temperatures of the rutile grains, calculated using a Zr-in-rutile geothermometer, allowed the study area to be divided into three parts: western, central, and eastern. The western group of samples, located in the Miechów Segment, is characterized by a polymodal distribution of rutile crystallization temperatures (700–800 °C, 550–600 °C, and c. 900 °C), with a significant predominance of high-temperature forms and a clear prevalence of metapelitic over metamafic rutile. The eastern group of samples, corresponding to the Lublin Area, is monomodal, with crystallization temperatures peaking at 550–600 °C. The contents of metapelitic to metamafic rutile in the study area are comparable. The central group of rutile samples, with a bimodal distribution (550–600 °C and 850–950 °C), most likely represents a mixing zone, with a visible influence from the western and, to a lesser extent, the eastern group. The most probable source area for the western and central groups seems to be granulite and high-temperature eclogite facies rocks from the Bohemian Massif. The most probable source area for the eastern group of rutiles seems to be amphibolites and low-temperature eclogite facies rocks, probably derived from the southern part of the Baltic Shield.

25 years of the Institute of Slavic Studies of the Slovak Academy of Sciences

Slavic World in the Third Millennium, 2020, Vol 15 (3-4), pp. 226-235. DOI: 10.31168/2412-6446.2020.15.3-4.16

Author(s): Marina M. Valentsova, Elena S. Uzeneva

Keyword(s): Middle Ages, Scientific Work, 25th Anniversary, Religious Texts, Language And Culture, Slavic Languages, Language History, History Of, Academy Of Sciences, Slavic Studies

The essay was written to mark the 25th anniversary of the Slavic Institute named after Jan Stanislav SAS (Bratislava). The Institute was founded to conduct interdisciplinary research on the relationships of the Slovak language and culture with other Slavic languages and cultures, as well as to study the Slovak-Latin, Slovak-Hungarian, and Slovak-German cultural and linguistic interactions in ancient times and the Middle Ages. The article introduces the main milestones in the formation and development of the Institute, its employees, the directions of their scientific work, and their significant publications. The main areas of research of the Slavic Institute (initially the Slavic Cabinet) cover linguistics (lexicography, history of language), history, folklore, cultural studies, musicology, and textology. Much attention is paid to the annotated translation of foreign religious texts into Slovak. A valuable contribution of the Institute to Slavic Studies is the creation of a database of Cyrillic and Latin handwritten and printed texts related to the Byzantine-Slavic tradition in Slovakia.

Gábor Hamza, Origine e sviluppo degli ordinamenti giusprivatistici moderni in base alla tradizione del diritto romano (Santiago de Compostela: Andavira Editora, 2013)

Nordicum-Mediterraneum, 2014, Vol 9 (1). DOI: 10.33112/nm.9.1.7

Author(s): Ádám Boóc

Keyword(s): Hungarian Academy, Roman Law, Full Professor, Private Law, Santiago De Compostela, Italian Language, Language Studies, Law Faculty, Ordinary Member, Academy Of Sciences

The new work by Gábor Hamza, ordinary member of the Hungarian Academy of Sciences and Full Professor of Roman Law (Faculty of Law, Eötvös Loránd University, Budapest), published in Italian in the fall of 2013, studies the formation and development of modern private law systems based on the tradition of Roman law.

Semantics, contrastive linguistics and parallel corpora

Cognitive Studies | Études cognitives, 2014, pp. 85-100. DOI: 10.11649/cs.2014.009

Author(s): Violetta Koseska

Keyword(s): Lexical Semantics, Semantic Annotation, Semantic Structure, Automatic Annotation, Parallel Corpora, Parallel Corpus, Linguistic Form, Semantic Categories, Contrastive Linguistics

In view of the ambiguity of the term “semantics”, the author shows the differences between traditional lexical semantics and contemporary semantics in the light of various semantic schools. She treats semantics differently in connection with contrastive studies, where the description must necessarily proceed from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement of theoretical contrastive studies necessitates the construction of a semantic interlanguage, rather than merely singling out universal semantic categories expressed by various language means. Such studies can be strongly supported by parallel corpora. However, to make the corpora useful for linguists in manual and computer translation, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation, which unfortunately has to be done manually. In the article we focus on semantic annotation concerning time, aspect and the quantification of names and predicates in the whole semantic structure of the sentence, using the “Polish-Bulgarian-Russian parallel corpus” as an example.

The diversity of lexical functions in Bulgarian and Russian: an approach to compatible digital comparative lexicography

Cognitive Studies | Études cognitives, 2015, pp. 53-70. DOI: 10.11649/cs.2010.003

Author(s): Svetlana Timoshenko, Olga Shemanaeva

Keyword(s): Language Learning, Word Sense Disambiguation, Important Application, Computer Assisted, Computer Assisted Language Learning, Word Sense, Slavic Languages, Transmission Problems, Sense Disambiguation, Academy Of Sciences

This paper presents an approach to the creation of a Russian-Bulgarian digital dictionary of collocations using the apparatus of lexical functions. The project is aimed not only at high-quality translation and word sense disambiguation but also at cross-linguistic analysis and at comparing the semantics and combinability of words in Slavic languages (here: Russian and Bulgarian) by means of digital lexicographical data. Another important application is computer-assisted language learning: Bulgarian data can be incorporated into the educational project being developed for Russian and English at the Institute for Information Transmission Problems of the Russian Academy of Sciences.

Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation

Computational Intelligence and Neuroscience, 2021, Vol 2021, pp. 1-10. DOI: 10.1155/2021/6682385

Author(s): Michael Adjeisah, Guohua Liu, Douglas Omwenga Nyabuga, Richard Nuetey Nortey, Jinling Song

Keyword(s): Machine Translation, Language Processing, Training Data, Target Language, Similarity Metrics, Mahalanobis Distances, Parallel Corpora, Parallel Corpus, Low Resource, Sentence Level

Scaling natural language processing (NLP) to low-resource languages in order to improve machine translation (MT) performance remains an open problem. This research contributes to the domain with low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often hard to know what a good-quality corpus looks like in low-resource conditions, especially where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results across varying amounts of available parallel data demonstrate that injecting a pseudoparallel corpus and filtering extensively with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach shows substantial improvements in BLEU and TER scores.
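
The filtering step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-pair feature vectors, the function names `squared_mahalanobis` and `filter_pairs`, the `keep_fraction` cut-off and the toy data are all assumptions made for illustration. The idea shown is only the general one: score each sentence pair by its squared Mahalanobis distance from the mean feature vector and discard the most distant pairs.

```python
import numpy as np


def squared_mahalanobis(features: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the mean of all rows."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    centered = features - mean
    # Row-wise quadratic form: centered[i] @ inv_cov @ centered[i]
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)


def filter_pairs(pairs, features, keep_fraction=0.9):
    """Keep pairs whose distance is at or below the keep_fraction quantile."""
    distances = squared_mahalanobis(features)
    threshold = np.quantile(distances, keep_fraction)
    return [pair for pair, d in zip(pairs, distances) if d <= threshold]


# Toy data (hypothetical): one 2-D feature vector per sentence pair,
# e.g. a length-based and a similarity-based score. The last row is an outlier.
features = np.array(
    [[1, 0], [-1, 0], [0, 1], [0, -1], [1, 1], [-1, -1], [1, -1], [-1, 1], [10, 10]],
    dtype=float,
)
pairs = [(f"src {i}", f"tgt {i}") for i in range(len(features))]
kept = filter_pairs(pairs, features, keep_fraction=0.9)
```

In this toy run the outlying ninth pair is the one removed. Which features to compute for real sentence pairs, and where to set the cut-off, are choices the abstract does not specify.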

Arabic-English Parallel Corpus: A New Resource for Translation Training and Language Teaching

Arab World English Journal, 2017. DOI: 10.31235/osf.io/rek3w

Author(s): Hind M. Alotaibi

Keyword(s): Language Teaching, Data Driven, Text Segmentation, Web Interface, King Saud University, Parallel Corpora, Parallel Corpus, Source Language, User Friendly, Ongoing Project

Parallel corpora can be defined as collections of aligned, translated texts of two or more languages. They play a major role in translation and contrastive studies, and are also becoming popular in translation training and language teaching, with the advent of the data-driven learning (DDL) approach. Despite their significance, however, Arabic seems to lack a satisfactory general-use parallel corpus resource. The literature describes few Arabic–English parallel corpora, and these few are usually inaccurate and/or expensive. Some are small in size, while others are restricted in terms of genre, failing to meet the requirements of many academics and researchers. This paper describes an ongoing project at the College of Languages and Translation, King Saud University, to compile a 10-million-word Arabic–English parallel corpus to be used as a resource for translation training and language teaching. The bidirectional corpus can be used to compare translated and source language and identify differences. The corpus has been manually verified at different stages, including translation, text segmentation, alignment, and file preparation; it is available as full-text in XML format and through a user-friendly web interface that provides a concordancer to support bilingual search queries and several filtering options.
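
As a rough illustration of what a bilingual concordance query over aligned text does, the sketch below searches one side of a list of aligned sentence pairs and returns each hit together with its translation. It is an assumption-laden toy, not the project's interface: the `AlignedPair` type, the `concordance` function and the English-French sample sentences are hypothetical stand-ins, and the corpus's actual XML schema is not described in the abstract.

```python
import re
from typing import NamedTuple


class AlignedPair(NamedTuple):
    """One aligned unit: a source sentence and its translation."""
    source: str
    target: str


def tokens(text: str) -> list[str]:
    """Lower-cased word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def concordance(pairs, query, side="source"):
    """Return every aligned pair whose chosen side contains the query word."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    needle = query.lower()
    return [pair for pair in pairs if needle in tokens(getattr(pair, side))]


# Toy data (hypothetical), standing in for real corpus content.
toy_corpus = [
    AlignedPair("The cat sleeps.", "Le chat dort."),
    AlignedPair("The dog sleeps.", "Le chien dort."),
    AlignedPair("The cat eats.", "Le chat mange."),
]

hits = concordance(toy_corpus, "cat")
```

Because each hit carries both sides of the alignment, the same function supports searching in either language by switching `side`.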
