scholarly journals The Bulgarian-Polish-Russian parallel corpus

2015 ◽  
pp. 241-254
Author(s):  
Maksim Duškin ◽  
Joanna Satoła-Staśkowiak

The Bulgarian-Polish-Russian parallel corpus

The Semantics Laboratory Team of the Institute of Slavic Studies of the Polish Academy of Sciences is planning to begin work on a Bulgarian-Polish-Russian parallel corpus. The three selected languages represent the main groups of Slavic languages: Bulgarian the southern group, Polish the western group, and Russian the eastern group. Our project will be the first parallel corpus of these three languages. The planned corpus will be based on material dating from a single period (the 20th century) and will be synchronic in nature. The project will not be merely the sum of separate corpora of the selected languages.

One of the problems in creating multilingual parallel corpora is the uneven proportion of translated texts between the selected languages; for example, Polish literature is often translated into Bulgarian, but not vice versa.

Bulgarian, Russian and Polish differ typologically: Bulgarian is an analytic language, while Polish and Russian are synthetic. The parallel corpus should have compatible annotation while taking into account the characteristic features of the selected languages.

We hope that the Bulgarian-Polish-Russian parallel corpus will serve as a source of linguistic material for contrastive language studies and will be of considerable help to linguists, translators, terminologists and students of linguistics. The results of our work will be available on the Internet.

2016 ◽  
Vol 51 ◽  
pp. 191-217
Author(s):  
Violetta Koseska-Toszewa ◽  
Roman Roszko

Slavic languages and the Lithuanian language in the Clarin-PL parallel corpora (Polish title: Języki słowiańskie i litewski w korpusach równoległych Clarin-PL)

The strategic scientific purpose of Clarin ERIC and Clarin-PL is to support humanities research in a multicultural and multilingual Europe. Polish researchers emphasize building a bridge between the Polish language and Polish language technologies on the one hand, and other European languages and the technologies developed for them on the other. So far, the Polish scientific community has focused mainly on Polish-English connections. Clarin-PL has been developing the first and only multilingual corpora of the Polish language in conjunction with other Slavic languages and the Lithuanian language: the Polish-Bulgarian-Russian Parallel Corpus and the Polish-Lithuanian Parallel Corpus. The parallel corpora created by the ISS PAS Corpus Linguistics and Semantics Team break through the existing "canons" and give scientists access to interlinked multilingual language resources, limited in the first phase to the languages of the three Slavic groups and the Lithuanian language. In the article, the authors present very detailed information on their original system of semantic annotation of scope quantification in multilingual parallel corpora, hitherto unused in the subject literature. Because of the system's broad scope and novelty, the semantic annotation is carried out manually. The identification of particular values of scope quantification in a sentence, and the attempts to record it presented here, are supported by long-term research conducted by an international team of linguists and computer scientists / mathematicians working on the quantification of names, time and aspect in natural languages.


2019 ◽  
Vol 14 (1-2) ◽  
pp. 295-297
Author(s):  
Sergej A. Borisov

For more than twenty years, the Institute of Slavic Studies of the Russian Academy of Sciences has celebrated the Day of Slavic Writing and Culture with a traditional scholarly conference. Since 2014, it has been held in a young scholars' format. In 2019, participants from Moscow, St. Petersburg, Kazan, Togliatti, Tyumen, Yekaterinburg, and Rostov-on-Don, as well as from Slovakia, the Czech Republic, Hungary, and Romania, continued this tradition. A wide range of problems related to the history of the Slavic peoples from the Middle Ages to the present, in national, regional and international contexts, was discussed once again. Participants talked about the typology of Slavic languages and dialects, linguistic geography, and socio- and ethnolinguistics, and analyzed the formation, development, current state, and prospects of Slavic literatures, among other topics.


Minerals ◽  
2021 ◽  
Vol 11 (6) ◽  
pp. 553
Author(s):  
Jakub Kotowski ◽  
Krzysztof Nejbert ◽  
Danuta Olszewska-Nejbert

The geochemistry of detrital rutile grains, which are extremely resistant to weathering, was used in a provenance study of the transgressive Albian quartz sands in the southern part of extra-Carpathian Poland. Rutile grains were sampled from eight outcrops and four boreholes located on the Miechów, Szydłowiec, and Puławy Segments. The crystallization temperatures of the rutile grains, calculated using a Zr-in-rutile geothermometer, allowed for the division of the study area into three parts: western, central, and eastern. The western group of samples, located in the Miechów Segment, is characterized by a polymodal distribution of rutile crystallization temperatures (700–800 °C; 550–600 °C, and c. 900 °C) with a significant predominance of high-temperature forms, and with a clear prevalence of metapelitic over metamafic rutile. The eastern group of samples, corresponding to the Lublin Area, is monomodal and their crystallization temperatures peak at 550–600 °C. The contents of metapelitic to metamafic rutile in the study area are comparable. The central group of rutile samples with bimodal distribution (550–600 °C and 850–950 °C) most likely represents a mixing zone, with a visible influence from the western and, to a lesser extent, the eastern group. The most probable source area for the western and the central groups seems to be granulite and high-temperature eclogite facies rocks from the Bohemian Massif. The most probable source area for the eastern group of rutiles seems to be amphibolites and low temperature eclogite facies rocks, probably derived from the southern part of the Baltic Shield.


2020 ◽  
Vol 15 (3-4) ◽  
pp. 226-235
Author(s):  
Marina M. Valentsova ◽  
Elena S. Uzeneva

The essay was written to mark the 25th anniversary of the Slavic Institute named after Jan Stanislav SAS (Bratislava). The Institute was founded to conduct interdisciplinary research on the relationships of the Slovak language and culture with other Slavic languages and cultures, as well as to study the Slovak-Latin, Slovak-Hungarian, and Slovak-German cultural and linguistic interactions in ancient times and the Middle Ages. The article introduces the main milestones in the formation and development of the Institute, its employees, the directions of their scientific work, and their significant publications. The main areas of research of the Slavic Institute (initially the Slavic Cabinet) cover linguistics (lexicography, history of language), history, folklore, cultural studies, musicology, and textology. Much attention is paid to the annotated translation of foreign religious texts into Slovak. A valuable contribution of the Institute to Slavic Studies is the creation of a database of Cyrillic and Latin handwritten and printed texts related to the Byzantine-Slavic tradition in Slovakia.


2014 ◽  
Vol 9 (1) ◽  
Author(s):  
Ádám Boóc

The new work by Gábor Hamza, ordinary Member of the Hungarian Academy of Sciences and Full Professor of Roman Law (Faculty of Law of the Eötvös Loránd University [Budapest]), which was published in Italian in the fall of 2013, studies the formation and development of modern private law systems based on the tradition of Roman Law.


2014 ◽  
pp. 85-100
Author(s):  
Violetta Koseska

Semantics, contrastive linguistics and parallel corpora

In view of the ambiguity of the term "semantics", the author shows the differences between traditional lexical semantics and contemporary semantics in the light of various semantic schools. She treats semantics differently in connection with contrastive studies, where the description must necessarily proceed from the meaning towards the linguistic form, whereas in traditional contrastive studies the description proceeded from the form towards the meaning. This requirement of theoretical contrastive studies necessitates the construction of a semantic interlanguage, rather than merely singling out universal semantic categories expressed by various linguistic means. Such studies can be strongly supported by parallel corpora. However, to make them useful for linguists in manual and computer translation, as well as in the development of dictionaries, including online ones, we need not only formal, often automatic, annotation of texts, but also semantic annotation, which unfortunately is manual. In the article we focus on semantic annotation concerning time, aspect and the quantification of names and predicates in the whole semantic structure of the sentence, using the "Polish-Bulgarian-Russian parallel corpus" as an example.


2015 ◽  
pp. 53-70
Author(s):  
Svetlana Timoshenko ◽  
Olga Shemanaeva

The diversity of lexical functions in Bulgarian and Russian: an approach to compatible digital comparative lexicography

This paper presents an approach to the creation of a Russian-Bulgarian digital dictionary of collocations using the apparatus of lexical functions. The project is aimed not only at high-quality translation and word sense disambiguation but also at cross-linguistic analysis and at comparing the semantics and compatibility of words in Slavic languages (here: Russian and Bulgarian) by means of digital lexicographical data. Another important application is computer-assisted language learning: Bulgarian data can be incorporated into the educational project being developed for Russian and English at the Institute for Information Transmission Problems of the Russian Academy of Sciences.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Michael Adjeisah ◽  
Guohua Liu ◽  
Douglas Omwenga Nyabuga ◽  
Richard Nuetey Nortey ◽  
Jinling Song

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain with a low-resource English-Twi translation system based on filtered synthetic parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve MT performance in such low-resource language pairs, we propose to expand the training data by injecting a synthetic parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel data demonstrate that injecting a pseudoparallel corpus and filtering extensively with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach shows substantial gains in BLEU and TER scores.
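
The abstract does not say which features the squared Mahalanobis distance is computed over, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes, purely for illustration, that each sentence pair is described by two features (source and target length in tokens); the function names and the threshold are hypothetical. Pairs whose squared distance from the corpus mean exceeds the threshold are treated as unlikely to be parallel and dropped.

```python
from statistics import mean


def squared_mahalanobis_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    # Entries of the sample covariance matrix (denominator n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular; features are collinear")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(Sigma) * [dx dy]^T, with the 2x2 inverse written out.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def filter_pairs(pairs, threshold):
    """Keep sentence pairs whose squared distance is at most `threshold`.

    Assumption for illustration only: the features are whitespace token
    counts of the source and target sentence.
    """
    features = [(len(src.split()), len(tgt.split())) for src, tgt in pairs]
    distances = squared_mahalanobis_2d(features)
    return [pair for pair, d2 in zip(pairs, distances) if d2 <= threshold]
```

In practice the feature vector would more plausibly come from sentence embeddings or translation-model scores, and the threshold would be tuned on held-out data; the two-feature version is used here only so that the covariance inverse can be written out explicitly.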


2017 ◽  
Author(s):  
Arab World English Journal ◽  
Hind M. Alotaibi

Parallel corpora can be defined as collections of aligned, translated texts of two or more languages. They play a major role in translation and contrastive studies, and are also becoming popular in translation training and language teaching, with the advent of the data-driven learning (DDL) approach. Despite their significance, however, Arabic seems to lack a satisfactory general-use parallel corpus resource. The literature describes few Arabic–English parallel corpora, and these few are usually inaccurate and/or expensive. Some are small in size, while others are restricted in terms of genre, failing to meet the requirements of many academics and researchers. This paper describes an ongoing project at the College of Languages and Translation, King Saud University, to compile a 10-million-word Arabic–English parallel corpus to be used as a resource for translation training and language teaching. The bidirectional corpus can be used to compare translated and source language and identify differences. The corpus has been manually verified at different stages, including translation, text segmentation, alignment, and file preparation; it is available as full-text in XML format and through a user-friendly web interface that provides a concordancer to support bilingual search queries and several filtering options.
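
To make the idea of a bilingual concordancer concrete, here is a minimal sketch of a keyword-in-context search over aligned sentence pairs. It is not the project's interface or XML schema, neither of which is described in the abstract: the `concordance` function, the in-memory list of pairs, and the toy sentences are all assumptions made for illustration.

```python
import re


def concordance(aligned_pairs, query, side="source", width=20):
    """Return (left context, match, right context, aligned sentence) tuples.

    `aligned_pairs` is a list of (source, target) sentence pairs; `side`
    selects which half is searched. Hypothetical helper, for illustration.
    """
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    idx = 0 if side == "source" else 1
    hits = []
    for pair in aligned_pairs:
        text = pair[idx]
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            # Each hit carries the aligned sentence from the other language.
            hits.append((left, m.group(0), right, pair[1 - idx]))
    return hits


# Toy aligned data, invented for this example.
pairs = [
    ("The book is on the table.", "الكتاب على الطاولة."),
    ("I read a book.", "قرأت كتابا."),
    ("She bought a pen.", "اشترت قلما."),
]

for left, match, right, translation in concordance(pairs, "book"):
    print(f"{left}[{match}]{right}  ||  {translation}")
```

A real concordancer over a 10-million-word corpus would need an index rather than a linear scan, plus the filtering options the abstract mentions; the sketch only shows how sentence alignment lets a hit in one language be displayed alongside its translation.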

