source word Latest Research Papers

This paper addresses the morphological word formation process that is known as clipping. In English, that process yields shortened word forms such as lab (< laboratory), exam (< examination), or gator (< alligator). It is frequently argued (Davy 2000, Durkin 2009, Haspelmath & Sims 2010, Don 2014) that clipping is highly variable and that it is difficult to predict how a given source word will be shortened. We draw on recent work (Lappe 2007, Jamet 2009, Berg 2011, Alber & Arndt-Lappe 2012, Arndt-Lappe 2018) in order to challenge that view. Our main hypothesis is that English clipping follows predictable tendencies, that these tendencies can be captured by a probabilistic, multifactorial model, and that the features of that model can be explained functionally in terms of cognitive, discourse-pragmatic, and phonological factors. Cognitive factors include the principle of least effort (Zipf 1949), an important discourse-pragmatic factor is the recoverability of the source word (Tournier 1985), and phonological factors include issues of stress and syllable structure (Lappe 2007). While the individual influence of these factors on clipping has been recognized, their interaction and their relative importance remains to be fully understood. The empirical analysis in this paper will use Hierarchical Configural Frequency Analysis (Krauth & Lienert 1973, Gries 2008) on the basis of a large, newly compiled database of more than 2000 English clippings. Our analysis allows us to detect regularities in the way speakers of English create clippings. We argue that there are several English clipping schemas that are optimized for processability.

Download Full-text

“Brexit” models in the Russian-language Internet discourse

XLinguae ◽

10.18355/xl.2021.14.02.20 ◽

2021 ◽

Vol 14 (2) ◽

pp. 276-285

Author(s):

Dinara G. Vasbieva ◽

Tatiana V. Kapshukova ◽

Tursunai Ibragimova ◽

Aitbayeva Nursaule ◽

Zhanat Bissenbayeva

Keyword(s):

Word Formation ◽

Russian Language ◽

The Internet ◽

The Subject ◽

The Russian Language ◽

Source Word

The paper discusses the linguistic impact of Brexit on the Russian-language Internet discourse as this event has generated a myriad of neologisms in English. The present study aims to identify the composition of Brexit-induced neologisms whose source word is -exit and to describe the features of the reception of the analyzed units at the morphological and word-formation levels in the Russian-speaking segment of the Internet. The subject of the research is an assimilation of the Brexit model in the Russian language. The findings of this study indicate that the features of the reception of -exit derivatives in the Russian language were revealed in the aspect of the morphemization of the -exit component.

Download Full-text

Lexicon Development for COVID-19-related Concepts Using Open-source Word Embedding Sources: An Intrinsic and Extrinsic Evaluation

JMIR Medical Informatics ◽

10.2196/21679 ◽

2021 ◽

Vol 9 (2) ◽

pp. e21679

Author(s):

Soham Parikh ◽

Anahita Davoudi ◽

Shun Yu ◽

Carolina Giraldo ◽

Emily Schriver ◽

...

Keyword(s):

Acute Respiratory Distress Syndrome ◽

Respiratory Distress Syndrome ◽

Respiratory Distress ◽

Open Source ◽

Distress Syndrome ◽

Prediction Models ◽

Word Embedding ◽

Word Embeddings ◽

Clinical Text ◽

Source Word

Background Scientists are developing new computational methods and prediction models to better clinically understand COVID-19 prevalence, treatment efficacy, and patient outcomes. These efforts could be improved by leveraging documented COVID-19–related symptoms, findings, and disorders from clinical text sources in an electronic health record. Word embeddings can identify terms related to these clinical concepts from both the biomedical and nonbiomedical domains, and are being shared with the open-source community at large. However, it’s unclear how useful openly available word embeddings are for developing lexicons for COVID-19–related concepts. Objective Given an initial lexicon of COVID-19–related terms, this study aims to characterize the returned terms by similarity across various open-source word embeddings and determine common semantic and syntactic patterns between the COVID-19 queried terms and returned terms specific to the word embedding source. Methods We compared seven openly available word embedding sources. Using a series of COVID-19–related terms for associated symptoms, findings, and disorders, we conducted an interannotator agreement study to determine how accurately the most similar returned terms could be classified according to semantic types by three annotators. We conducted a qualitative study of COVID-19 queried terms and their returned terms to detect informative patterns for constructing lexicons. We demonstrated the utility of applying such learned synonyms to discharge summaries by reporting the proportion of patients identified by concept among three patient cohorts: pneumonia (n=6410), acute respiratory distress syndrome (n=8647), and COVID-19 (n=2397). Results We observed high pairwise interannotator agreement (Cohen kappa) for symptoms (0.86-0.99), findings (0.93-0.99), and disorders (0.93-0.99). Word embedding sources generated based on characters tend to return more synonyms (mean count of 7.2 synonyms) compared to token-based embedding sources (mean counts range from 2.0 to 3.4). Word embedding sources queried using a qualifier term (eg, dry cough or muscle pain) more often returned qualifiers of the similar semantic type (eg, “dry” returns consistency qualifiers like “wet” and “runny”) compared to a single term (eg, cough or pain) queries. A higher proportion of patients had documented fever (0.61-0.84), cough (0.41-0.55), shortness of breath (0.40-0.59), and hypoxia (0.51-0.56) retrieved than other clinical features. Terms for dry cough returned a higher proportion of patients with COVID-19 (0.07) than the pneumonia (0.05) and acute respiratory distress syndrome (0.03) populations. Conclusions Word embeddings are valuable technology for learning related terms, including synonyms. When leveraging openly available word embedding sources, choices made for the construction of the word embeddings can significantly influence the words learned.

Download Full-text

An Intrinsic and Extrinsic Evaluation of Learned COVID-19 Concepts using Open-Source Word Embedding Sources

10.1101/2020.12.29.20249005 ◽

2021 ◽

Author(s):

Soham Parikh ◽

Anahita Davoudi ◽

Shun Yu ◽

Carolina Giraldo ◽

Emily Schriver ◽

...

Keyword(s):

Open Source ◽

Prediction Models ◽

Word Embedding ◽

Word Embeddings ◽

Clinical Text ◽

Single Term ◽

Semantic Types ◽

Agreement Study ◽

Source Word ◽

Discharge Summaries

IntroductionScientists are developing new computational methods and prediction models to better clinically understand COVID-19 prevalence, treatment efficacy, and patient outcomes. These efforts could be improved by leveraging documented, COVID-19-related symptoms, findings, and disorders from clinical text sources in the electronic health record. Word embeddings can identify terms related to these clinical concepts from both the biomedical and non-biomedical domains and are being shared with the open-source community at large. However, it’s unclear how useful openly-available word embeddings are for developing lexicons for COVID-19-related concepts.ObjectiveGiven an initial lexicon of COVID-19-related terms, characterize the returned terms by similarity across various, open-source word embeddings and determine common semantic and syntactic patterns between the COVID-19 queried terms and returned terms specific to word embedding source.Materials and MethodsWe compared 7 openly-available word embedding sources. Using a series of COVID-19-related terms for associated symptoms, findings, and disorders, we conducted an inter-annotator agreement study to determine how accurately the most semantically similar returned terms could be classified according to semantic types by three annotators. We conducted a qualitative study of COVID-19 queried terms and their returned terms to identify useful patterns for constructing lexicons. We demonstrated the utility of applying such terms to discharge summaries by reporting the proportion of patients identified by concept for pneumonia, acute respiratory distress syndrome, and COVID-19 cohorts.ResultsWe observed high, pairwise inter-annotator agreement (Cohen’s Kappa) for symptoms (0.86 to 0.99), findings (0.93 to 0.99), and disorders (0.93 to 0.99). Word embedding sources generated based on characters tend to return more lexical variants and synonyms; in contrast, embeddings based on tokens more often return a variety of semantic types. Word embedding sources queried using an adjective phrase compared to a single term (e.g., dry cough vs. cough; muscle pain vs. pain) are more likely to return qualifiers of the same semantic type (e.g., “dry” returns consistency qualifiers like “wet”, “runny”). Terms for fever, cough, shortness of breath, and hypoxia retrieved a higher proportion of patients than other clinical features. Terms for dry cough returned a higher proportion of COVID-19 patients than pneumonia and ARDS populations.DiscussionWord embeddings are a valuable technology for learning terms, including synonyms. When leveraging openly-available word embedding sources, choices made for the construction of the word embeddings can significantly influence the phrases returned.

Download Full-text

Evaluation of Source to Target and Target to Source Word Alignment for English to Hindi

Algorithms for Intelligent Systems - Proceedings of International Conference on Innovations in Information and Communication Technologies ◽

10.1007/978-981-16-0873-5_12 ◽

2021 ◽

pp. 133-144

Author(s):

Arun R. Babhulgaonkar ◽

Shefali P. Sonavane

Keyword(s):

Word Alignment ◽

Source Word

Download Full-text

Bohoričev prispevek slovenskemu slovaropisju

Stati inu obstati revija za vprašanja protestantizma ◽

10.26493/2590-9754.16(32)247-267 ◽

2020 ◽

Vol 16 (32) ◽

pp. 247-267

Author(s):

Majda Merše

Keyword(s):

Foreign Language ◽

Semantic Relations ◽

16Th Century ◽

Semantic Structure ◽

Literary Language ◽

Source Language ◽

Semantic Difference ◽

Different Types ◽

Further Development ◽

Source Word

Bohorič’s Contribution to Slovene Lexicography Lexicography was one of Bohorič’s central activities devoted to the 16th-century Slovene literary language. He is believed to be the author of three types of dictionaries: (1) a lost trilingual glossary with Latin as the source language, compiled for pedagogical use (Nomenclatura trium linguarum, ca. 1580); (2) the index in DB 1584 (perhaps also the shorter one in DB 1578), which contains also dialectal synonyms as well as equivalents from Croatian dialects and thus enables understanding of the central-Slovene lexicon; (3) six Slovene-Latin-German glossaries included in the grammar Arcticae horulae ſucciſivae (1584): the first three described nomina (nouns and adjectives) of all three genders, while the other three described verbs of the main three conjugation types (-am, -em, -im). In terms of their informativity, the glossaries included in the grammar are of several different types. In addition to grammatical information contributing to a better knowledge of the 16th-century Slovene literary language, they also provide lexicological information closely related to the modes of lexicographic presentation. Both features are an important contribution to the beginnings and further development of Slovene lexicography. Slovene headwords are semantically defined by their foreign-language equivalents. In addition, their semantic structure is outlined by one-word or phraseological subentries (with attested meanings, sub-meanings, and terms for specific types, such as štala ‘barn’—volovska štala ‘oxen barn’, ovčja štala ‘sheep barn’, kozja štala ‘goat barn’; set phrases with established use). Entries with subentries and the order of entries itself also bring into focus various types of semantic relations between lexical items, including synonymy, where pairs of loan and native synonyms stand out in particular (e.g. punt—zaveza ‘bond’, gmerati—množiti ‘to multiply’); antonymy (e.g. čast ‘honor’—nečast ‘disgrace’); the semantic difference between the source word and its derivative (e.g. kamen ‘stone’—kamčič ‘little stone’). Verbal entries with added subentries demonstrate various types of formation of aspectual pairs, e.g. sejati—obsejati ‘to sow’, nagniti—nagibati ‘to lean’. In cases where members of the pair differ by the type of action (e.g. gibati se—ganiti se ‘to move’), the semantic difference is added to the aspectual one. A re-comparison of Bohorič’s glossaries and Megiser’s dictionaries—his quadrilingual dictionary with German as the source language (MD 1592) and multilingual dictionary with Latin as the source (MTh 1603)—that included data on how widely the usage of this lexicon was spread, confirmed the hypothesis that the glossaries were one of Megiser’s main lexicographic sources of Slovene equivalents. (This data is one of the results of the complete excerption of the Slovene texts in book publications in the period of 1550–1603). Megiser’s consideration of Bohorič’s glossaries is most clearly evident in approximately 90 words that cannot be found in other works. In addition, the glossaries proved to be a useful reference work for citing widespread and commonly used lexicon. A typological difference between the compared lexicographic works is reflected in differences in their informativity as well as in the number of foreign-language equivalents and included Slovene equivalents. The increased number of Slovene synonyms in Megiser’s multilingual dictionaries was the result of Megiser’s inclusion of the Register (1584) and his ever-improving knowledge of the Slovene literary language and some Slovene dialects (e.g., Carinthian). Keywords: Adam Bohorič, lexicography, glossaries in Bohorič’s grammar, dictionary informativity, Megiser’s multilingual dictionaries

Download Full-text

Lexicon Development for COVID-19-related Concepts Using Open-source Word Embedding Sources: An Intrinsic and Extrinsic Evaluation (Preprint)

10.2196/preprints.21679 ◽

2020 ◽

Author(s):

Soham Parikh ◽

Anahita Davoudi ◽

Shun Yu ◽

Carolina Giraldo ◽

Emily Schriver ◽

...

Keyword(s):

Acute Respiratory Distress Syndrome ◽

Respiratory Distress Syndrome ◽

Respiratory Distress ◽

Open Source ◽

Distress Syndrome ◽

Prediction Models ◽

Word Embedding ◽

Word Embeddings ◽

Clinical Text ◽

Source Word

BACKGROUND Scientists are developing new computational methods and prediction models to better clinically understand COVID-19 prevalence, treatment efficacy, and patient outcomes. These efforts could be improved by leveraging documented COVID-19–related symptoms, findings, and disorders from clinical text sources in an electronic health record. Word embeddings can identify terms related to these clinical concepts from both the biomedical and nonbiomedical domains, and are being shared with the open-source community at large. However, it’s unclear how useful openly available word embeddings are for developing lexicons for COVID-19–related concepts. OBJECTIVE Given an initial lexicon of COVID-19–related terms, this study aims to characterize the returned terms by similarity across various open-source word embeddings and determine common semantic and syntactic patterns between the COVID-19 queried terms and returned terms specific to the word embedding source. METHODS We compared seven openly available word embedding sources. Using a series of COVID-19–related terms for associated symptoms, findings, and disorders, we conducted an interannotator agreement study to determine how accurately the most similar returned terms could be classified according to semantic types by three annotators. We conducted a qualitative study of COVID-19 queried terms and their returned terms to detect informative patterns for constructing lexicons. We demonstrated the utility of applying such learned synonyms to discharge summaries by reporting the proportion of patients identified by concept among three patient cohorts: pneumonia (n=6410), acute respiratory distress syndrome (n=8647), and COVID-19 (n=2397). RESULTS We observed high pairwise interannotator agreement (Cohen kappa) for symptoms (0.86-0.99), findings (0.93-0.99), and disorders (0.93-0.99). Word embedding sources generated based on characters tend to return more synonyms (mean count of 7.2 synonyms) compared to token-based embedding sources (mean counts range from 2.0 to 3.4). Word embedding sources queried using a qualifier term (eg, dry cough or muscle pain) more often returned qualifiers of the similar semantic type (eg, “dry” returns consistency qualifiers like “wet” and “runny”) compared to a single term (eg, cough or pain) queries. A higher proportion of patients had documented fever (0.61-0.84), cough (0.41-0.55), shortness of breath (0.40-0.59), and hypoxia (0.51-0.56) retrieved than other clinical features. Terms for dry cough returned a higher proportion of patients with COVID-19 (0.07) than the pneumonia (0.05) and acute respiratory distress syndrome (0.03) populations. CONCLUSIONS Word embeddings are valuable technology for learning related terms, including synonyms. When leveraging openly available word embedding sources, choices made for the construction of the word embeddings can significantly influence the words learned.

Download Full-text

Blends in Gravity Falls TV Series

Lexicon ◽

10.22146/lexicon.v6i1.50311 ◽

2019 ◽

Vol 6 (1) ◽

Author(s):

Ermi Andriani ◽

Rio Rini Diah Moehkardi

Keyword(s):

Data Source ◽

Blending Process ◽

Source Word

This research investigates the blending process used in Gravity Falls TV series seasons I and 2. It aims to classify blends based on the classification of blends proposed by Mattiello (2013) and interpret the meaning of blends. From the data source, there are fifty-four data considered as blends. The data are categorised in three perspectives, namely: morphotactic, morphonological and graphical, and morphosemantic. The result shows that morphotactically, the most frequently used pattern is partial blend particularly the blends consist of full word followed by splinter with 49 percent data. Then, morphonologically and graphically, non-overlapping type in which neither the graphs nor the sounds of source words are overlapped each other is commonly used in the series with 57 percent of overall data. Finally, morphosemantically, the most used structure with percentage of 63 percent is right headed blend in which the head is the second source word.

Download Full-text

Replacing Linguists with Dummies: A Serious Need for Trivial Baselines in Multi-Task Neural Machine Translation

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2019-0005 ◽

2019 ◽

Vol 113 (1) ◽

pp. 31-40

Author(s):

Daniel Kondratyuk ◽

Ronald Cardenas ◽

Ondřej Bojar

Keyword(s):

Machine Translation ◽

Deep Neural Networks ◽

Neural Machine Translation ◽

Translation Quality ◽

Syntactic Information ◽

Multiple Tasks ◽

Recent Developments ◽

Resource Conditions ◽

Main Message ◽

Source Word

Abstract Recent developments in machine translation experiment with the idea that a model can improve the translation quality by performing multiple tasks, e.g., translating from source to target and also labeling each source word with syntactic information. The intuition is that the network would generalize knowledge over the multiple tasks, improving the translation performance, especially in low resource conditions. We devised an experiment that casts doubt on this intuition. We perform similar experiments in both multi-decoder and interleaving setups that label each target word either with a syntactic tag or a completely random tag. Surprisingly, we show that the model performs nearly as well on uncorrelated random tags as on true syntactic tags. We hint some possible explanations of this behavior. The main message from our article is that experimental results with deep neural networks should always be complemented with trivial baselines to document that the observed gain is not due to some unrelated properties of the system or training effects. True confidence in where the gains come from will probably remain problematic anyway.

Download Full-text

Addressing the Under-Translation Problem from the Entropy Perspective

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.3301451 ◽

2019 ◽

Vol 33 ◽

pp. 451-458

Author(s):

Yang Zhao ◽

Jiajun Zhang ◽

Chengqing Zong ◽

Zhongjun He ◽

Hua Wu

Keyword(s):

Neural Model ◽

Coarse Grained ◽

Training Method ◽

Neural Machine Translation ◽

High Entropy ◽

Fine Grained ◽

Translation Quality ◽

Simple Strategy ◽

Coarse To Fine ◽

Source Word

Neural Machine Translation (NMT) has drawn much attention due to its promising translation performance in recent years. However, the under-translation problem still remains a big challenge. In this paper, we focus on the under-translation problem and attempt to find out what kinds of source words are more likely to be ignored. Through analysis, we observe that a source word with a large translation entropy is more inclined to be dropped. To address this problem, we propose a coarse-to-fine framework. In coarse-grained phase, we introduce a simple strategy to reduce the entropy of highentropy words through constructing the pseudo target sentences. In fine-grained phase, we propose three methods, including pre-training method, multitask method and two-pass method, to encourage the neural model to correctly translate these high-entropy words. Experimental results on various translation tasks show that our method can significantly improve the translation quality and substantially reduce the under-translation cases of high-entropy words.

Download Full-text

source word
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

A multivariate approach to English Clippings

“Brexit” models in the Russian-language Internet discourse

Lexicon Development for COVID-19-related Concepts Using Open-source Word Embedding Sources: An Intrinsic and Extrinsic Evaluation

An Intrinsic and Extrinsic Evaluation of Learned COVID-19 Concepts using Open-Source Word Embedding Sources

Evaluation of Source to Target and Target to Source Word Alignment for English to Hindi

Bohoričev prispevek slovenskemu slovaropisju

Lexicon Development for COVID-19-related Concepts Using Open-source Word Embedding Sources: An Intrinsic and Extrinsic Evaluation (Preprint)

Blends in Gravity Falls TV Series

Replacing Linguists with Dummies: A Serious Need for Trivial Baselines in Multi-Task Neural Machine Translation

Addressing the Under-Translation Problem from the Entropy Perspective

Export Citation Format

source wordRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

A multivariate approach to English Clippings

“Brexit” models in the Russian-language Internet discourse

Lexicon Development for COVID-19-related Concepts Using Open-source Word Embedding Sources: An Intrinsic and Extrinsic Evaluation

An Intrinsic and Extrinsic Evaluation of Learned COVID-19 Concepts using Open-Source Word Embedding Sources

Evaluation of Source to Target and Target to Source Word Alignment for English to Hindi

Bohoričev prispevek slovenskemu slovaropisju

Lexicon Development for COVID-19-related Concepts Using Open-source Word Embedding Sources: An Intrinsic and Extrinsic Evaluation (Preprint)

Blends in Gravity Falls TV Series

Replacing Linguists with Dummies: A Serious Need for Trivial Baselines in Multi-Task Neural Machine Translation

Addressing the Under-Translation Problem from the Entropy Perspective

source word
Recently Published Documents