Pipeline for a Data-driven Network of Linguistic Terms

2021
Author(s):  
Søren Wichmann

The present work is aimed at (1) developing a search engine adapted to the large DReaM corpus of linguistic descriptive literature and (2) gaining insight into how a data-driven ontology of linguistic terminology might be built. Starting from close to 20,000 text documents from the literature of language descriptions, either born digital or scanned and OCR'd, we extract keywords and pass them through a pruning pipeline in which mainly keywords belonging to linguistic terminology survive. Subsequently we quantify relations among those terms using Normalized Pointwise Mutual Information (NPMI) and use the resulting measures, in conjunction with Google PageRank, to build networks of linguistic terms.
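The association step of such a pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the toy keyword sets and document-level co-occurrence counting are assumptions standing in for the paper's 20,000-document corpus. NPMI rescales pointwise mutual information into [-1, 1]; the resulting weighted term pairs would then serve as edges for a PageRank computation over the term network.

```python
import math
from collections import Counter
from itertools import combinations

# Toy "documents": each is the set of surviving keywords extracted from one text
# (hypothetical terms chosen for illustration).
docs = [
    {"ergative", "absolutive", "case"},
    {"ergative", "case", "agreement"},
    {"tone", "downstep", "pitch"},
    {"tone", "pitch", "case"},
]

n_docs = len(docs)
term_df = Counter(t for d in docs for t in d)                      # document frequency per term
pair_df = Counter(p for d in docs for p in combinations(sorted(d), 2))

def npmi(x, y):
    """Normalized PMI in [-1, 1], estimated from document co-occurrence."""
    p_xy = pair_df[tuple(sorted((x, y)))] / n_docs
    if p_xy == 0:
        return -1.0                       # never co-occur: minimum association
    p_x, p_y = term_df[x] / n_docs, term_df[y] / n_docs
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)          # normalize by -log p(x, y)

# Rank candidate edges by association strength; these weighted pairs
# would feed the network construction and PageRank step.
edges = sorted(pair_df, key=lambda p: npmi(*p), reverse=True)
```

Here terms that share more documents than their individual frequencies predict (e.g. "ergative" and "case" above) rise to the top of the edge list.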

2021
Author(s):  
Gourab Das

LitRev is a novel, robust, data-driven approach developed for quick literature review on a particular topic of interest. The method identifies common biological phrases that follow a power-law distribution, as well as important phrases whose normalized pointwise mutual information (NPMI) score is greater than zero.
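The NPMI-greater-than-zero filter can be illustrated on adjacent word pairs. This is a sketch under simplifying assumptions: whitespace tokenization and bigram counting stand in for LitRev's actual phrase extraction, and the toy text stands in for a biomedical corpus. A positive NPMI means the two words co-occur more often than their individual frequencies predict.

```python
import math
from collections import Counter

# Toy corpus (assumption: a stand-in for the biomedical abstracts LitRev mines).
tokens = ("gene expression profile shows gene expression change while "
          "random words drift gene expression again").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def npmi_bigram(w1, w2):
    """NPMI of an adjacent word pair; -1 if the pair never occurs."""
    p_xy = bigrams[(w1, w2)] / n_bi
    if p_xy == 0:
        return -1.0
    p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# Keep only "important" phrases: NPMI strictly greater than zero.
important = [bg for bg in bigrams if npmi_bigram(*bg) > 0]
```

The recurring collocation "gene expression" scores higher than incidental neighbours such as "expression profile", so it survives the filter.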


2013
Vol 20 (2)
pp. 185-234
Author(s):  
AKIRA UTSUMI

Abstract: This study examines the ability of a semantic space model to represent the meaning of noun compounds such as ‘information gathering’ or ‘heart disease.’ For a semantic space model to compute the meaning and the attributional similarity (or semantic relatedness) of unfamiliar noun compounds that do not occur in a corpus, the vector for a noun compound must be computed from the vectors of its constituent words using vector composition algorithms. Six composition algorithms (i.e., centroid, multiplication, circular convolution, predication, comparison, and dilation) are compared in terms of the quality of the computed attributional similarity for English and Japanese noun compounds. To evaluate the similarity computation, this study uses three tasks (i.e., related word ranking, similarity correlation, and semantic classification) and two types of semantic spaces (i.e., latent semantic analysis-based and positive pointwise mutual information-based spaces). The results of these tasks show that the dilation algorithm is generally most effective in computing the similarity of noun compounds, while the multiplication algorithm is best suited specifically to the positive pointwise mutual information-based space. In addition, the comparison algorithm works better for unfamiliar noun compounds that do not occur in the corpus. These findings indicate that a semantic space model in general, and the dilation, multiplication, and comparison algorithms in particular, have sufficient ability to compute the attributional similarity of noun compounds.
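Three of the composition algorithms named above can be sketched directly. This is an illustrative implementation, not the paper's code: the 3-d vectors are toy stand-ins for rows of an LSA or PPMI space (real spaces have hundreds of dimensions), and the dilation formula follows the widely used Mitchell & Lapata definition, which this line of work typically assumes.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid(u, v):
    """Componentwise average of the constituent word vectors."""
    return [(a + b) / 2 for a, b in zip(u, v)]

def multiplication(u, v):
    """Componentwise product (the multiplicative composition model)."""
    return [a * b for a, b in zip(u, v)]

def dilation(u, v, lam=2.0):
    """Stretch v along the direction of u by factor lam:
    p = (u.u)v + (lam - 1)(u.v)u."""
    uu, uv = dot(u, u), dot(u, v)
    return [uu * b + (lam - 1) * uv * a for a, b in zip(u, v)]

def cosine(u, v):
    """Attributional similarity between composed (or atomic) vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

# Hypothetical constituent vectors for 'heart' and 'disease'.
heart, disease = [0.9, 0.1, 0.3], [0.2, 0.8, 0.4]
hd = dilation(heart, disease)        # vector for the unseen compound 'heart disease'
sim = cosine(hd, [0.5, 0.5, 0.4])    # compare against some other term's vector
```

The composed vector `hd` can then be ranked against other word vectors by cosine similarity, which is how the related word ranking task above is scored.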

