A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures

2017 ◽ Vol 5 (2) ◽ pp. 6-18
Author(s): Safa Abdul-Jabbar, Loay George
2020 ◽ Vol 12 (04-Special Issue) ◽ pp. 1922-1935
Author(s): Dr. J. Ujwala Rekha, Dr. K. Shahu Chatrapati

2019 ◽ Vol 13 (10) ◽ pp. 26
Author(s): Issa Atoum

Despite the ever-increasing interest in the field of text similarity, the development of adequate text similarity methods is lagging. Some methods are good at detecting entailment, while others are reasonable at measuring the degree to which two texts are similar. Very often, these methods are compared using Pearson's correlation; however, Pearson's correlation is sensitive to outliers, which can distort the final correlation coefficient. As a result, Pearson's correlation is inadequate for determining which text similarity method is better in situations where data items are either very similar or unrelated. This paper borrows the scaled Pearson correlation from the finance domain and builds a metric that can evaluate the performance of similarity methods over cross-sectional datasets. Results showed that the new metric is fine-grained across the range of benchmark dataset scores, making it a promising alternative to Pearson's correlation. Moreover, extrinsic results from applying the System Usability Scale (SUS) questionnaire to the scaled Pearson correlation revealed that the proposed metric is attracting attention from scholars, indicating its potential adoption in academia.
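The abstract does not give the formula for the borrowed metric, so the following is only a sketch of one common formulation of scaled (segmented) Pearson correlation: split the paired score series into non-overlapping segments of a chosen length, compute Pearson's r within each segment, and average the per-segment coefficients. All function names here are illustrative, not the authors' implementation.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # convention: undefined correlation treated as 0
    return cov / (sx * sy)

def scaled_pearson(x, y, scale):
    """Mean of Pearson coefficients over non-overlapping segments of
    length `scale`; leftover points shorter than one segment are dropped."""
    k = len(x) // scale
    rs = [pearson(x[i * scale:(i + 1) * scale],
                  y[i * scale:(i + 1) * scale]) for i in range(k)]
    return sum(rs) / len(rs)
```

Because each segment is standardized separately, a single outlying pair can only distort the one segment containing it, rather than the whole coefficient, which is the property the abstract exploits.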


2019 ◽ Vol 9 (9) ◽ pp. 1870
Author(s): Pavel Stefanovič, Olga Kurasova, Rokas Štrimaitis

In this paper, a word-level n-gram-based approach is proposed for finding the similarity between texts. The approach combines two separate and independent techniques: the self-organizing map (SOM) and text similarity measures. The SOM is unique in that the results of data clustering, as well as dimensionality reduction, are presented in a visual form. Four measures have been evaluated: cosine, Dice, extended Jaccard, and overlap. First, the texts have to be converted to a numerical representation. For that purpose, each text is split into word-level n-grams, from which a bag of n-grams is created. The n-gram frequencies are calculated, and a frequency matrix for the dataset is formed. Various filters are applied when creating the bag of n-grams: stemming algorithms, number and punctuation removal, stop words, etc. All experimental investigation has been carried out on a corpus of plagiarized short answers.
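The pipeline above (filtering, bag of word-level n-grams, pairwise similarity) can be sketched as follows, assuming the standard definitions of the four measures on n-gram frequency vectors; "extended Jaccard" is taken to be the Tanimoto generalization. The tokenizer and function names are illustrative, not the authors' code.

```python
import math
import re
from collections import Counter

def word_ngrams(text, n):
    # Lowercase and keep alphabetic runs only (a simple stand-in for the
    # number/punctuation-removal filters), then build word-level n-grams.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dice(a, b):
    inter = sum((a & b).values())  # multiset intersection size
    return 2 * inter / (sum(a.values()) + sum(b.values()))

def extended_jaccard(a, b):
    # Tanimoto: dot(a, b) / (|a|^2 + |b|^2 - dot(a, b))
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    na2 = sum(v * v for v in a.values())
    nb2 = sum(v * v for v in b.values())
    return dot / (na2 + nb2 - dot)

def overlap(a, b):
    inter = sum((a & b).values())
    return inter / min(sum(a.values()), sum(b.values()))
```

Stacking the `word_ngrams` vectors of all documents row-wise yields the frequency matrix the abstract mentions, which can then be fed to a SOM for clustering and visualization.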

