A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures

2017 ◽ Vol 5 (2) ◽ pp. 6-18
Author(s): Safa Abdul-Jabbar, Loay George
2020 ◽ Vol 12 (04-Special Issue) ◽ pp. 1922-1935
Author(s): Dr. J. Ujwala Rekha, Dr. K. Shahu Chatrapati

2019 ◽ Vol 13 (10) ◽ pp. 26
Author(s): Issa Atoum

Despite the ever-increasing interest in the field of text similarity, the development of adequate text similarity methods is lagging. Some methods are good at detecting entailment, while others are reasonable at measuring the degree to which two texts are similar. Very often, these methods are compared using Pearson's correlation; however, Pearson's correlation is sensitive to outliers, which can distort the final correlation coefficient. As a result, Pearson's correlation is inadequate for determining which text similarity method is better in situations where data items are either very similar or unrelated. This paper borrows the scaled Pearson correlation from the finance domain and builds a metric that can evaluate the performance of similarity methods over cross-sectional datasets. Results showed that the new metric is fine-grained across the range of benchmark dataset scores, making it a promising alternative to Pearson's correlation. Moreover, extrinsic results from applying the System Usability Scale (SUS) questionnaire to the scaled Pearson correlation revealed that the proposed metric is attracting attention from scholars, indicating its potential adoption in academia.
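The abstract does not give the formula for the borrowed metric, so the following is only a sketch of one common formulation of scaled (segmented) Pearson correlation: split the paired score series into non-overlapping segments of a chosen length, compute Pearson's r within each segment, and average the per-segment coefficients. All function names here are illustrative, not the authors' implementation.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # convention: undefined correlation treated as 0
    return cov / (sx * sy)

def scaled_pearson(x, y, scale):
    """Mean of Pearson coefficients over non-overlapping segments of
    length `scale`; leftover points shorter than one segment are dropped."""
    k = len(x) // scale
    rs = [pearson(x[i * scale:(i + 1) * scale],
                  y[i * scale:(i + 1) * scale]) for i in range(k)]
    return sum(rs) / len(rs)
```

Because each segment is standardized separately, a single outlying pair can only distort the one segment containing it, rather than the whole coefficient, which is the property the abstract exploits.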


2019 ◽ Vol 9 (9) ◽ pp. 1870
Author(s): Pavel Stefanovič, Olga Kurasova, Rokas Štrimaitis

In this paper, a word-level n-gram-based approach is proposed for finding the similarity between texts. The approach combines two separate and independent techniques: the self-organizing map (SOM) and text similarity measures. The SOM is unique in that the results of data clustering, as well as dimensionality reduction, are presented in a visual form. Four measures have been evaluated: cosine, Dice, extended Jaccard, and overlap. First, the texts have to be converted to a numerical representation. For that purpose, each text is split into word-level n-grams, from which a bag of n-grams is created. The n-gram frequencies are calculated, and a frequency matrix for the dataset is formed. Various filters are applied when creating the bag of n-grams: stemming algorithms, number and punctuation removal, stop words, etc. All experimental investigation has been carried out on a corpus of plagiarized short answers.
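The pipeline above (filtering, bag of word-level n-grams, pairwise similarity) can be sketched as follows, assuming the standard definitions of the four measures on n-gram frequency vectors; "extended Jaccard" is taken to be the Tanimoto generalization. The tokenizer and function names are illustrative, not the authors' code.

```python
import math
import re
from collections import Counter

def word_ngrams(text, n):
    # Lowercase and keep alphabetic runs only (a simple stand-in for the
    # number/punctuation-removal filters), then build word-level n-grams.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dice(a, b):
    inter = sum((a & b).values())  # multiset intersection size
    return 2 * inter / (sum(a.values()) + sum(b.values()))

def extended_jaccard(a, b):
    # Tanimoto: dot(a, b) / (|a|^2 + |b|^2 - dot(a, b))
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    na2 = sum(v * v for v in a.values())
    nb2 = sum(v * v for v in b.values())
    return dot / (na2 + nb2 - dot)

def overlap(a, b):
    inter = sum((a & b).values())
    return inter / min(sum(a.values()), sum(b.values()))
```

Stacking the `word_ngrams` vectors of all documents row-wise yields the frequency matrix the abstract mentions, which can then be fed to a SOM for clustering and visualization.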

