Text Similarity Measures
Recently Published Documents


TOTAL DOCUMENTS

14
(FIVE YEARS 0)

H-INDEX

3
(FIVE YEARS 0)

2020 ◽  
Vol 12 (04-SPECIAL ISSUE) ◽  
pp. 1922-1935
Author(s):  
Dr. J. Ujwala Rekha ◽  
Dr. K. Shahu Chatrapati

2019 ◽  
Vol 13 (10) ◽  
pp. 26
Author(s):  
Issa Atoum

Despite ever-increasing interest in the field, the development of adequate text similarity methods is lagging. Some methods perform well on entailment, while others better capture the degree to which two texts are similar. Very often, these methods are compared using Pearson's correlation; however, Pearson's correlation is sensitive to outliers, which can distort the final coefficient. As a result, Pearson's correlation is inadequate for determining which text similarity method is better in situations where data items are very similar or entirely unrelated. This paper borrows the scaled Pearson correlation from the finance domain and builds a metric that can evaluate the performance of similarity methods over cross-sectional datasets. Results showed that the new metric is fine-grained with respect to the benchmark dataset's score range, making it a promising alternative to Pearson's correlation. Moreover, extrinsic results from applying the System Usability Scale (SUS) questionnaire to the scaled Pearson correlation revealed that the proposed metric is gaining attention from scholars, which supports its use in academia.
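The abstract does not quote the formula, but scaled correlation, as known from the finance and signal-processing literature, splits a pair of series into fixed-size segments, computes Pearson's correlation within each segment, and averages the results, so any single outlier can only distort the one segment that contains it. A minimal sketch under that assumption (function names are illustrative, not the paper's):

```python
import math

def pearson(x, y):
    # Plain Pearson correlation coefficient (assumes non-constant inputs).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def scaled_pearson(x, y, scale):
    # Average of per-segment Pearson correlations over non-overlapping
    # windows of `scale` points; an outlier affects only its own segment.
    rs = [pearson(x[i:i + scale], y[i:i + scale])
          for i in range(0, len(x) - scale + 1, scale)]
    return sum(rs) / len(rs)
```

On perfectly linear data the two coefficients agree; on data containing one extreme outlier, the scaled version stays closer to the trend of the remaining points.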


2019 ◽  
Vol 9 (9) ◽  
pp. 1870 ◽  
Author(s):  
Pavel Stefanovič ◽  
Olga Kurasova ◽  
Rokas Štrimaitis

In this paper, a word-level n-gram-based approach is proposed to find the similarity between texts. The approach combines two separate and independent techniques: the self-organizing map (SOM) and text similarity measures. SOM's uniqueness is that the results of data clustering, as well as dimensionality reduction, are presented in a visual form. Four measures were evaluated: cosine, Dice, extended Jaccard, and overlap. First, the texts must be converted to a numerical representation: each text is split into word-level n-grams, from which a bag of n-grams is created. The n-gram frequencies are then calculated and the frequency matrix of the dataset is formed. Various filters are applied when creating the bag of n-grams: stemming algorithms, number and punctuation removal, stop-word lists, etc. All experimental investigation was carried out on a corpus of plagiarized short answers.
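The pipeline described above (split into word-level n-grams, build a bag of n-grams, compare frequency vectors) can be sketched as follows. The vector forms of Dice, extended Jaccard, and overlap below are the standard dot-product generalizations; this is an assumption, since the paper's exact formulas are not quoted here:

```python
from collections import Counter
import math

def ngram_counts(text, n=2):
    # Bag of word-level n-grams with their frequencies.
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def _dot(a, b):
    return sum(a[k] * b[k] for k in a.keys() & b.keys())

def _sq(a):
    return sum(v * v for v in a.values())

def cosine(a, b):
    return _dot(a, b) / (math.sqrt(_sq(a)) * math.sqrt(_sq(b)))

def dice(a, b):
    return 2 * _dot(a, b) / (_sq(a) + _sq(b))

def ext_jaccard(a, b):
    d = _dot(a, b)
    return d / (_sq(a) + _sq(b) - d)

def overlap(a, b):
    return _dot(a, b) / min(_sq(a), _sq(b))
```

Identical texts score 1.0 under all four measures; texts sharing no n-gram score 0.0.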


Author(s):  
Goran Šimić

This chapter is about document and data clustering as a process of preparing the information resources stored in e-government systems for advanced search. These resources are mainly textual data, stored either as field values in databases or as documents in file repositories. As their number grows, searching for specific information takes more time. Different techniques are used for this purpose, most of which involve information retrieval based on a variety of text similarity measures. The cost of such processing depends on how the resources are prepared for searching. Clustering is the most commonly used preparation technique, and that fact is the basic motive for this chapter.
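The chapter names clustering as the preparation step without fixing an algorithm. As one minimal illustration (the single-pass "leader" scheme, the threshold, and all names are assumptions, not the chapter's method), documents can be grouped by token-set Jaccard similarity so that a query is first matched against a few cluster leaders and then only the winning cluster is scanned:

```python
def jaccard(a, b):
    # Jaccard similarity of two token sets.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def tokens(doc):
    return set(doc.lower().split())

def leader_cluster(docs, threshold=0.25):
    # Single-pass clustering: join the first cluster whose leader is
    # similar enough, otherwise start a new cluster.
    clusters = []  # list of (leader_token_set, member_indices)
    for i, doc in enumerate(docs):
        t = tokens(doc)
        for leader, members in clusters:
            if jaccard(t, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((t, [i]))
    return clusters

def search(query, docs, clusters):
    # Compare the query to cluster leaders first, then scan only the
    # best cluster's members; returns the index of the best document.
    q = tokens(query)
    leader, members = max(clusters, key=lambda c: jaccard(q, c[0]))
    return max(members, key=lambda i: jaccard(q, tokens(docs[i])))
```

With clusters precomputed, each query touches one leader per cluster plus a single cluster's members instead of the whole repository, which is where the preparation cost pays off.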

