Evolution of Semantic Similarity—A Survey

Dhivya Chandrasekaran; Vijay Mago

doi:10.1145/3440755

ACM Computing Surveys, doi:10.1145/3440755, 2021, Vol 54 (2), pp. 1-37

Author(s): Dhivya Chandrasekaran, Vijay Mago

Keyword(s): Natural Language, Semantic Similarity, Language Processing, Hybrid Methods, Research Work, Similarity Measures, Text Data, Knowledge Based, Open Research, Research Problems

Estimating the semantic similarity between text data is one of the challenging and open research problems in the field of Natural Language Processing (NLP). The versatility of natural language makes it difficult to define rule-based methods for determining semantic similarity measures. To address this issue, various semantic similarity methods have been proposed over the years. This survey article traces the evolution of such methods beginning from traditional NLP techniques such as kernel-based methods to the most recent research work on transformer-based models, categorizing them based on their underlying principles as knowledge-based, corpus-based, deep neural network–based methods, and hybrid methods. Discussing the strengths and weaknesses of each method, this survey provides a comprehensive view of existing systems in place for new researchers to experiment and develop innovative ideas to address the issue of semantic similarity.

LIS4: Lesk Inspired Sense Specific Semantic Similarity using WordNet

Journal of Information & Knowledge Management, doi:10.1142/s0219649221500064, 2021, pp. 2150006

Author(s): Saravanakumar Kandasamy, Aswani Kumar Cherukuri

Keyword(s): Information Retrieval, Natural Language Processing, Natural Language, Semantic Similarity, Language Processing, Gold Standard, Question Answering, Knowledge Based, Benchmark Datasets, Processing Information

Quantifying semantic similarity between concepts is an essential part of domains such as Natural Language Processing, Information Retrieval, and Question Answering, where it helps in understanding texts and their relationships. Over the last few decades, many measures have been proposed that incorporate various corpus-based and knowledge-based resources. WordNet and Wikipedia are two such knowledge-based resources. WordNet's contribution to these domains is enormous because of the richness with which it defines a word and all of its relationships with other words. In this paper, we propose an approach to quantifying the similarity between concepts that exploits the synsets and gloss definitions of different concepts in WordNet. Our method considers the gloss definitions, the contextual words that help define a word, the synsets of those contextual words, and the confidence of occurrence of a word in another word's definition when calculating similarity. Evaluation on different gold-standard benchmark datasets shows the efficiency of our system in comparison with other existing taxonomical and definitional measures.
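The general idea of comparing concepts through the words in their definitions can be illustrated with a minimal sketch. This is a simplified Lesk-style gloss overlap, not the LIS4 measure itself: the glosses in `GLOSSES`, the stopword list, and the function names are all hypothetical, and a real system would read glosses and synsets from WordNet.

```python
import re

# Hypothetical mini-glossary standing in for WordNet gloss definitions.
GLOSSES = {
    "car": "a motor vehicle with four wheels used for transporting passengers",
    "bus": "a large motor vehicle used for transporting many passengers by road",
    "banana": "a long curved fruit with a yellow skin",
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "with", "for", "by", "of", "used"}


def gloss_tokens(word):
    """Return the set of content words in the gloss of `word`."""
    tokens = re.findall(r"[a-z]+", GLOSSES[word].lower())
    return {t for t in tokens if t not in STOPWORDS}


def gloss_overlap(word1, word2):
    """Jaccard overlap between the content words of two glosses."""
    a, b = gloss_tokens(word1), gloss_tokens(word2)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(gloss_overlap("car", "bus"))
    print(gloss_overlap("car", "banana"))
```

In this toy glossary, "car" and "bus" share several definition words while "car" and "banana" share none, so the first pair scores higher. The abstract describes additional signals (synsets of contextual words, confidence of occurrence) that this sketch does not model.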

Knowledge-based sentence semantic similarity: algebraical properties

Progress in Artificial Intelligence, doi:10.1007/s13748-021-00248-0, 2021

Author(s): Mourad Oussalah, Muhidin Mohamed

Keyword(s): Information Retrieval, Natural Language Processing, Natural Language, Semantic Similarity, Language Processing, Similarity Measures, Canonical Extension, Similarity Score, Semantic Similarity Measure, Sentence Similarity

Determining the extent to which two text snippets are semantically equivalent is a well-researched topic in the areas of natural language processing, information retrieval and text summarization. Sentence-to-sentence similarity scoring is extensively used in both generic and query-based summarization of documents as a significance or similarity indicator. Nevertheless, most of these applications use the semantic similarity measure only as a tool, without paying attention to the inherent properties of such tools, which ultimately restrict the scope and technical soundness of the underlying applications. This paper aims to help fill this gap. It investigates three popular WordNet hierarchical semantic similarity measures, namely path-length, Wu and Palmer, and Leacock and Chodorow, in terms of both algebraical and intuitive properties, highlighting their inherent limitations and theoretical constraints. We have especially examined properties related to the range and scope of the semantic similarity score, incremental monotonicity evolution, and monotonicity with respect to the hyponymy/hypernymy relationship, as well as a set of interactive properties. The extension from word semantic similarity to sentence similarity has also been investigated using a pairwise canonical extension. Properties of the underlying sentence-to-sentence similarity are examined and scrutinized. Next, to overcome inherent limitations of WordNet semantic similarity in accounting for various part-of-speech word categories, a WordNet “All word-To-Noun conversion” that makes use of the Categorial Variation Database (CatVar) is put forward and evaluated on a publicly available dataset, with a comparison against some state-of-the-art methods. The findings demonstrate the feasibility of the proposal and open up new opportunities in information retrieval and natural language processing tasks.
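The three hierarchical measures named in this abstract can be sketched on a toy taxonomy. The code below uses commonly cited formulations and is not taken from the paper: the `PARENT` tree is hypothetical, depth is counted in nodes so the root has depth 1, path similarity is taken as 1 / (1 + path length in edges), and Leacock and Chodorow is taken as -log((path length + 1) / (2 × maximum depth)). Other conventions exist and change the numbers.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not real WordNet data.
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
    "tree": "plant",
}


def ancestors(node):
    """Return the chain from `node` up to the root, with `node` first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(ancestors(node))


def lcs(a, b):
    """Lowest common subsumer: the deepest ancestor shared by a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    raise ValueError("nodes do not share a root")


def path_length(a, b):
    """Number of edges on the path from a to b through their LCS."""
    common = depth(lcs(a, b))
    return (depth(a) - common) + (depth(b) - common)


MAX_DEPTH = max(depth(n) for n in PARENT)


def path_sim(a, b):
    """Path-length similarity: 1 / (1 + edges between a and b)."""
    return 1.0 / (1 + path_length(a, b))


def wup_sim(a, b):
    """Wu and Palmer: 2 * depth(LCS) / (depth(a) + depth(b))."""
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


def lch_sim(a, b):
    """Leacock and Chodorow: -log((edges + 1) / (2 * max taxonomy depth))."""
    return -math.log((path_length(a, b) + 1) / (2.0 * MAX_DEPTH))


if __name__ == "__main__":
    for pair in [("dog", "cat"), ("dog", "tree"), ("dog", "dog")]:
        print(pair, path_sim(*pair), wup_sim(*pair), lch_sim(*pair))
```

Under these conventions, path and Wu and Palmer similarity both reach 1 for identical concepts, while the Leacock and Chodorow score for "dog" with itself is log(8) in this tree, which exceeds 1. This is one simple example of the range differences that the abstract refers to.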

Automated Identification of Semantic Similarity between Concepts of Textual Business Rules

International Journal of Intelligent Engineering and Systems, doi:10.22266/ijies2021.0228.15, 2021, Vol 14 (1), pp. 147-156

Author(s): Abdellatif Haj, Youssef Balouki, Taoufiq Gadi, ...

Keyword(s): Natural Language Processing, Natural Language, Semantic Similarity, Language Processing, Business Rules, Automated Identification, Standard Format, Knowledge Based, Special Case

Business Rules (BR) are usually written by different stakeholders, which makes them prone to containing different designations for the same concept. Such a problem can be the source of poorly orchestrated behaviors. Yet the identification of synonyms is manual or totally neglected in most approaches dealing with natural language Business Rules. In this paper, we present an automated approach to identifying semantic similarity between terms in textual BR using Natural Language Processing and a knowledge-based algorithm refined using heuristics. Our method is unique in that it also identifies abbreviations/expansions (as a special case of synonymy), which is not possible using a dictionary. Results are then saved in a standard format (SBVR) for reusability purposes. Our approach was applied to more than 160 BR statements divided into three cases, with an accuracy between 69% and 87%, which suggests it to be an indispensable enhancement for other methods dealing with textual BR.

Similarity of Sentences With Contradiction Using Semantic Similarity Measures

The Computer Journal, doi:10.1093/comjnl/bxaa100, 2020

Author(s): M Krishna Siva Prasad, Poonam Sharma

Keyword(s): Natural Language Processing, Natural Language, Semantic Similarity, Language Processing, Word Order, Similarity Measures, Semantic Features, Short Text, Sentence Similarity, Expert Ratings

Short text or sentence similarity is crucial in various natural language processing activities. Traditional measures for sentence similarity consider word order, semantic features and role annotations of text to derive the similarity. These measures do not suit short texts or sentences with negation. Hence, this paper proposes an approach to determine the semantic similarity of sentences and also presents an algorithm to handle negation. In sentence similarity, word pair similarity plays a significant role. Hence, this paper also discusses the similarity between word pairs. Existing semantic similarity measures do not handle antonyms accurately. Hence, this paper proposes an algorithm to handle antonyms. This paper also presents an antonym dataset with 111 word pairs and corresponding expert ratings. The existing semantic similarity measures are tested on the dataset. The results of the correlation proved that the expert ratings are in order with the correlation obtained from the semantic similarity measures. Sentence similarity is handled by proposing two algorithms. The first algorithm deals with typical sentences, and the second deals with contradiction in sentences. The SICK dataset, which has sentences with negation, is used for handling sentence similarity. The algorithm helped in improving the results of sentence similarity.

A Comparison of Approaches for Measuring the Semantic Similarity of Short Texts Based on Word Embeddings

Journal of information and organizational sciences, doi:10.31341/jios.44.2.2, 2020, Vol 44 (2), pp. 231-246

Author(s): Karlo Babić, Francesco Guerra, Sanda Martinčić-Ipšić, Ana Meštrović

Keyword(s): Natural Language Processing, Natural Language, Semantic Similarity, Language Processing, Similarity Measures, Vital Role, The Other, Word Embeddings, Spearman Correlation, Word Senses

Measuring the semantic similarity of texts has a vital role in various tasks from the field of natural language processing. In this paper, we describe a set of experiments we carried out to evaluate and compare the performance of different approaches for measuring the semantic similarity of short texts. We perform a comparison of four models based on word embeddings: two variants of Word2Vec (one based on Word2Vec trained on a specific dataset and the second extending it with embeddings of word senses), FastText, and TF-IDF. Since these models provide word vectors, we experiment with various methods that calculate the semantic similarity of short texts based on word vectors. More precisely, for each of these models, we test five methods for aggregating word embeddings into a text embedding. We introduce three methods by making variations of two commonly used similarity measures. One method is an extension of the cosine similarity based on centroids, and the other two methods are variations of the Okapi BM25 function. We evaluate all approaches on two publicly available datasets, SICK and Lee, in terms of Pearson and Spearman correlation. The results indicate that the extended methods perform better than the original ones in most cases.
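The baseline idea of aggregating word vectors into a text vector and comparing texts by cosine similarity can be sketched as follows. This shows only the plain centroid baseline, not the extended centroid method or the BM25 variations the authors introduce. The three-dimensional vectors in `VECTORS` are made up for illustration; a real experiment would load Word2Vec or FastText embeddings.

```python
import math

# Hypothetical 3-dimensional word vectors; real embeddings have hundreds of dimensions.
VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.7, 0.3, 0.1],
    "sits": [0.1, 0.9, 0.1],
    "runs": [0.2, 0.8, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "falls": [0.1, 0.3, 0.8],
}


def centroid(tokens):
    """Average the vectors of all tokens that have an embedding."""
    known = [VECTORS[t] for t in tokens if t in VECTORS]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_similarity(text_a, text_b):
    """Cosine similarity between the centroid vectors of two short texts."""
    return cosine(centroid(text_a.lower().split()),
                  centroid(text_b.lower().split()))


if __name__ == "__main__":
    print(text_similarity("cat sits", "dog runs"))
    print(text_similarity("cat sits", "stock falls"))
```

With these toy vectors, "cat sits" is far closer to "dog runs" than to "stock falls". Evaluating such scores against human ratings with Pearson and Spearman correlation, as the abstract describes, would be a separate step.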

A Natural Language Processing Approach to Measuring Treatment Adherence and Consistency Using Semantic Similarity

AERA Open, doi:10.1177/23328584211028615, 2021, Vol 7, pp. 233285842110286

Author(s): Kylie L. Anglin, Vivian C. Wong, Arielle Boguslav

Keyword(s): Natural Language Processing, Natural Language, Semantic Similarity, Language Processing, Intervention Implementation, Proof Of Concept, Coaching Intervention, Processing Techniques, Teacher Coaching, The Impact

Though there is widespread recognition of the importance of implementation research, evaluators often face intense logistical, budgetary, and methodological challenges in their efforts to assess intervention implementation in the field. This article proposes a set of natural language processing techniques called semantic similarity as an innovative and scalable method of measuring implementation constructs. Semantic similarity methods are an automated approach to quantifying the similarity between texts. By applying semantic similarity to transcripts of intervention sessions, researchers can use the method to determine whether an intervention was delivered with adherence to a structured protocol, and the extent to which an intervention was replicated with consistency across sessions, sites, and studies. This article provides an overview of semantic similarity methods, describes their application within the context of educational evaluations, and provides a proof of concept using an experimental study of the impact of a standardized teacher coaching intervention.
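The two constructs described here, adherence to a protocol and consistency across sessions, can be illustrated with a very small sketch. It assumes a plain bag-of-words cosine as a stand-in for whichever semantic similarity method a study actually uses, and the function names are hypothetical rather than taken from the article.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(counts_a, counts_b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def adherence(protocol, transcript):
    """Similarity of one session transcript to the scripted protocol."""
    return cosine(term_counts(protocol), term_counts(transcript))


def consistency(transcripts):
    """Mean pairwise similarity across session transcripts."""
    pairs = list(combinations(transcripts, 2))
    if not pairs:
        raise ValueError("need at least two transcripts")
    return sum(cosine(term_counts(a), term_counts(b)) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    protocol = "Praise one strength then set one goal for the next lesson"
    session = "I want to praise one strength and set a goal for your next lesson"
    print(adherence(protocol, session))
    print(consistency([session, protocol, session]))
```

A higher adherence score means the session wording stays closer to the script, and a higher consistency score means sessions resemble one another. Word-count cosine ignores synonyms and paraphrase, so an embedding-based similarity would be the natural replacement in practice.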

How Language Shapes Prejudice Against Women: An Examination Across 45 World Languages

doi:10.31234/osf.io/mrbcf, 2020

Author(s): David DeFranza, Himanshu Mishra, Arul Mishra

Keyword(s): Natural Language Processing, Natural Language, Language Processing, Ongoing Debate, Text Data, Gender Prejudice, World Languages, The World, Present Context, The Common

Language provides an ever-present context for our cognitions and has the ability to shape them. Languages across the world can be gendered (language in which the form of noun, verb, or pronoun is presented as female or male) versus genderless. In an ongoing debate, one stream of research suggests that gendered languages are more likely to display gender prejudice than genderless languages. However, another stream of research suggests that language does not have the ability to shape gender prejudice. In this research, we contribute to the debate by using a Natural Language Processing (NLP) method which captures the meaning of a word from the context in which it occurs. Using text data from Wikipedia and the Common Crawl project (which contains text from billions of publicly facing websites) across 45 world languages, covering the majority of the world’s population, we test for gender prejudice in gendered and genderless languages. We find that gender prejudice occurs more in gendered rather than genderless languages. Moreover, we examine whether genderedness of language influences the stereotypic dimensions of warmth and competence utilizing the same NLP method.

Sentiment of App with Word Vectors

International Journal of Engineering and Advanced Technology - Regular Issue, doi:10.35940/ijeat.f1416.0986s319, 2019, Vol 8 (6S3), pp. 2156-2159

Keyword(s): Natural Language Processing, Natural Language, Sentiment Analysis, Language Processing, Text Data, Vector Representations, Text Sentiment Analysis

Vector representations for language have been shown to be useful in a number of Natural Language Processing tasks. In this paper, we aim to investigate the effectiveness of word vector representations for the problem of Sentiment Analysis. In particular, we target three sub-tasks, namely sentiment word extraction, detection of the polarity of sentiment words, and text sentiment prediction. We investigate the effectiveness of vector representations over different text data and evaluate the quality of domain-dependent vectors. Vector representations have been used to compute various vector-based features, and systematic experiments have been conducted to demonstrate their effectiveness. Using simple vector-based features can achieve better results for text sentiment analysis of APP.

SEMblog

Ontology-Based Applications for Enterprise Systems and Knowledge Management - Advances in Knowledge Acquisition, Transfer, and Management, doi:10.4018/978-1-4666-1993-7.ch012, 2013, pp. 210-223

Author(s): Azleena Mohd Kassim, Yu-N Cheah

Keyword(s): Information Technology, Knowledge Management, Natural Language Processing, Natural Language, Language Processing, Human Intervention, Knowledge Based, Search Mechanism, Management Policies, Knowledge Identification

Information Technology (IT) is often employed to put knowledge management policies into operation. However, many of these tools require human intervention when it comes to deciding how the knowledge is to be managed. The Semantic Web may be an answer to this issue, but many Semantic Web tools are not readily available for the regular IT user. Another problem that arises is that typical efforts to apply or reuse knowledge via a search mechanism do not necessarily link to other pages that are relevant. Blogging systems appear to address some of these challenges, but the browsing experience can be further enhanced by providing links to other relevant posts. In this chapter, the authors present a semantic blogging tool called SEMblog to identify, organize, and reuse knowledge based on the Semantic Web and ontologies. The SEMblog methodology brings together technologies such as Natural Language Processing (NLP), Semantic Web representations, and the ubiquity of the blogging environment to produce a more intuitive way to manage knowledge, especially in the areas of knowledge identification, organization, and reuse. Based on detailed comparisons with other similar systems, the uniqueness of SEMblog lies in its ability to automatically generate keywords and semantic links.

Knowledge-Based Task Planning Using Natural Language Processing for Robotic Manufacturing

Volume 3: 30th Computers and Information in Engineering Conference, Parts A and B, doi:10.1115/detc2010-29123, 2010, Cited By ~ 1

Author(s): Iraj Mantegh, Nazanin S. Darbandi

Keyword(s): Natural Language Processing, Natural Language, Programming Languages, Language Processing, New Method, Robot Programming, Task Planning, End User, Knowledge Based, Manufacturing Applications

Robotic alternatives to many manual operations fall short in application because of the difficulty of capturing the manual skill of an expert operator. One of the main problems to be solved if robots are to become flexible enough for various manufacturing needs is that of end-user programming. An end user with little or no technical expertise in robotics needs to be able to efficiently communicate a manufacturing task to the robot. This paper proposes a new method for robot task planning using some concepts of Artificial Intelligence. Our method is based on a hierarchical knowledge representation and propositional logic, which allows an expert user to incrementally integrate process and geometric parameters with the robot commands. The objective is to provide an intelligent and programmable agent such as a robot with a knowledge base about the attributes of human behaviors in order to facilitate the commanding process. The focus of this work is on robot programming for manufacturing applications. Industrial manipulators work with low-level programming languages. This work presents a new method based on Natural Language Processing (NLP) that allows a user to generate robot programs using a natural language lexicon and task information. This will enable a manufacturing operator (for example, for painting) who may be unfamiliar with robot programming to easily employ the agent for manufacturing tasks.
