Keyphrase Graph in Text Representation for Document Similarity Measurement

Author(s):  
ThanhThuong T. Huynh ◽  
TruongAn Phamnguyen ◽  
Nhon V. Do

To represent text documents more expressively, a graph-based semantic model is proposed that incorporates semantic relations among keyphrases as well as the structural information of the text. The method produces structured representations of texts by drawing on common, widely used knowledge bases (e.g., DBpedia, Wikipedia) to acquire fine-grained information about concepts, entities, and their semantic relations, resulting in a knowledge-rich interpretation. We demonstrate the benefits of these representations in the task of document similarity measurement: relevance between two documents is evaluated by calculating the semantic similarity between the two keyphrase graphs that represent them. Experimental results show that our approach outperforms standard baselines based on traditional document representations and comes close in performance to specialized methods specifically tuned to this task on the same dataset.
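A minimal, illustrative sketch (not the authors' actual similarity measure) of how relevance between two documents might be scored once each is represented as a keyphrase graph; the keyphrases, relations, and the alpha-weighted Jaccard mix below are assumptions made purely for illustration.

```python
# Sketch: represent each document as a keyphrase graph and score similarity by a
# weighted overlap of nodes (keyphrases) and edges (relations between keyphrases).
import networkx as nx

def keyphrase_graph(keyphrases, relations):
    """Build a graph whose nodes are keyphrases and whose edges are relations."""
    g = nx.Graph()
    g.add_nodes_from(keyphrases)
    g.add_edges_from(relations)
    return g

def graph_similarity(g1, g2, alpha=0.5):
    """Mix Jaccard overlap of keyphrases and of relations (illustrative only)."""
    nodes1, nodes2 = set(g1.nodes), set(g2.nodes)
    edges1 = {frozenset(e) for e in g1.edges}
    edges2 = {frozenset(e) for e in g2.edges}
    node_sim = len(nodes1 & nodes2) / max(len(nodes1 | nodes2), 1)
    edge_sim = len(edges1 & edges2) / max(len(edges1 | edges2), 1)
    return alpha * node_sim + (1 - alpha) * edge_sim

doc_a = keyphrase_graph(
    ["knowledge base", "semantic relation", "DBpedia"],
    [("knowledge base", "DBpedia"), ("knowledge base", "semantic relation")],
)
doc_b = keyphrase_graph(
    ["knowledge base", "document similarity", "DBpedia"],
    [("knowledge base", "DBpedia")],
)
print(graph_similarity(doc_a, doc_b))  # 0.5 on this toy example
```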

AI Magazine ◽  
2015 ◽  
Vol 36 (1) ◽  
pp. 75-86 ◽  
Author(s):  
Jennifer Sleeman ◽  
Tim Finin ◽  
Anupam Joshi

We describe an approach for identifying fine-grained entity types in heterogeneous data graphs that is effective for unstructured data or when the underlying ontologies or semantic schemas are unknown. Identifying fine-grained entity types, rather than a few high-level types, supports coreference resolution in heterogeneous graphs by reducing the number of possible coreference relations that must be considered. Big data problems that involve integrating data from multiple sources can benefit from our approach when the data's ontologies are unknown, inaccessible, or semantically trivial. For such cases, we use supervised machine learning to map entity attributes and relations to a known set of attributes and relations from appropriate background knowledge bases in order to predict instance entity types. We evaluated this approach in experiments on data from DBpedia, Freebase, and Arnetminer, using DBpedia as the background knowledge base.
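A toy sketch of the general idea of supervised type prediction from attribute and relation names; the attribute names, the dbo: types, and the scikit-learn pipeline are illustrative assumptions, not the system described in the paper.

```python
# Sketch: treat the attribute/relation names attached to an entity as features and
# learn a supervised mapping to fine-grained types from a background KB (e.g. DBpedia).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: each entity is a bag of (hypothetical) attribute/relation names.
entities = [
    {"birthPlace": 1, "team": 1, "height": 1},         # athlete-like entity
    {"birthPlace": 1, "recordLabel": 1, "genre": 1},    # musician-like entity
    {"foundingYear": 1, "headquarters": 1},             # company-like entity
]
types = ["dbo:Athlete", "dbo:MusicalArtist", "dbo:Company"]

model = make_pipeline(DictVectorizer(sparse=False), LogisticRegression(max_iter=1000))
model.fit(entities, types)

# Predict a fine-grained type for an unseen entity from another source.
print(model.predict([{"team": 1, "height": 1}])[0])  # likely "dbo:Athlete"
```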


Author(s):  
Andreas Nolle ◽  
Melisachew Wudage Chekol ◽  
Christian Meilicke ◽  
German Nemirovski ◽  
Heiner Stuckenschmidt

2018 ◽  
Vol 7 (2.18) ◽  
pp. 102
Author(s):  
Harsha Patil ◽  
Ramjeevan Singh Thakur

Document clustering is an unsupervised method for grouping documents into clusters on the basis of their similarity. A document is assigned to a specific cluster on the basis of a membership score, which is calculated through a membership function. However, many traditional clustering algorithms are based only on the BOW (Bag of Words) model, which ignores the semantic similarity between a document and a cluster. In this research we consider the semantic association between a cluster and a text document when calculating the membership score of a document for a specific cluster. Several researchers are working on semantic aspects of document clustering to improve clustering performance, and external knowledge bases such as WordNet, Wikipedia, and Lucene are utilized for this purpose. The proposed approach exploits WordNet to improve the cluster membership function. The experimental results show that clustering quality improves significantly under the proposed semantic framework.
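A minimal sketch of a WordNet-augmented membership score; the term_similarity and membership functions below are illustrative, not the paper's actual membership function, and assume nltk with the WordNet corpus available (nltk.download("wordnet")).

```python
# Sketch: score a document's membership in a cluster by WordNet path similarity
# between document terms and the cluster's representative terms.
from nltk.corpus import wordnet as wn

def term_similarity(t1, t2):
    """Best WordNet path similarity over all sense pairs of two terms (0 if none)."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(t1) for s2 in wn.synsets(t2)]
    return max(scores, default=1.0 if t1 == t2 else 0.0)

def membership(doc_terms, cluster_terms):
    """Average best-match similarity of each document term against the cluster."""
    if not doc_terms or not cluster_terms:
        return 0.0
    return sum(max(term_similarity(t, c) for c in cluster_terms)
               for t in doc_terms) / len(doc_terms)

# "car" and "automobile" share a synset, so the score exceeds plain term overlap.
print(membership(["car", "engine"], ["automobile", "vehicle"]))
```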


2021 ◽  
Vol 2083 (4) ◽  
pp. 042044
Author(s):  
Zuhua Dai ◽  
Yuanyuan Liu ◽  
Shilong Di ◽  
Qi Fan

Abstract Aspect-level sentiment analysis is a form of fine-grained sentiment analysis that has attracted extensive research in recent years. For this task, recurrent neural network (RNN) models are usually used for feature extraction, but they cannot effectively capture the structural information of the text. Recent studies have begun to use graph convolutional networks (GCNs) to model the syntactic dependency tree of the text to address this problem. For short texts, the text alone is often insufficient to determine the sentiment polarity of the aspect words, and knowledge graphs are not effectively used as external knowledge to enrich the semantic information. To solve these problems, this paper proposes a graph convolutional network (GCN) model that can process syntactic information, knowledge graphs, and textual semantic information. The model operates on a “syntax-knowledge” graph to extract syntactic information and common-sense knowledge at the same time. Compared with the latest models, the proposed model effectively improves the accuracy of aspect-level sentiment classification on two datasets.
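A toy sketch of a single graph convolution over a combined "syntax-knowledge" adjacency; the layer, dimensions, and the way the two adjacencies are merged are assumptions for illustration, not the paper's architecture.

```python
# Sketch: one GCN layer applied to the union of a dependency-tree adjacency and
# edges contributed by a knowledge graph, propagating token features over both.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # Row-normalize the adjacency (with self-loops) and propagate features.
        adj = adj + torch.eye(adj.size(0))
        adj = adj / adj.sum(dim=1, keepdim=True)
        return torch.relu(self.linear(adj @ x))

tokens = 5                                   # token nodes of a short sentence
syntax = torch.zeros(tokens, tokens)
syntax[0, 1] = syntax[1, 0] = 1.0            # a dependency edge
knowledge = torch.zeros(tokens, tokens)
knowledge[1, 3] = knowledge[3, 1] = 1.0      # an edge injected from the knowledge graph

x = torch.randn(tokens, 16)                  # token embeddings (e.g. from an encoder)
layer = GCNLayer(16, 16)
h = layer(x, torch.clamp(syntax + knowledge, max=1.0))
print(h.shape)  # torch.Size([5, 16])
```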


2014 ◽  
Vol 10 (3) ◽  
pp. 226-244 ◽  
Author(s):  
Johannes Lorey

Purpose – The purpose of this study is to introduce several metrics that enable universal and fine-grained characterization of arbitrary Linked Data repositories. Publicly accessible SPARQL endpoints contain vast amounts of knowledge from a large variety of domains. However, these endpoints are often not configured to process specific workloads as efficiently as possible. Assisting users in leveraging SPARQL endpoints requires insight into the functional and non-functional properties of these knowledge bases. Design/methodology/approach – This study presents comprehensive approaches for deriving these metrics. More specifically, the study utilizes concrete SPARQL queries to determine the corresponding values. Furthermore, it validates and discusses the introduced metrics through extensive evaluation on real-world SPARQL endpoints. Findings – The evaluation determined that endpoints exhibit different characteristics. While it comes as no surprise that latency and throughput are influenced by the network infrastructure, the cost of join operations depends on a number of factors that are not obvious to a data consumer. Moreover, in discussing mean, median, and upper-quartile values, the author found both endpoints that behave consistently and repositories that offer varying levels of performance. Originality/value – On the one hand, the contribution of the author's work lies in assisting data consumers in evaluating the quality of service of publicly available SPARQL endpoints. On the other hand, the performance metrics introduced in this study can also be considered additional input features for distributed query processing frameworks. Moreover, the author provides a universal means of discerning the characteristics of different SPARQL endpoints without the need for (synthetic or real-world) query workloads.
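A sketch of the kind of probing described, assuming the SPARQLWrapper library; the endpoint URL, query, and simple latency metric below are examples, not the metrics or endpoints used in the study.

```python
# Sketch: issue a concrete SPARQL query against a public endpoint and record the
# round-trip latency over several runs.
import time
from SPARQLWrapper import SPARQLWrapper, JSON

def probe_latency(endpoint_url, query, runs=3):
    """Return per-run latencies (seconds) for a query against a SPARQL endpoint."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        sparql.query().convert()          # execute and parse the response
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    q = "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/City> } LIMIT 10"
    print(probe_latency("https://dbpedia.org/sparql", q))
```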


Author(s):  
Wentao Ding ◽  
Guanji Gao ◽  
Linfeng Shi ◽  
Yuzhong Qu

Recognizing time expressions is a fundamental and important task in many natural language understanding applications, such as reading comprehension and question answering. Several recent state-of-the-art approaches achieve good performance on recognizing time expressions, but they are black-box or based on heuristic rules, which makes the extracted temporal information difficult to interpret. In contrast, classic rule-based or semantic parsing approaches can capture rich structural information, but their recognition performance is weaker. In this paper, we propose a pattern-based approach, called PTime, which automatically generates and selects patterns for recognizing time expressions. In this approach, time expressions in the training text are abstracted into type sequences using fine-grained token types, so the problem is transformed into selecting an appropriate subset of the sequential patterns. We use the Extended Budgeted Maximum Coverage (EBMC) model to optimize pattern selection: the main idea is to maximize the number of correct token sequences matched by the selected patterns while limiting the number of mistakes by an adjustable budget. The interpretability of the patterns and the adjustable budget on permitted mistakes make PTime a very promising approach for many applications. Experimental results show that PTime achieves very competitive performance compared with existing state-of-the-art approaches.
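A minimal sketch of the abstraction step described above, mapping tokens to fine-grained token types so that a time expression becomes a type sequence; the token-type inventory and the example pattern are hypothetical, not PTime's actual ones.

```python
# Sketch: abstract tokens into fine-grained types; recognition then amounts to
# matching type sequences against a selected set of sequential patterns.
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def token_type(token):
    t = token.lower()
    if t in MONTHS:
        return "MONTH"
    if re.fullmatch(r"\d{4}", t):
        return "YEAR"
    if re.fullmatch(r"\d{1,2}", t):
        return "DAY"
    if t in {"of", "the"}:
        return "FILLER"
    return "WORD"

def abstract(tokens):
    return tuple(token_type(t) for t in tokens)

# A selected pattern is itself a type sequence; recognition is sequence matching.
pattern = ("DAY", "MONTH", "YEAR")
print(abstract(["15", "March", "2019"]) == pattern)   # True
print(abstract(["next", "March"]))                    # ('WORD', 'MONTH')
```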


2017 ◽  
Vol 43 (1) ◽  
pp. 1-30 ◽  
Author(s):  
Claire Gardent ◽  
Laura Perez-Beltrachini

Although there has been much work in recent years on data-driven natural language generation, little attention has been paid to the fine-grained interactions that arise during microplanning between aggregation, surface realization, and sentence segmentation. In this article, we propose a hybrid symbolic/statistical approach to jointly model the constraints regulating these interactions. Our approach integrates a small handwritten grammar, a statistical hypertagger, and a surface realization algorithm. It is applied to the verbalization of knowledge base queries and tested on 13 knowledge bases to demonstrate domain independence. We evaluate our approach in several ways. A quantitative analysis shows that the hybrid approach outperforms a purely symbolic approach in terms of both speed and coverage. Results from a human study indicate that users find the output of this hybrid statistical/symbolic system more fluent than that of both a template-based and a purely symbolic grammar-based approach. Finally, we illustrate by means of examples that our approach can account for various factors impacting aggregation, sentence segmentation, and surface realization.


2021 ◽  
pp. 1-18
Author(s):  
Huajun Chen ◽  
Ning Hu ◽  
Guilin Qi ◽  
Haofen Wang ◽  
Zhen Bi ◽  
...  

Abstract The early concept of the knowledge graph originates from the idea of the Semantic Web, which aims to use structured graphs to model knowledge of the world and record the relationships that exist between things. Publishing knowledge bases as open data on the Web has gained significant attention. In China, CIPS (the Chinese Information Processing Society) launched OpenKG in 2015 to foster the development of Chinese open knowledge graphs. Unlike existing open knowledge programs, OpenKG Chain is envisioned as a blockchain-based open knowledge infrastructure. This article introduces a first attempt at sharing knowledge graphs on OpenKG Chain, a blockchain-based trust network. We have completed testing of the underlying blockchain platform, as well as on-chain testing of OpenKG's dataset and toolset sharing and of fine-grained knowledge crowdsourcing at the triple level. We also propose two novel definitions, K-Point and OpenKG Token, which can be considered measurements of knowledge value and user value. 1,033 knowledge contributors were involved in two months of testing on the blockchain, and the cumulative number of on-chain recordings triggered by real knowledge consumers reached 550,000, with an average daily peak of more than 10,000. For the first time, we have tested and realized on-chain sharing of knowledge at the entity/triple granularity level. At present, all operations on the datasets and toolset in OpenKG.CN, as well as the triples in OpenBase, are recorded on the chain, and the corresponding value is generated and assigned in a trusted mode. Through this effort, OpenKG Chain aims to provide a more credible and traceable knowledge-sharing platform for the knowledge graph community.

