Keyphrase Graph in Text Representation for Document Similarity Measurement

Author(s):  
ThanhThuong T. Huynh ◽  
TruongAn Phamnguyen ◽  
Nhon V. Do

To represent text documents more expressively, a graph-based semantic model is proposed that incorporates semantic relations among keyphrases as well as the structural information of the text. The method produces structured representations of texts by drawing on common, widely used knowledge bases (e.g., DBpedia, Wikipedia) to acquire fine-grained information about concepts, entities, and their semantic relations, resulting in a knowledge-rich interpretation. We demonstrate the benefits of these representations in the task of document similarity measurement: relevance between two documents is evaluated by calculating the semantic similarity between the two keyphrase graphs that represent them. Experimental results show that our approach outperforms standard baselines based on traditional document representations and comes close in performance to specialized methods specifically tuned to this task on the same dataset.
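A minimal, illustrative sketch (not the authors' actual similarity measure) of how relevance between two documents might be scored once each is represented as a keyphrase graph; the keyphrases, relations, and the alpha-weighted Jaccard mix below are assumptions made purely for illustration.

```python
# Sketch: represent each document as a keyphrase graph and score similarity by a
# weighted overlap of nodes (keyphrases) and edges (relations between keyphrases).
import networkx as nx

def keyphrase_graph(keyphrases, relations):
    """Build a graph whose nodes are keyphrases and whose edges are relations."""
    g = nx.Graph()
    g.add_nodes_from(keyphrases)
    g.add_edges_from(relations)
    return g

def graph_similarity(g1, g2, alpha=0.5):
    """Mix Jaccard overlap of keyphrases and of relations (illustrative only)."""
    nodes1, nodes2 = set(g1.nodes), set(g2.nodes)
    edges1 = {frozenset(e) for e in g1.edges}
    edges2 = {frozenset(e) for e in g2.edges}
    node_sim = len(nodes1 & nodes2) / max(len(nodes1 | nodes2), 1)
    edge_sim = len(edges1 & edges2) / max(len(edges1 | edges2), 1)
    return alpha * node_sim + (1 - alpha) * edge_sim

doc_a = keyphrase_graph(
    ["knowledge base", "semantic relation", "DBpedia"],
    [("knowledge base", "DBpedia"), ("knowledge base", "semantic relation")],
)
doc_b = keyphrase_graph(
    ["knowledge base", "document similarity", "DBpedia"],
    [("knowledge base", "DBpedia")],
)
print(graph_similarity(doc_a, doc_b))  # 0.5 on this toy example
```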

AI Magazine ◽  
2015 ◽  
Vol 36 (1) ◽  
pp. 75-86 ◽  
Author(s):  
Jennifer Sleeman ◽  
Tim Finin ◽  
Anupam Joshi

We describe an approach for identifying fine-grained entity types in heterogeneous data graphs that is effective for unstructured data or when the underlying ontologies or semantic schemas are unknown. Identifying fine-grained entity types, rather than a few high-level types, supports coreference resolution in heterogeneous graphs by reducing the number of possible coreference relations that must be considered. Big data problems that involve integrating data from multiple sources can benefit from our approach when the data's ontologies are unknown, inaccessible, or semantically trivial. For such cases, we use supervised machine learning to map entity attributes and relations to a known set of attributes and relations from appropriate background knowledge bases in order to predict instance entity types. We evaluated this approach in experiments on data from DBpedia, Freebase, and Arnetminer, using DBpedia as the background knowledge base.
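A toy sketch of the general idea of supervised type prediction from attribute and relation names; the attribute names, the dbo: types, and the scikit-learn pipeline are illustrative assumptions, not the system described in the paper.

```python
# Sketch: treat the attribute/relation names attached to an entity as features and
# learn a supervised mapping to fine-grained types from a background KB (e.g. DBpedia).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: each entity is a bag of (hypothetical) attribute/relation names.
entities = [
    {"birthPlace": 1, "team": 1, "height": 1},         # athlete-like entity
    {"birthPlace": 1, "recordLabel": 1, "genre": 1},    # musician-like entity
    {"foundingYear": 1, "headquarters": 1},             # company-like entity
]
types = ["dbo:Athlete", "dbo:MusicalArtist", "dbo:Company"]

model = make_pipeline(DictVectorizer(sparse=False), LogisticRegression(max_iter=1000))
model.fit(entities, types)

# Predict a fine-grained type for an unseen entity from another source.
print(model.predict([{"team": 1, "height": 1}])[0])  # likely "dbo:Athlete"
```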


Author(s):  
Andreas Nolle ◽  
Melisachew Wudage Chekol ◽  
Christian Meilicke ◽  
German Nemirovski ◽  
Heiner Stuckenschmidt

2018 ◽  
Vol 7 (2.18) ◽  
pp. 102
Author(s):  
Harsha Patil ◽  
Ramjeevan Singh Thakur

Document clustering is an unsupervised method for grouping documents into clusters on the basis of their similarity. A document is assigned to a specific cluster on the basis of a membership score, which is calculated through a membership function. However, many traditional clustering algorithms are based only on the BOW (Bag of Words) model, which ignores the semantic similarity between a document and a cluster. In this research we consider the semantic association between a cluster and a text document when calculating the membership score of a document for a specific cluster. Several researchers are working on semantic aspects of document clustering to improve clustering performance, and external knowledge bases such as WordNet, Wikipedia, and Lucene are utilized for this purpose. The proposed approach exploits WordNet to improve the cluster membership function. The experimental results show that clustering quality improves significantly under the proposed semantic framework.
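A minimal sketch of a WordNet-augmented membership score; the term_similarity and membership functions below are illustrative, not the paper's actual membership function, and assume nltk with the WordNet corpus available (nltk.download("wordnet")).

```python
# Sketch: score a document's membership in a cluster by WordNet path similarity
# between document terms and the cluster's representative terms.
from nltk.corpus import wordnet as wn

def term_similarity(t1, t2):
    """Best WordNet path similarity over all sense pairs of two terms (0 if none)."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(t1) for s2 in wn.synsets(t2)]
    return max(scores, default=1.0 if t1 == t2 else 0.0)

def membership(doc_terms, cluster_terms):
    """Average best-match similarity of each document term against the cluster."""
    if not doc_terms or not cluster_terms:
        return 0.0
    return sum(max(term_similarity(t, c) for c in cluster_terms)
               for t in doc_terms) / len(doc_terms)

# "car" and "automobile" share a synset, so the score exceeds plain term overlap.
print(membership(["car", "engine"], ["automobile", "vehicle"]))
```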


2021 ◽  
Vol 2083 (4) ◽  
pp. 042044
Author(s):  
Zuhua Dai ◽  
Yuanyuan Liu ◽  
Shilong Di ◽  
Qi Fan

Abstract Aspect-level sentiment analysis is a form of fine-grained sentiment analysis that has attracted extensive research in recent years. For this task, recurrent neural network (RNN) models are usually used for feature extraction, but they cannot effectively capture the structural information of the text. Recent studies have begun to use graph convolutional networks (GCNs) to model the syntactic dependency tree of the text to address this problem. For short texts, the text alone is often insufficient to determine the sentiment polarity of the aspect words, and knowledge graphs are not effectively used as external knowledge to enrich the semantic information. To solve these problems, this paper proposes a graph convolutional network (GCN) model that can process syntactic information, knowledge graphs, and textual semantic information. The model operates on a “syntax-knowledge” graph to extract syntactic information and common-sense knowledge at the same time. Compared with the latest models, the proposed model effectively improves the accuracy of aspect-level sentiment classification on two datasets.
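A toy sketch of a single graph convolution over a combined "syntax-knowledge" adjacency; the layer, dimensions, and the way the two adjacencies are merged are assumptions for illustration, not the paper's architecture.

```python
# Sketch: one GCN layer applied to the union of a dependency-tree adjacency and
# edges contributed by a knowledge graph, propagating token features over both.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # Row-normalize the adjacency (with self-loops) and propagate features.
        adj = adj + torch.eye(adj.size(0))
        adj = adj / adj.sum(dim=1, keepdim=True)
        return torch.relu(self.linear(adj @ x))

tokens = 5                                   # token nodes of a short sentence
syntax = torch.zeros(tokens, tokens)
syntax[0, 1] = syntax[1, 0] = 1.0            # a dependency edge
knowledge = torch.zeros(tokens, tokens)
knowledge[1, 3] = knowledge[3, 1] = 1.0      # an edge injected from the knowledge graph

x = torch.randn(tokens, 16)                  # token embeddings (e.g. from an encoder)
layer = GCNLayer(16, 16)
h = layer(x, torch.clamp(syntax + knowledge, max=1.0))
print(h.shape)  # torch.Size([5, 16])
```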


2014 ◽  
Vol 10 (3) ◽  
pp. 226-244 ◽  
Author(s):  
Johannes Lorey

Purpose – The purpose of this study is to introduce several metrics that enable universal and fine-grained characterization of arbitrary Linked Data repositories. Publicly accessible SPARQL endpoints contain vast amounts of knowledge from a large variety of domains. However, these endpoints are often not configured to process specific workloads as efficiently as possible. Assisting users in leveraging SPARQL endpoints requires insight into the functional and non-functional properties of these knowledge bases. Design/methodology/approach – This study presents comprehensive approaches for deriving these metrics. More specifically, the study utilizes concrete SPARQL queries to determine the corresponding values. Furthermore, it validates and discusses the introduced metrics through extensive evaluation on real-world SPARQL endpoints. Findings – The evaluation determined that endpoints exhibit different characteristics. While it comes as no surprise that latency and throughput are influenced by the network infrastructure, the cost of join operations depends on a number of factors that are not obvious to a data consumer. Moreover, in discussing mean, median, and upper-quartile values, the author found both endpoints that behave consistently and repositories that offer varying levels of performance. Originality/value – On the one hand, the contribution of the author's work lies in assisting data consumers in evaluating the quality of service of publicly available SPARQL endpoints. On the other hand, the performance metrics introduced in this study can also be considered additional input features for distributed query processing frameworks. Moreover, the author provides a universal means of discerning the characteristics of different SPARQL endpoints without the need for (synthetic or real-world) query workloads.
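A sketch of the kind of probing described, assuming the SPARQLWrapper library; the endpoint URL, query, and simple latency metric below are examples, not the metrics or endpoints used in the study.

```python
# Sketch: issue a concrete SPARQL query against a public endpoint and record the
# round-trip latency over several runs.
import time
from SPARQLWrapper import SPARQLWrapper, JSON

def probe_latency(endpoint_url, query, runs=3):
    """Return per-run latencies (seconds) for a query against a SPARQL endpoint."""
    sparql = SPARQLWrapper(endpoint_url)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        sparql.query().convert()          # execute and parse the response
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    q = "SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/City> } LIMIT 10"
    print(probe_latency("https://dbpedia.org/sparql", q))
```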


Author(s):  
Wentao Ding ◽  
Guanji Gao ◽  
Linfeng Shi ◽  
Yuzhong Qu

Recognizing time expressions is a fundamental and important task in many natural language understanding applications, such as reading comprehension and question answering. Several recent state-of-the-art approaches achieve good performance on recognizing time expressions, but they are black-box or based on heuristic rules, which makes the extracted temporal information difficult to interpret. In contrast, classic rule-based or semantic parsing approaches can capture rich structural information, but their recognition performance is weaker. In this paper, we propose a pattern-based approach, called PTime, which automatically generates and selects patterns for recognizing time expressions. In this approach, time expressions in the training text are abstracted into type sequences using fine-grained token types, so the problem is transformed into selecting an appropriate subset of the sequential patterns. We use the Extended Budgeted Maximum Coverage (EBMC) model to optimize pattern selection: the main idea is to maximize the number of correct token sequences matched by the selected patterns while limiting the number of mistakes by an adjustable budget. The interpretability of the patterns and the adjustable budget on permitted mistakes make PTime a very promising approach for many applications. Experimental results show that PTime achieves very competitive performance compared with existing state-of-the-art approaches.
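A minimal sketch of the abstraction step described above, mapping tokens to fine-grained token types so that a time expression becomes a type sequence; the token-type inventory and the example pattern are hypothetical, not PTime's actual ones.

```python
# Sketch: abstract tokens into fine-grained types; recognition then amounts to
# matching type sequences against a selected set of sequential patterns.
import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def token_type(token):
    t = token.lower()
    if t in MONTHS:
        return "MONTH"
    if re.fullmatch(r"\d{4}", t):
        return "YEAR"
    if re.fullmatch(r"\d{1,2}", t):
        return "DAY"
    if t in {"of", "the"}:
        return "FILLER"
    return "WORD"

def abstract(tokens):
    return tuple(token_type(t) for t in tokens)

# A selected pattern is itself a type sequence; recognition is sequence matching.
pattern = ("DAY", "MONTH", "YEAR")
print(abstract(["15", "March", "2019"]) == pattern)   # True
print(abstract(["next", "March"]))                    # ('WORD', 'MONTH')
```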


2017 ◽  
Vol 43 (1) ◽  
pp. 1-30 ◽  
Author(s):  
Claire Gardent ◽  
Laura Perez-Beltrachini

Although there has been much work in recent years on data-driven natural language generation, little attention has been paid to the fine-grained interactions that arise during microplanning between aggregation, surface realization, and sentence segmentation. In this article, we propose a hybrid symbolic/statistical approach to jointly model the constraints regulating these interactions. Our approach integrates a small handwritten grammar, a statistical hypertagger, and a surface realization algorithm. It is applied to the verbalization of knowledge base queries and tested on 13 knowledge bases to demonstrate domain independence. We evaluate our approach in several ways. A quantitative analysis shows that the hybrid approach outperforms a purely symbolic approach in terms of both speed and coverage. Results from a human study indicate that users find the output of this hybrid statistical/symbolic system more fluent than that of both a template-based and a purely symbolic grammar-based approach. Finally, we illustrate by means of examples that our approach can account for various factors impacting aggregation, sentence segmentation, and surface realization.


2021 ◽  
pp. 1-18
Author(s):  
Huajun Chen ◽  
Ning Hu ◽  
Guilin Qi ◽  
Haofen Wang ◽  
Zhen Bi ◽  
...  

Abstract The early concept of the knowledge graph originates from the idea of the Semantic Web, which aims to use structured graphs to model knowledge of the world and record the relationships that exist between things. Publishing knowledge bases as open data on the Web has gained significant attention. In China, CIPS (the Chinese Information Processing Society) launched OpenKG in 2015 to foster the development of Chinese open knowledge graphs. Unlike existing open knowledge programs, OpenKG Chain is envisioned as a blockchain-based open knowledge infrastructure. This article introduces a first attempt at sharing knowledge graphs on OpenKG Chain, a blockchain-based trust network. We have completed testing of the underlying blockchain platform, as well as on-chain testing of OpenKG's dataset and toolset sharing and of fine-grained knowledge crowdsourcing at the triple level. We also propose two novel definitions, K-Point and OpenKG Token, which can be considered measurements of knowledge value and user value. 1,033 knowledge contributors were involved in two months of testing on the blockchain, and the cumulative number of on-chain recordings triggered by real knowledge consumers reached 550,000, with an average daily peak of more than 10,000. For the first time, we have tested and realized on-chain sharing of knowledge at the entity/triple granularity level. At present, all operations on the datasets and toolset in OpenKG.CN, as well as the triples in OpenBase, are recorded on the chain, and the corresponding value is generated and assigned in a trusted mode. Through this effort, OpenKG Chain aims to provide a more credible and traceable knowledge-sharing platform for the knowledge graph community.

