Entropy Optimized Feature-Based Bag-of-Words Representation for Information Retrieval

Abstract: When submitting queries to information retrieval (IR) systems, users often have the option of specifying which, if any, of the query terms are heavily dependent on each other and should be treated as a fixed phrase, for instance by placing them between quotes. In addition to such cases where users specify term dependence, automatic ways also exist for IR systems to detect dependent terms in queries. Most IR systems use both user and algorithmic approaches. It is not clear, however, whether and to what extent user-defined term dependence agrees with algorithmic estimates of term dependence, nor which of the two may yield higher performance gains. Simply put, is it better to trust users or the system to detect term dependence in queries? To answer this question, we experiment with 101 crowdsourced search engine users and 334 queries (52 train and 282 test TREC queries), and we record 10 assessments per query. We find that (i) user assessments of term dependence differ significantly from algorithmic assessments of term dependence (their overlap is approximately 30%); (ii) there is little agreement among users about term dependence in queries, and this disagreement increases as queries become longer; (iii) the potential retrieval gain from treating term dependence (both user- and system-defined) over a bag-of-words baseline is limited to a small subset (approximately 8%) of the queries, and is much higher for low-depth than for deep precision measures. Points (ii) and (iii) constitute novel insights into term dependence.
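The abstract does not say how the overlap between user and algorithmic assessments is computed. As a minimal sketch, assuming each assessment is a set of term tuples marked as dependent and that overlap is measured with the Jaccard coefficient (both are assumptions, not the authors' stated method), the comparison could look like this. The function name `jaccard_overlap` and the sample phrases are hypothetical.

```python
def jaccard_overlap(user_phrases, system_phrases):
    """Jaccard overlap between two collections of phrases (tuples of terms).

    Assumption: this is one plausible overlap measure, not necessarily
    the one used in the study described above.
    """
    user, system = set(user_phrases), set(system_phrases)
    if not user and not system:
        return 1.0  # both empty: treat as full agreement
    return len(user & system) / len(user | system)


# Hypothetical assessments for a single query.
user_marked = {("new", "york"), ("stock", "exchange")}
system_marked = {("new", "york"), ("exchange", "hours")}

overlap = jaccard_overlap(user_marked, system_marked)
print(f"overlap = {overlap:.2f}")
```

Here one phrase is shared out of three distinct phrases in total, so the two assessments agree on a third of the marked dependencies.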


A Structure-Driven Method for Information Retrieval-Based Software Change Impact Analysis

Scientific Programming ◽

10.1155/2018/5494209 ◽

2018 ◽

Vol 2018 ◽

pp. 1-16

Author(s):

Yun He ◽

Tong Li ◽

Wei Wang ◽

Wei Lan ◽

Xiang Li

Keyword(s):

Information Retrieval ◽

Impact Analysis ◽

Contextual Information ◽

Source Code ◽

Word Embedding ◽

Bag Of Words ◽

Change Impact Analysis ◽

Software Change ◽

Change Impact ◽

Single Method

An important application of information retrieval technology is software change impact analysis. Existing information retrieval-based change impact analysis methods select a single method to transform the source code corpus into vectors, a process known as indexing. The single method is chosen from two primary methods, the bag-of-words and word embedding models, each with its own advantages and disadvantages. The bag-of-words model records every word in the source code but ignores contextual information in the corpus. The word embedding model records the contextual information but loses detail for individual words. To address this problem, we propose a structure-driven method for information retrieval-based change impact analysis (named SDM-CIA). SDM-CIA integrates the bag-of-words and word embedding models based on the software’s structure. Our experiments using a standard benchmark show that, compared with the existing methods, SDM-CIA improves precision, recall, F-score, and MRR by an average of 3.65%, 3.82%, 3.6%, and 10.28%, respectively. Our experiments confirm the effectiveness of SDM-CIA.
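The claim that the bag-of-words model records every word but ignores context can be illustrated with a small sketch. It is not the SDM-CIA indexing step; the tokenizer, the function names `bag_of_words` and `cosine`, and the two sample snippets are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count each lowercase alphanumeric token; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two snippets with the same words in a different order.
first = bag_of_words("open file then close file")
second = bag_of_words("close file then open file")

same_bag = first == second
similarity = cosine(first, second)
print(same_bag, round(similarity, 6))
```

Both snippets produce the same term counts, so a bag-of-words index cannot tell them apart even though the order of operations differs. This is the contextual information that the abstract says a word embedding model is meant to capture.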


An efficient voice based information retrieval using bag of words based indexing

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.33.14850 ◽

2018 ◽

Vol 7 (3.3) ◽

pp. 622

Author(s):

R Uma ◽

B Latha

Keyword(s):

Data Mining ◽

Information Retrieval ◽

Retrieval System ◽

Bag Of Words ◽

Text Input ◽

Indexing Method ◽

Three Stages ◽

Physically Challenged ◽

Index Table ◽

Information

Data mining is a leading and rapidly growing research area. One of its main areas is Information Retrieval (IR). Information retrieval is a broad task: finding information in data that has no structured nature. It retrieves the information the user requires from a large collection of data. Existing approaches have yet to improve accuracy in terms of relevance. This paper is motivated to provide an Information Retrieval System (IRS) that retrieves information with high relevancy. The proposed IRS is specially designed for physically challenged people, such as blind people, for whom both the input and the output are voice. The proposed IRS consists of three stages: (i) voice-to-text input, (ii) pattern matching, and (iii) text-to-voice output. To improve accuracy and relevancy, the proposed IRS uses an indexing method called Bag of Words (BOW). BOW acts like an index table that can be consulted to store, compare, and retrieve information quickly and accurately. Using the index table in the IRS improves accuracy while minimizing computational complexity. The proposed IRS is simulated in DOTNET software, and the results are compared with those of the existing system to evaluate performance.
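The abstract describes BOW as an index table used to store, compare, and retrieve information, but it does not give the table's layout. A minimal sketch, assuming the table maps each term to the set of documents that contain it (an inverted index) and that the pattern matching stage returns documents containing every query term, might look like this. The names `build_index` and `lookup` and the sample documents are hypothetical, and the voice-to-text and text-to-voice stages are left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical collection; the query text stands in for the
# output of the voice-to-text stage.
documents = {
    1: "weather report for today",
    2: "train timetable for today",
    3: "weather warning",
}
index = build_index(documents)
matches = lookup(index, "weather today")
print(matches)
```

Looking up terms in the table avoids scanning every document for each query, which is one way to read the abstract's claim about reduced computational cost.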


A personalised user preference and feature based semantic information retrieval system in semantic web search

International Journal of Grid and Utility Computing ◽

10.1504/ijguc.2018.10015151 ◽

2018 ◽

Vol 9 (3) ◽

pp. 256

Author(s):

P. Ranjith Jeba Thangiah ◽

S. Arockiasamy ◽

Princess Maria John

Keyword(s):

Information Retrieval ◽

Semantic Web ◽

Web Search ◽

Semantic Information ◽

Retrieval System ◽

Information Retrieval System ◽

User Preference ◽

Feature Based ◽

Semantic Information Retrieval


Fuzzy Information Retrieval Based on Continuous Bag-of-Words Model

Symmetry ◽

10.3390/sym12020225 ◽

2020 ◽

Vol 12 (2) ◽

pp. 225

Author(s):

Dong Qiu ◽

Haihuan Jiang ◽

Shuqiao Chen

Keyword(s):

Information Retrieval ◽

Large Scale ◽

Query Language ◽

Word Embedding ◽

Bag Of Words ◽

Fuzzy Information ◽

Large Scale Data ◽

Harmonic Average ◽

Scale Data ◽

Fuzzy Information Retrieval

In this paper, we study the feasibility of performing fuzzy information retrieval by word embedding. We propose a fuzzy information retrieval approach to capture the relationships between words and query language, which combines some techniques of deep learning and fuzzy set theory. We try to leverage large-scale data and the continuous bag-of-words model to find the relevant features of words and obtain word embeddings. To enhance retrieval effectiveness, we measure the relativity among words by word embedding, with the property of symmetry. Experimental results show that the recall ratio, precision ratio, and harmonic average of the two ratios of the proposed method outperform those of the traditional methods.
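The three evaluation measures named in the abstract can be sketched as follows, assuming the standard set-based definitions of precision and recall and taking the harmonic average to be their harmonic mean (commonly called the F1 score). The function name and the document ids are hypothetical, and the numbers are not results from the paper.

```python
def precision_recall_harmonic(retrieved, relevant):
    """Set-based precision, recall, and their harmonic mean."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    harmonic = 2 * precision * recall / (precision + recall)
    return precision, recall, harmonic


# Hypothetical result list and relevance judgments for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]

precision, recall, harmonic = precision_recall_harmonic(retrieved, relevant)
print(round(precision, 3), round(recall, 3), round(harmonic, 3))
```

The harmonic mean is pulled toward the smaller of the two ratios, so a method cannot score well on it by maximizing recall alone or precision alone.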