Entropy Optimized Feature-Based Bag-of-Words Representation for Information Retrieval

2016 ◽  
Vol 28 (7) ◽  
pp. 1664-1677 ◽  
Author(s):  
Nikolaos Passalis ◽  
Anastasios Tefas


2018 ◽  
Vol 2 (1) ◽  
pp. 1-14 ◽  
Author(s):  
Christina Lioma ◽  
Birger Larsen ◽  
Peter Ingwersen

When submitting queries to information retrieval (IR) systems, users often have the option of specifying which, if any, of the query terms are heavily dependent on each other and should be treated as a fixed phrase, for instance by placing them between quotes. In addition to such cases where users specify term dependence, automatic ways also exist for IR systems to detect dependent terms in queries. Most IR systems use both user and algorithmic approaches. It is not clear, however, whether and to what extent user-defined term dependence agrees with algorithmic estimates of term dependence, nor which of the two may yield higher performance gains. Simply put, is it better to trust users or the system to detect term dependence in queries? To answer this question, we experiment with 101 crowdsourced search engine users and 334 queries (52 train and 282 test TREC queries), and we record 10 assessments per query. We find that (i) user assessments of term dependence differ significantly from algorithmic assessments of term dependence (their overlap is approximately 30%); (ii) there is little agreement among users about term dependence in queries, and this disagreement increases as queries become longer; (iii) the potential retrieval gain that can be obtained by treating term dependence (both user- and system-defined) over a bag-of-words baseline is limited to a small subset (approximately 8%) of the queries, and is much higher for low-depth than for deep precision measures. Points (ii) and (iii) constitute novel insights into term dependence.
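
To make the contrast between a quoted phrase and a bag-of-words query concrete, the following is a minimal Python sketch. It is not the retrieval setup used in the study; the function names `tokenize`, `bag_of_words_match`, and `phrase_match` and the example document are hypothetical, chosen only for illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bag_of_words_match(query, document):
    """True if every query term occurs somewhere in the document, in any order."""
    doc_terms = set(tokenize(document))
    return all(term in doc_terms for term in tokenize(query))


def phrase_match(query, document):
    """True only if the query terms occur contiguously and in the same order."""
    q = tokenize(query)
    d = tokenize(document)
    n = len(q)
    return any(d[i:i + n] == q for i in range(len(d) - n + 1))


# Hypothetical document: both terms are present, but not as a fixed phrase.
doc = "the hot weather made the dog tired"
print(bag_of_words_match("hot dog", doc))  # True
print(phrase_match("hot dog", doc))        # False
```

Treating "hot dog" as a dependent phrase rejects this document, while the bag-of-words reading accepts it; that difference is the kind of effect the user and system assessments of term dependence are meant to control.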


2018 ◽  
Vol 2018 ◽  
pp. 1-16
Author(s):  
Yun He ◽  
Tong Li ◽  
Wei Wang ◽  
Wei Lan ◽  
Xiang Li

An important application of information retrieval technology is software change impact analysis. Existing information retrieval-based change impact analysis methods select a single method to transform the source code corpus into vectors, in a process known as indexing. The single method is chosen from two primary methods, the bag-of-words and word embedding models, each with its own advantages and disadvantages. The bag-of-words model records every word in the source code but ignores contextual information in the corpus. The word embedding model records the contextual information but loses detail for individual words. To address this problem, we propose a structure-driven method for information retrieval-based change impact analysis (named SDM-CIA). SDM-CIA integrates the bag-of-words and word embedding models based on the software’s structure. Our experiments using a standard benchmark show that, compared with the existing methods, SDM-CIA improves precision, recall, F-score, and MRR by an average of 3.65%, 3.82%, 3.6%, and 10.28%, respectively. Our experiments confirm the effectiveness of SDM-CIA.
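
The claim that a bag-of-words index records every word but ignores context can be illustrated with a small Python sketch. This is not SDM-CIA or its indexing pipeline; `bow_vector`, the vocabulary, and the two code-like snippets are assumptions made up for the example.

```python
import re
from collections import Counter


def bow_vector(source_text, vocabulary):
    """Count how often each vocabulary term occurs; word order is discarded."""
    tokens = re.findall(r"[a-z]+", source_text.lower())
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and snippets, for illustration only.
vocabulary = ["account", "close", "open"]
snippet_a = "open account close account"
snippet_b = "close account open account"

# Both snippets map to the same vector even though the order of
# operations they describe is different.
print(bow_vector(snippet_a, vocabulary))
print(bow_vector(snippet_b, vocabulary))
```

The two snippets are indistinguishable after indexing, which is the loss of contextual information that the abstract attributes to the bag-of-words model.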


2018 ◽  
Vol 7 (3.3) ◽  
pp. 622
Author(s):  
R Uma ◽  
B Latha

Data mining is one of the leading and fastest-growing research areas today. One of its main areas is Information Retrieval (IR). Information retrieval is a broad task: it finds information in data that has no inherent structure, retrieving what the user requires from a large collection of data. Existing approaches have yet to achieve high accuracy in terms of relevance. This paper is therefore motivated to provide an Information Retrieval System (IRS) that retrieves information with high relevancy. The proposed IRS is designed specifically for physically challenged people, such as blind people, so both the input and the output are voice. The proposed IRS operates in three stages: (i) voice-to-text input, (ii) pattern matching, and (iii) text-to-voice output. To improve accuracy and relevancy, the proposed IRS uses an indexing method called Bag of Words (BOW). BOW acts like an index table that can be consulted to store, compare, and retrieve information quickly and accurately. Using an index table in the IRS improves accuracy while minimizing computational complexity. The proposed IRS is simulated in DOTNET software, and the results are compared with those of the existing system to evaluate its performance.
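
The idea of an index table that is built once and then consulted for fast lookup can be sketched as a simple inverted index in Python. This is an assumption about what such a table might look like, not the authors' DOTNET implementation; `build_index`, `lookup`, and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical collection; in the described system the query text
# would come from the voice-to-text stage.
documents = {
    1: "train timetable for the city station",
    2: "city weather forecast",
    3: "station opening hours",
}
index = build_index(documents)
print(lookup(index, "city station"))  # {1}
```

Because the table is keyed by term, a query touches only the entries for its own words instead of scanning every document, which is one plausible reading of the reduced computational cost mentioned in the abstract.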


Symmetry ◽  
2020 ◽  
Vol 12 (2) ◽  
pp. 225
Author(s):  
Dong Qiu ◽  
Haihuan Jiang ◽  
Shuqiao Chen

In this paper, we study the feasibility of performing fuzzy information retrieval by word embedding. We propose a fuzzy information retrieval approach that captures the relationships between words and the query language by combining techniques from deep learning and fuzzy set theory. We leverage large-scale data and the continuous bag-of-words model to find the relevant features of words and obtain word embeddings. To enhance retrieval effectiveness, we measure the relatedness among words through their embeddings, a measure that has the property of symmetry. Experimental results show that the recall ratio, precision ratio, and harmonic mean of the two ratios achieved by the proposed method exceed those of the traditional methods.
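
One common way to measure relatedness between word embeddings symmetrically is cosine similarity. The sketch below assumes that measure purely for illustration; the abstract does not state which measure the authors use, and the three-dimensional vectors are made-up toy values, not embeddings trained with the continuous bag-of-words model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; symmetric in u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_u * norm_v)


# Toy embeddings (hypothetical values, for illustration only).
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

car, auto, banana = (embeddings[w] for w in ("car", "automobile", "banana"))
# Swapping the arguments does not change the score.
print(math.isclose(cosine_similarity(car, auto), cosine_similarity(auto, car)))
# Related words score higher than unrelated ones.
print(cosine_similarity(car, auto) > cosine_similarity(car, banana))
```

A score in a bounded range like this could serve as a graded degree of relatedness between a query word and a document word, which is the kind of quantity a fuzzy retrieval model works with.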

