Fast Retrieval Algorithm for Earth Mover's Distance Using EMD Lower Bounds and a Skipping Algorithm

2011 ◽  
Vol 2011 ◽  
pp. 1-9 ◽  
Author(s):  
Masami Shishibori ◽  
Daichi Koizumi ◽  
Kenji Kita

The earth mover's distance (EMD) is a measure of the distance between two distributions, and it has been widely used in multimedia information retrieval systems, in particular in content-based image retrieval systems. When the EMD is applied to image problems based on color or texture, it reflects human perceptual similarity. However, its computation is too expensive for use in large-scale databases. In order to compute the EMD efficiently during query processing, we have developed "fastEMD," a library for high-speed feature-based similarity retrieval in large databases. This paper introduces the techniques used in the implementation of fastEMD and presents extensive experiments to demonstrate its efficiency.
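The fastEMD library itself is not shown in this abstract, so as a hedged illustration of the underlying measure, here is the plain one-dimensional EMD between two weighted histograms computed with SciPy's standard `wasserstein_distance` (the histograms are illustrative; this is not the paper's code):

```python
# Illustrative only: plain 1-D EMD via SciPy, not the paper's fastEMD library.
from scipy.stats import wasserstein_distance

# Two colour-intensity histograms: bin positions and their weights.
bins = [0.0, 1.0, 2.0, 3.0]
hist_a = [0.5, 0.5, 0.0, 0.0]   # mass concentrated on the left
hist_b = [0.0, 0.0, 0.5, 0.5]   # the same mass shifted two bins right

d = wasserstein_distance(bins, bins, hist_a, hist_b)
print(d)  # 2.0 -- every unit of mass moves two bins
```

The abstract's point is precisely that this computation, scaled to high-dimensional signatures and millions of database entries, becomes the query-time bottleneck that lower bounds and skipping are designed to avoid.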

Author(s):  
Sherry Koshman ◽  
Edie Rasmussen

From the 1994 CAIS Conference: The Information Industry in Transition, McGill University, Montreal, Quebec, May 25-27, 1994. "Conventional" information retrieval systems (IRS), originating in the research of the 1950s and 1960s, are based on keyword matching and the application of Boolean operators to produce a set of retrieved documents from a database. In the ensuing years, research in information retrieval has identified a number of innovations (for example, automatic weighting of terms, ranked output, and relevance feedback) which have the potential to significantly enhance the performance of IRS, though commercial vendors have been slow to incorporate these changes into their systems. This was the situation in 1988 which led Radecki, in a special issue of Information Processing & Management, to examine the potential for improvements in conventional Boolean retrieval systems, and to explore the reasons why these improvements had not been implemented in operational systems. Over the last five years, this position has begun to change as commercial vendors such as Dialog, Dow Jones, West Publishing, and Mead have implemented new, non-Boolean features in their systems, including natural language input, weighted keyword terms, and document ranking. This paper identifies some of the significant findings of IR research and compares them to the implementation of non-Boolean features in such systems. The preliminary survey of new features in commercial systems suggests the need for new methods of evaluation, including the development of evaluation measures appropriate to large-scale, interactive systems.


Author(s):  
Tomoki Takada ◽  
Mizuki Arai ◽  
Tomohiro Takagi

Nowadays, an increasingly large amount of information exists on the web. A method is therefore needed that enables us to find necessary information quickly, because doing so is becoming increasingly difficult for users. To solve this problem, information retrieval systems like Google and recommendation systems like Amazon's are used. In this paper, we focus on information retrieval systems. These retrieval systems require index terms, which affect the precision of retrieval. Index terms are generally decided in one of two ways. One is analyzing a text using natural language processing and deciding index terms using statistical measures. The other is having a person choose document keywords as index terms. However, the latter method requires too much time and effort and becomes more impractical as information grows. We therefore propose the Nikkei annotator system, which is based on a model of the human brain, learns patterns of past keyword annotation, and automatically outputs keywords that users prefer. The purposes of the proposed method are automating manual keyword annotation and achieving high-speed, high-accuracy keyword annotation. Experimental results showed that the proposed method is more accurate than TFIDF and Naive Bayes in P@5 and P@10. Moreover, these results also showed that the proposed method could annotate about 19 times faster than Naive Bayes.
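The Nikkei annotator itself is not described in enough detail here to reproduce, but the TFIDF baseline it is compared against can be sketched in a few stdlib-only lines (the corpus and function name are illustrative, not from the paper):

```python
# Minimal TF-IDF keyword scorer -- the baseline the paper compares against,
# not the proposed annotator. The toy corpus is illustrative.
import math
from collections import Counter

corpus = [
    "stock prices rose on strong bank earnings".split(),
    "the central bank held interest rates steady".split(),
    "quarterly earnings beat analyst forecasts".split(),
]

def tfidf_keywords(doc, corpus, k=3):
    """Return the top-k terms of `doc` ranked by TF-IDF over `corpus`."""
    n_docs = len(corpus)
    df = Counter(term for d in corpus for term in set(d))  # document frequency
    tf = Counter(doc)
    scores = {t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

print(tfidf_keywords(corpus[0], corpus))
```

Terms that appear in every document score zero IDF and drop out, which is exactly the weakness the abstract implies: TF-IDF ranks by corpus statistics alone, with no way to learn which keywords human annotators actually preferred in the past.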


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Longjie Li ◽  
Min Ma ◽  
Peng Lei ◽  
Xiaoping Wang ◽  
Xiaoyun Chen

Effective and efficient image comparison plays a vital role in content-based image retrieval (CBIR). The earth mover's distance (EMD) is an enticing measure for image comparison, offering an intuitive geometric interpretation and modelling human perception of similarity. Unfortunately, computing EMD using the simplex method has cubic complexity. FastEMD, based on min-cost flow, reduces the complexity to O(N^2 log N). Although both methods can obtain the optimal result, the high complexity prevents the application of EMD on large-scale image datasets. Thresholding the ground distance can make EMD faster and more robust, since it can decrease the impact of noise and reduce the range of transportation. In this paper, we present a new image distance metric, EMD+, which applies a threshold to the ground distance. To compute EMD+, the FastEMD approach can be employed. We also propose a novel linear approximation algorithm. Our algorithm achieves O(N) complexity with the benefit of qualified bins. Experimental results show that (1) our method is 2 to 3 orders of magnitude faster than EMD (computed by FastEMD) and 2 orders of magnitude faster than FastEMD and (2) the precision of our approximation algorithm is no less than the precision of FastEMD.
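The thresholding idea can be made concrete with a small sketch (illustrative only, not the authors' EMD+ implementation): clip the ground distance at a threshold t and solve the resulting transportation problem as a linear program.

```python
# Sketch: EMD with ground distance clipped at t, posed as a transportation LP.
# Illustrative, not the authors' EMD+ code; histograms are toy-sized.
import numpy as np
from scipy.optimize import linprog

def emd_thresholded(p, q, dist, t):
    """EMD between equal-mass histograms p and q under the
    thresholded ground distance min(dist, t)."""
    n, m = len(p), len(q)
    c = np.minimum(dist, t).ravel()            # clipped transport costs
    A_eq = []
    for i in range(n):                         # outflow: sum_j f_ij = p_i
        row = np.zeros(n * m)
        row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row)
    for j in range(m):                         # inflow: sum_i f_ij = q_j
        col = np.zeros(n * m)
        col[j::m] = 1.0
        A_eq.append(col)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.concatenate([p, q]),
                  bounds=(0, None))
    return res.fun

p = np.array([1.0, 0.0])                       # all mass in bin 0
q = np.array([0.0, 1.0])                       # all mass in bin 1
dist = np.array([[0.0, 1.0], [1.0, 0.0]])      # bin-to-bin ground distances
print(emd_thresholded(p, q, dist, t=2.0))      # plain EMD: cost 1.0
print(emd_thresholded(p, q, dist, t=0.5))      # thresholded: cost capped at 0.5
```

Clipping costs at t is what limits the "range of transportation": moving mass farther than t bins costs no more than t, so distant, noisy matches cannot dominate the distance.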


2021 ◽  
Vol 4 (1) ◽  
pp. 87-89
Author(s):  
Janardan Bhatta

Searching images in a large database is a major requirement in information retrieval systems, and returning image search results for a text query is a challenging task. In this paper, we leverage the power of computer vision and natural language processing on distributed machines to lower the latency of search results. Image pixel features are computed with a contrastive loss function for image search. Text features are computed with an attention mechanism for text search. These features are aligned together, preserving the information in each text and image feature. Previously, the approach was tested only on multilingual models; however, we have tested it on an image-text dataset, and it enables searching with any form of text or image with high accuracy.
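The alignment step can be illustrated with a minimal NumPy sketch: once image and text encoders (not shown here; random vectors stand in for trained embeddings) map into a shared space, cross-modal search reduces to cosine-similarity nearest-neighbour lookup:

```python
# Sketch of shared-space retrieval: random vectors stand in for the
# outputs of trained image and text encoders.
import numpy as np

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(1000, 128))   # 1000 indexed image embeddings
text_emb = rng.normal(size=(128,))         # one text-query embedding

def top_k(query, index, k=5):
    """Cosine-similarity search in the shared embedding space."""
    index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = index_n @ query_n
    return np.argsort(-scores)[:k]         # indices of the best matches

print(top_k(text_emb, image_emb))
```

Because the lookup is a single matrix-vector product, the index can be sharded across distributed machines and the per-shard top-k results merged, which is the latency lever the abstract refers to.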


2020 ◽  
Vol 38 (3) ◽  
pp. 477-492
Author(s):  
Mahdi Zeynali Tazehkandi ◽  
Mohsen Nowkarizi

Purpose: The purpose of this paper is to present a review of the use of the recall metric for evaluating information retrieval systems, especially search engines.

Design/methodology/approach: This paper investigates different researchers' views about recall metrics.

Findings: Five different definitions of recall were identified. For the first group, recall refers to completeness, but it does not specify where all the relevant documents are located. For the second group, recall refers to retrieving all the relevant documents from the collection. However, it seems that the term "collection" is ambiguous. For the third group (first approach), collection means the index of search engines and, for the fourth group (second approach), collection refers to the Web. For the fifth group (third approach), ranking of the retrieved documents should also be accounted for in calculating recall.

Practical implications: It can be said that in the first, second and third approaches, the components of the retrieval algorithm; the retrieval algorithm and crawler; and the retrieval algorithm, crawler and ranker, respectively, are evaluated. To determine the effectiveness of search engines for users, it is better to use the third approach in recall measurement.

Originality/value: The value of this paper is to collect, identify and analyse the literature on recall. In addition, different views of researchers about recall are identified.
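Under the set-based definitions discussed above, recall is a simple overlap ratio; a minimal sketch (document IDs are illustrative) shows where the ambiguity lives:

```python
# Recall = |retrieved ∩ relevant| / |relevant|. Which universe supplies
# "all relevant documents" (the engine's index vs. the whole Web) is
# exactly the ambiguity the paper analyses.
def recall(retrieved, relevant):
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant)

print(recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8]))  # 0.5
```

The formula is fixed; only the denominator's scope changes between the approaches, which is why the same engine can score very different recall values depending on the definition chosen.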


Science ◽  
2017 ◽  
Vol 358 (6364) ◽  
pp. 793-796 ◽  
Author(s):  
Sanjoy Dasgupta ◽  
Charles F. Stevens ◽  
Saket Navlakha

Similarity search—for example, identifying similar images in a database or similar documents on the web—is a fundamental computing problem faced by large-scale information retrieval systems. We discovered that the fruit fly olfactory circuit solves this problem with a variant of a computer science algorithm (called locality-sensitive hashing). The fly circuit assigns similar neural activity patterns to similar odors, so that behaviors learned from one odor can be applied when a similar odor is experienced. The fly algorithm, however, uses three computational strategies that depart from traditional approaches. These strategies can be translated to improve the performance of computational similarity searches. This perspective helps illuminate the logic supporting an important sensory function and provides a conceptually new algorithm for solving a fundamental computational problem.
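The fly's hashing scheme described in the paper can be sketched in NumPy: a sparse random projection expands the input to a much higher dimension, and a winner-take-all step keeps only the most active units as the hash tag (the dimensions and sparsity here are illustrative, not the biological values):

```python
# Sketch of fly-inspired LSH: sparse random expansion + winner-take-all.
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, k = 50, 2000, 100             # input dim, expanded dim, tag size

# Sparse binary projection: each expanded unit samples a few inputs,
# mimicking the sparse connectivity onto Kenyon cells.
proj = (rng.random((d_out, d_in)) < 0.1).astype(float)

def fly_hash(x):
    """Return the indices of the top-k most active expanded units."""
    activity = proj @ x
    return set(np.argsort(-activity)[:k])  # winner-take-all tag

a = rng.random(d_in)
b = a + 0.01 * rng.random(d_in)            # a slightly perturbed copy of a
c = rng.random(d_in)                       # an unrelated point

overlap = lambda s, t: len(s & t) / k
print(overlap(fly_hash(a), fly_hash(b)))   # high: similar inputs share tags
print(overlap(fly_hash(a), fly_hash(c)))   # lower for dissimilar inputs
```

Similar inputs produce heavily overlapping tags while dissimilar ones mostly do not, which is the locality-sensitive property; the departure from classical LSH is that the fly expands dimensionality with sparse binary weights rather than compressing it with dense Gaussian ones.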


2020 ◽  
Vol 54 (1) ◽  
pp. 1-2
Author(s):  
Joel M. Mackenzie

As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem --- how can such large amounts of data continue to be served efficiently enough to satisfy end users? This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high percentile latency that is observed from a system --- in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence, the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions which more accurately capture user experience such as "how many queries take more than 200 ms to return answers?" or "what is the worst case latency that a user may be subject to, and how often might it occur?" While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations, and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness. 
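The shift in questions described above is easy to make concrete: instead of the mean or median, report high percentiles and exceedance counts over a latency log (the millisecond values below are synthetic, generated to mimic a skewed latency distribution):

```python
# Tail-latency reporting over a synthetic query-latency log (milliseconds).
import numpy as np

rng = np.random.default_rng(1)
latencies = rng.lognormal(mean=3.5, sigma=0.6, size=100_000)  # skewed, like real traffic

print(f"median : {np.percentile(latencies, 50):.1f} ms")
print(f"p95    : {np.percentile(latencies, 95):.1f} ms")
print(f"p99.9  : {np.percentile(latencies, 99.9):.1f} ms")    # the 'tail'
print(f"> 200ms: {(latencies > 200).sum()} queries over budget")
```

On a skewed distribution like this, the p99.9 figure is several times the median, which is why optimizing the average alone can leave the worst-served users far outside any latency budget.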
We then propose and solve a new problem, which involves processing a number of related query variations together, known as multi-queries , to yield higher quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.

