SAHA: A String Adaptive Hash Table for Analytical Databases

2020 ◽  
Vol 10 (6) ◽  
pp. 1915
Author(s):  
Tianqi Zheng ◽  
Zhibin Zhang ◽  
Xueqi Cheng

Hash tables are the fundamental data structure for analytical database workloads such as aggregation, joining, set filtering, and record deduplication. Their performance differs drastically depending on the kind of data being processed and on how many inserts, lookups, and deletes are performed. In this paper, we address some common use cases of hash tables: aggregating and joining over arbitrary string data. We designed a new hash table, SAHA, which is tightly integrated with modern analytical databases and optimized for string data with the following advantages: (1) it inlines short strings and saves hash values for long strings only; (2) it uses special memory loading techniques to do quick dispatching and hashing computations; and (3) it utilizes vectorized processing to batch hashing operations. Our evaluation results reveal that SAHA outperforms state-of-the-art hash tables, including Google’s SwissTable and Facebook’s F14Table, by one to five times in analytical workloads. It has been merged into the ClickHouse database and shows promising results in production.

2021 ◽  
Vol 8 (2) ◽  
pp. 1-17
Author(s):  
Oded Green

In this article, we introduce HashGraph, a new scalable approach for building hash tables that uses concepts taken from sparse graph representations, hence the name HashGraph. HashGraph introduces a new way to deal with hash collisions that does not use “open-addressing” or “separate-chaining,” yet it has the benefits of both these approaches. HashGraph currently works for static inputs. Recent progress with dynamic graph data structures suggests that HashGraph might be extendable to dynamic inputs as well. We show that HashGraph can deal with a large number of hash values per entry without loss of performance. Last, we show a new querying algorithm for value lookups. We experimentally compare HashGraph to several state-of-the-art implementations and find that it outperforms them by 2× on average when the inputs are unique and by as much as 40× when the input contains duplicates. The implementation of HashGraph in this article is for NVIDIA GPUs. HashGraph can build a hash table at a rate of 2.5 billion keys per second on an NVIDIA GV100 GPU and can query at nearly the same rate.
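
The abstract does not spell out the layout, so the following is only a minimal sketch of one way a static hash table can borrow from sparse graph storage. It assumes a CSR-like layout (an offsets array plus one contiguous slots array) built by counting, prefix-summing, and scattering. The names `build_csr_table` and `count_matches` are hypothetical, and this is sequential Python, not the article's GPU implementation.

```python
def build_csr_table(keys, num_buckets):
    """Build a static hash table laid out like a CSR graph (assumed layout)."""
    bucket_of = [hash(k) % num_buckets for k in keys]

    # Pass 1: count how many keys land in each bucket.
    counts = [0] * num_buckets
    for b in bucket_of:
        counts[b] += 1

    # Pass 2: an exclusive prefix sum gives each bucket's start offset.
    offsets = [0] * (num_buckets + 1)
    for b in range(num_buckets):
        offsets[b + 1] = offsets[b] + counts[b]

    # Pass 3: scatter the keys into one contiguous array, bucket by bucket.
    cursor = offsets[:-1].copy()
    slots = [None] * len(keys)
    for k, b in zip(keys, bucket_of):
        slots[cursor[b]] = k
        cursor[b] += 1
    return offsets, slots


def count_matches(offsets, slots, key):
    """Count occurrences of key by scanning only the slice of its bucket."""
    b = hash(key) % (len(offsets) - 1)
    return sum(1 for k in slots[offsets[b]:offsets[b + 1]] if k == key)
```

In this layout a bucket is just the slice `slots[offsets[b]:offsets[b + 1]]`, so colliding keys sit next to each other without per-entry pointers and without probing into neighbouring buckets.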


2021 ◽  
Vol 14 (11) ◽  
pp. 2244-2257
Author(s):  
Otmar Ertl

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or Hyper-MinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.
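
As background for the MinHash side of this comparison, here is a minimal sketch of classic MinHash, not of SetSketch itself. It assumes salted BLAKE2b digests as a stand-in for a family of independent hash functions; the function names are hypothetical. The `merge` helper illustrates the kind of commutative, idempotent, mergeable state the abstract refers to.

```python
import hashlib


def minhash_signature(items, num_hashes=128):
    """Return one minimum hash value per salted hash function (items must be non-empty)."""
    encoded = [str(x).encode() for x in items]
    signature = []
    for i in range(num_hashes):
        # A distinct salt stands in for a distinct hash function (assumption).
        salt = i.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(e, digest_size=8, salt=salt).digest(), "little")
            for e in encoded
        ))
    return signature


def merge(sig_a, sig_b):
    """Signature of the union: component-wise minimum (commutative and idempotent)."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching components, an estimate of the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

For two sets that share a third of their combined elements, `estimate_jaccard` should land near 1/3, with an error that shrinks as `num_hashes` grows.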


2021 ◽  
Vol 14 (13) ◽  
pp. 3267-3280
Author(s):  
Huayi Wang ◽  
Jingfan Meng ◽  
Long Gong ◽  
Jun Xu ◽  
Mitsunori Ogihara

Approximate Nearest Neighbor Search (ANNS) is a fundamental algorithmic problem, with numerous applications in many areas of computer science. Locality-Sensitive Hashing (LSH) is one of the most popular solution approaches for ANNS. A common shortcoming of many LSH schemes is that since they probe only a single bucket in a hash table, they need to use a large number of hash tables to achieve a high query accuracy. For ANNS-L2, a multi-probe scheme was proposed to overcome this drawback by strategically probing multiple buckets in a hash table. In this work, we propose MP-RW-LSH, the first and so far only multi-probe LSH solution to ANNS in L1 distance, and show that it achieves a better tradeoff between scalability and query efficiency than all existing LSH-based solutions. We also explain why a state-of-the-art ANNS-L1 solution called Cauchy projection LSH (CP-LSH) is fundamentally not suitable for multi-probe extension. Finally, as a use case, we construct, using MP-RW-LSH as the underlying "ANNS-L1 engine", a new ANNS-E (E for edit distance) solution that beats the state of the art.


2021 ◽  
Vol 25 (2) ◽  
pp. 283-303
Author(s):  
Na Liu ◽  
Fei Xie ◽  
Xindong Wu

Approximate multi-pattern matching is an important and widely used operation, particularly when the patterns contain variable-length wildcards. In this paper, two suffix array-based algorithms are proposed to solve this problem. The suffix array is an efficient data structure for exact string matching in existing studies, as well as for approximate pattern matching and multi-pattern matching. The first algorithm, MMSA-S, handles the short exact characters in a pattern by dynamic programming, while the second, MMSA-L, deals with the long exact characters by the edit distance method. Experimental results on the Pizza & Chili corpus demonstrate that the two proposed algorithms are, in most cases, more time-efficient than the state-of-the-art comparison algorithms.


2021 ◽  
Author(s):  
Danila Piatov ◽  
Sven Helmer ◽  
Anton Dignös ◽  
Fabio Persia

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
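
To make the plane-sweep idea concrete, here is a minimal sketch of a sweep-based join for a single predicate, plain overlap. It assumes half-open intervals `(start, end)` with `start < end`, and it uses ordinary Python sets for the active intervals where the paper employs its gapless hash map. The function name `overlap_join` is hypothetical, and the paper's framework covers far more predicates than this.

```python
def overlap_join(r_intervals, s_intervals):
    """Return (r_index, s_index) pairs whose half-open intervals overlap."""
    events = []
    for side, intervals in ((0, r_intervals), (1, s_intervals)):
        for idx, (start, end) in enumerate(intervals):
            events.append((start, 1, side, idx))  # kind 1: interval starts
            events.append((end, 0, side, idx))    # kind 0: interval ends (sorts first on ties)
    events.sort()

    active = (set(), set())  # currently open intervals of R and of S
    result = []
    for _, kind, side, idx in events:
        if kind == 0:
            active[side].discard(idx)
        else:
            # A starting interval overlaps every open interval of the other relation.
            for other in active[1 - side]:
                result.append((idx, other) if side == 0 else (other, idx))
            active[side].add(idx)
    return result
```

Because end events sort before start events at the same timestamp, intervals that merely touch, such as `(6, 9)` and `(9, 12)`, are not reported as overlapping.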


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Jens Zentgraf ◽  
Sven Rahmann

Motivation: With an increasing number of patient-derived xenograft (PDX) models being created and subsequently sequenced to study tumor heterogeneity and to guide therapy decisions, there is a similarly increasing need for methods to separate reads originating from the graft (human) tumor and reads originating from the host species’ (mouse) surrounding tissue. Two kinds of methods are in use: On the one hand, alignment-based tools require that reads are mapped and aligned (by an external mapper/aligner) to the host and graft genomes separately first; the tool itself then processes the resulting alignments and quality metrics (typically BAM files) to assign each read or read pair. On the other hand, alignment-free tools work directly on the raw read data (typically FASTQ files). Recent studies compare different approaches and tools, with varying results. Results: We show that alignment-free methods for xenograft sorting are superior concerning CPU time usage and equivalent in accuracy. We improve upon the state of the art sorting by presenting a fast lightweight approach based on three-way bucketed quotiented Cuckoo hashing. Our hash table requires memory comparable to an FM index typically used for read alignment and less than other alignment-free approaches. It allows extremely fast lookups and uses less CPU time than other alignment-free methods and alignment-based methods at similar accuracy. Several engineering steps (e.g., shortcuts for unsuccessful lookups, software prefetching) improve the performance even further. Availability: Our software xengsort is available under the MIT license at http://gitlab.com/genomeinformatics/xengsort. It is written in numba-compiled Python and comes with sample Snakemake workflows for hash table construction and dataset processing.


2021 ◽  
Vol 14 (5) ◽  
pp. 785-798
Author(s):  
Daokun Hu ◽  
Zhiwen Chen ◽  
Jianbing Wu ◽  
Jianhua Sun ◽  
Hao Chen

Persistent memory (PM) is increasingly being leveraged to build hash-based indexing structures featuring cheap persistence, high performance, and instant recovery, especially with the recent release of Intel Optane DC Persistent Memory Modules. However, most of them are evaluated on DRAM-based emulators with unreal assumptions, or focus on the evaluation of specific metrics with important properties sidestepped. Thus, it is essential to understand how well the proposed hash indexes perform on real PM and how they differentiate from each other if a wider range of performance metrics are considered. To this end, this paper provides a comprehensive evaluation of persistent hash tables. In particular, we focus on the evaluation of six state-of-the-art hash tables including Level hashing, CCEH, Dash, PCLHT, Clevel, and SOFT, with real PM hardware. Our evaluation was conducted using a unified benchmarking framework and representative workloads. Besides characterizing common performance properties, we also explore how hardware configurations (such as PM bandwidth, CPU instructions, and NUMA) affect the performance of PM-based hash tables. With our in-depth analysis, we identify design trade-offs and good paradigms in prior arts, and suggest desirable optimizations and directions for the future development of PM-based hash tables.


2019 ◽  
Vol 1 ◽  
pp. 1-1
Author(s):  
Yaqian Chen ◽  
Jiangfeng She ◽  
Xingong Li

Cost distance is one of the fundamental functions in geographic information systems (GIS), which has been used in various applications such as route planning, construction of Thiessen polygons and distance weighted interpolation. Conventional 2D cost distance function, due to its limited movement directions (either 4 or 8 neighbourhood cells) in the raster data model, overestimates the least cost and the problem is especially severe with a homogeneous friction surface. 3D cost distance function removes the limitation that movement must occur on a planar surface. It can therefore take into account tunnels and bridges when calculating least cost paths. In addition, it can also be used in many other application domains which deal with 3D geospatial data such as in atmospheric science, geology, and oceanography. Based on the method in Tomlin (2010), which can completely eliminate the overestimation when traveling on a homogeneous friction surface, this research proposes an algorithm that calculates accurate least cost with both homogeneous and heterogeneous friction in 3D space. When extending the cost distance function from 2D to 3D, the number of voxels in the propagation front increases significantly and efficiency is an imperative issue. This research also improves the computational efficiency by developing a data structure that combines a binary heap and a hash table. Our results show that the proposed algorithm can calculate accurate 3D cost distance in a homogeneous friction space, and the proposed data structure (i.e., heap plus hash table) not only significantly reduces the algorithm’s runtime but also benefits more in 3D than in 2D. In addition, we have applied the method in a 3D drone delivery routing application in a city environment (Figure 1). Additional applications, such as calculating groundwater flow paths of least hydraulic resistance in a heterogeneous 3D hydraulic conductivity field, are currently under development.
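
The pairing of a binary heap with a hash table can be illustrated with a minimal sketch. It uses plain Dijkstra propagation over a 26-neighbourhood voxel grid, which is the conventional scheme and not the Tomlin-based algorithm the abstract proposes, so it still overestimates in the way the abstract describes. The function name, the dictionary-based friction grid, and the averaged-friction step cost are all assumptions made for illustration.

```python
import heapq
from itertools import product
from math import sqrt


def cost_distance_3d(friction, source):
    """friction maps (x, y, z) -> positive friction; missing voxels are impassable."""
    best = {source: 0.0}        # hash table: voxel -> least cost found so far
    frontier = [(0.0, source)]  # binary heap: propagation front ordered by cost
    steps = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    while frontier:
        cost, voxel = heapq.heappop(frontier)
        if cost > best[voxel]:
            continue  # stale heap entry, a cheaper path was already found
        for dx, dy, dz in steps:
            nxt = (voxel[0] + dx, voxel[1] + dy, voxel[2] + dz)
            if nxt not in friction:
                continue
            # Step cost: Euclidean step length times the mean friction of both voxels.
            step = sqrt(dx * dx + dy * dy + dz * dz) * (friction[voxel] + friction[nxt]) / 2
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt))
    return best
```

The hash table only holds voxels that the front has actually reached, which matters in 3D, where a dense cost array for the whole volume can be far larger than the visited region.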


2013 ◽  
Vol 22 (3) ◽  
pp. 455-476
Author(s):  
NICLAS PETERSSON

In this paper we study the maximum displacement for linear probing hashing. We use the standard probabilistic model together with the insertion policy known as First-Come-(First-Served). The results are of asymptotic nature and focus on dense hash tables. That is, the number of occupied cells n and the size of the hash table m tend to infinity with ratio n/m → 1. We present distributions and moments for the size of the maximum displacement, as well as for the number of items with displacement larger than some critical value. This is done via process convergence of the (appropriately normalized) length of the largest block of consecutive occupied cells, when the total number of occupied cells n varies.
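
The quantity under study can be observed empirically with a small simulation. This is a minimal sketch that assumes uniformly random home cells and first-come-first-served insertion, where an arriving item probes forward to the first free cell and never moves earlier items. The function name `max_displacement` and the seeding scheme are illustrative choices, and the simulation says nothing about the asymptotic distributions derived in the paper.

```python
import random


def max_displacement(n, m, seed=0):
    """Insert n items into m cells with linear probing; return the largest displacement."""
    if n > m:
        raise ValueError("cannot place more items than there are cells")
    rng = random.Random(seed)
    occupied = [False] * m
    worst = 0
    for _ in range(n):
        home = rng.randrange(m)  # uniformly random home cell
        d = 0
        # First-come-first-served: probe forward, wrapping around, to the first free cell.
        while occupied[(home + d) % m]:
            d += 1
        occupied[(home + d) % m] = True
        worst = max(worst, d)
    return worst
```

Running it with a fixed `m` and increasing `n` shows how the maximum displacement grows as the table approaches the dense regime n/m → 1.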


Algorithms ◽  
2020 ◽  
Vol 13 (12) ◽  
pp. 338
Author(s):  
Ting Huang ◽  
Zhengping Weng ◽  
Gang Liu ◽  
Zhenwen He

To manage multidimensional point data more efficiently, this paper presents the HD-tree, an improvement of a previous indexing method called the D-tree. Both structures combine quadtree-like partitioning (using integer shift operations and storing only leaves, not internal nodes) with hash tables (for looking up the stored nodes). However, the HD-tree follows a brand-new decomposition strategy, called the half decomposition strategy. This improvement avoids generating nodes that contain only a small amount of data and avoids sequential searches of the hash table, so it saves storage space while achieving faster I/O and better time performance when building the tree and querying data. The results demonstrate convincingly that the time and space performance of the HD-tree is better than that of the D-tree for both uniform and uneven data, and that the HD-tree is less affected by data distribution.

