Succinct range filters

Huanchen Zhang; Hyeontaek Lim; Viktor Leis; David G. Andersen; Michael Kaminsky; Kimberly Keeton; Andrew Pavlo

doi:10.1145/3450262

Communications of the ACM ◽ 10.1145/3450262 ◽ 2021 ◽ Vol 64 (4) ◽ pp. 166-173

Author(s): Huanchen Zhang ◽ Hyeontaek Lim ◽ Viktor Leis ◽ David G. Andersen ◽ Michael Kaminsky ◽ ...

Keyword(s): Information Theory ◽ Data Structure ◽ State Of The Art ◽ Bloom Filters ◽ Range Queries ◽ Database Storage

We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries, such as range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the performance of state-of-the-art order-preserving indexes while consuming only 10 bits per trie node, a space close to the minimum required by information theory. Our experiments show that SuRF speeds up range queries in a widely used database storage engine by up to 5×.
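
The idea of an approximate range filter can be illustrated with a deliberately simplified sketch. The code below is not SuRF and does not use a Fast Succinct Trie: it assumes a plain sorted list of fixed-length key prefixes, and the class name `PrefixRangeFilter` and the parameter `prefix_len` are hypothetical. It only shows the one-sided error that such filters share with Bloom filters: a negative answer is definite, a positive answer may be a false positive.

```python
from bisect import bisect_left


class PrefixRangeFilter:
    """Toy range filter (an assumption-laden stand-in, not SuRF itself).

    Only the first `prefix_len` characters of each key are kept, so the
    filter is smaller than the key set but can report false positives.
    It never reports a false negative, because truncating strings
    preserves their (non-strict) lexicographic order.
    """

    def __init__(self, keys, prefix_len=3):
        self.prefix_len = prefix_len
        # Sorted, de-duplicated prefixes stand in for the truncated trie.
        self.prefixes = sorted({k[:prefix_len] for k in keys})

    def may_contain(self, key):
        """Point query: False means the key is definitely absent."""
        p = key[: self.prefix_len]
        i = bisect_left(self.prefixes, p)
        return i < len(self.prefixes) and self.prefixes[i] == p

    def may_contain_range(self, low, high):
        """Range query: False means no stored key lies in [low, high]."""
        lo_p = low[: self.prefix_len]
        hi_p = high[: self.prefix_len]
        # The smallest stored prefix >= lo_p must not exceed hi_p.
        i = bisect_left(self.prefixes, lo_p)
        return i < len(self.prefixes) and self.prefixes[i] <= hi_p


f = PrefixRangeFilter(["apple", "apricot", "banana", "cherry"])
print(f.may_contain("application"))       # false positive: shares prefix "app"
print(f.may_contain_range("bb", "bz"))    # no stored key in this range
```

A storage engine would consult such a filter before reading a block from disk and skip the read whenever the filter answers False.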

Succinct Encoding of Binary Strings Representing Triangulations

Algorithmica ◽ 10.1007/s00453-021-00861-4 ◽ 2021

Author(s): José Fuentes-Sepúlveda ◽ Diego Seco ◽ Raquel Viaña

Keyword(s): Information Theory ◽ Data Structure ◽ Experimental Evaluation ◽ Special Class ◽ Spanning Trees ◽ State Of The Art ◽ Succinct Data Structure ◽ Planar Embeddings ◽ Specific Sequences ◽ Binary Strings

We consider the problem of designing a succinct data structure for representing the connectivity of planar triangulations. The main result is a new succinct encoding achieving the information-theory optimal bound of 3.24 bits per vertex, while allowing efficient navigation. Our representation is based on the bijection of Poulalhon and Schaeffer (Algorithmica, 46(3):505–527, 2006) that defines a mapping between planar triangulations and a special class of spanning trees, called PS-trees. The proposed solution differs from previous approaches in that operations in planar triangulations are reduced to operations in particular parentheses sequences encoding PS-trees. Existing methods to handle balanced parentheses sequences have to be combined and extended to operate on such specific sequences, essentially for retrieving matching elements. The new encoding supports extracting the d neighbors of a query vertex in O(d) time and testing adjacency between two vertices in O(1) time. Additionally, we provide an implementation of our proposed data structure. In the experimental evaluation, our representation reaches up to 7.35 bits per vertex, improving the space usage of state-of-the-art implementations for planar embeddings.
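
The "retrieving matching elements" primitive mentioned above can be illustrated on an ordinary balanced parentheses sequence. This is a minimal sketch under simplifying assumptions: it uses a plain string and a linear scan, whereas succinct implementations answer the same query in constant time with a small auxiliary index, and the PS-tree sequences of the paper are a more specific kind of sequence than the one shown here. The function names `find_close` and `subtree_size` are illustrative.

```python
def find_close(bp, i):
    """Return the index of the ')' matching the '(' at position i.

    Naive linear scan over a balanced parentheses string; shown only to
    make the navigation primitive concrete.
    """
    if bp[i] != "(":
        raise ValueError("position i must hold an opening parenthesis")
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("sequence is not balanced")


def subtree_size(bp, i):
    """Number of tree nodes in the subtree whose '(' is at position i."""
    return (find_close(bp, i) - i + 1) // 2


# A root with two children; the second child has one child of its own.
tree = "(()(()))"
print(find_close(tree, 3), subtree_size(tree, 3))
```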

Suffix array for multi-pattern matching with variable length wildcards

Intelligent Data Analysis ◽ 10.3233/ida-205087 ◽ 2021 ◽ Vol 25 (2) ◽ pp. 283-303

Author(s): Na Liu ◽ Fei Xie ◽ Xindong Wu

Keyword(s): Dynamic Programming ◽ Data Structure ◽ Pattern Matching ◽ Edit Distance ◽ State Of The Art ◽ Suffix Array ◽ Variable Length ◽ Distance Method ◽ Efficient Data ◽ Comparison Algorithms

Approximate multi-pattern matching is an important and widely used operation, particularly when the patterns contain variable-length wildcards. This paper proposes two suffix array-based algorithms for the problem. Existing studies have shown the suffix array to be an efficient data structure for exact string matching, as well as for approximate pattern matching and multi-pattern matching. The first algorithm, MMSA-S, handles the short exact characters in a pattern by dynamic programming, while the second, MMSA-L, handles the long exact characters by the edit distance method. Experimental results on the Pizza & Chili corpus show that, in most cases, the two proposed algorithms are more time-efficient than the state-of-the-art comparison algorithms.
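
As background for the suffix-array approach, the sketch below shows exact matching with a suffix array and a naive way to join two exact segments across a variable-length gap. It is not MMSA-S or MMSA-L and does no approximate matching; the construction by sorting suffixes is quadratic and only suitable for small inputs. All function names and the `min_gap`/`max_gap` parameters are assumptions made for illustration.

```python
def build_suffix_array(text):
    """Suffix start positions in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of `pattern`, found by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]: sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]: sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def match_with_gap(text, sa, left, right, min_gap, max_gap):
    """Pairs (i, j): `left` at i, then min_gap..max_gap wildcards, then `right` at j."""
    rights = set(find_occurrences(text, sa, right))
    hits = []
    for i in find_occurrences(text, sa, left):
        end = i + len(left)
        for gap in range(min_gap, max_gap + 1):
            if end + gap in rights:
                hits.append((i, end + gap))
    return hits


text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))
print(match_with_gap(text, sa, "b", "na", 1, 3))
```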

Cache-efficient sweeping-based interval joins for extended Allen relation predicates

The VLDB Journal ◽ 10.1007/s00778-020-00650-5 ◽ 2021

Author(s): Danila Piatov ◽ Sven Helmer ◽ Anton Dignös ◽ Fabio Persia

Keyword(s): Data Structure ◽ Experimental Evaluation ◽ State Of The Art ◽ Temporal Databases ◽ Access Method ◽ Wide Range ◽ Interval Relation ◽ Cache Efficient ◽ Join Algorithms ◽ Better Than

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
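
A basic plane-sweep join for a single predicate (overlap) can be sketched as follows. This is a generic textbook-style sweep, not the authors' framework: it uses ordinary Python sets where the paper uses the gapless hash map, it does not involve the Timeline Index, and it assumes half-open intervals `[start, end)` with `start < end`. The function name `overlap_join` is hypothetical.

```python
def overlap_join(r, s):
    """Return index pairs (i, j) where r[i] and s[j] overlap.

    Endpoints are turned into events and swept in time order. At equal
    timestamps, end events (kind 0) sort before start events (kind 1),
    so intervals that merely touch are not reported.
    """
    events = []
    for rel, intervals in ((0, r), (1, s)):
        for idx, (start, end) in enumerate(intervals):
            events.append((start, 1, rel, idx))
            events.append((end, 0, rel, idx))
    events.sort()

    active = (set(), set())  # currently open intervals of r and of s
    result = []
    for _, kind, rel, idx in events:
        if kind == 0:
            active[rel].discard(idx)
        else:
            # A newly opened interval overlaps every open interval
            # of the other relation.
            for other in active[1 - rel]:
                result.append((idx, other) if rel == 0 else (other, idx))
            active[rel].add(idx)
    return sorted(result)


print(overlap_join([(1, 5), (6, 9)], [(4, 7), (9, 12)]))
```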

Spaced Seed Data Structures for De Novo Assembly

International Journal of Genomics ◽ 10.1155/2015/196591 ◽ 2015 ◽ Vol 2015 ◽ pp. 1-8 ◽ Cited By ~ 3

Author(s): Inanç Birol ◽ Justin Chu ◽ Hamid Mohamadi ◽ Shaun D. Jackman ◽ Karthika Raghavan ◽ ...

Keyword(s): Data Structure ◽ Data Structures ◽ De Novo ◽ Bloom Filters ◽ De Bruijn Graph ◽ Sequence Specificity ◽ Sequencing Errors ◽ Spaced Seeds ◽ Read Error Correction ◽ Seed Data

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Although longer subsequences unlock longer genomic features for assembly, the associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
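
The combination of spaced seeds and Bloom filters can be made concrete with a small sketch. It assumes a spaced seed is a pair of length-`k` subsequences separated by a fixed `gap`, stored as their concatenation; the `BloomFilter` class, its sizing defaults, and the SHA-256-based hashing are illustrative choices, not the designs described in the paper.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; sizes and hashing scheme are illustrative."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def spaced_seeds(read, k, gap):
    """Yield paired k-mers separated by `gap` ignored bases."""
    span = 2 * k + gap
    for i in range(len(read) - span + 1):
        yield read[i: i + k] + read[i + k + gap: i + span]


seen = BloomFilter()
for seed in spaced_seeds("ACGTACGTAC", k=3, gap=2):
    seen.add(seed)

# A read with a substitution inside the gap yields the same first seed.
mutated_first = next(spaced_seeds("ACGTTCGTAC", k=3, gap=2))
print(mutated_first in seen)
```

Because the bases inside the gap are ignored, a sequencing error that falls in the gap does not change the seed, which is the tolerance the abstract refers to.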

A Multi-attribute Data Structure with Parallel Bloom Filters for Network Services

High Performance Computing - HiPC 2006 - Lecture Notes in Computer Science ◽ 10.1007/11945918_30 ◽ 2006 ◽ pp. 277-288 ◽ Cited By ~ 9

Author(s): Yu Hua ◽ Bin Xiao

Keyword(s): Data Structure ◽ Bloom Filters ◽ Network Services ◽ Attribute Data

A Graph-Based Author Name Disambiguation Method and Analysis via Information Theory

Entropy ◽ 10.3390/e22040416 ◽ 2020 ◽ Vol 22 (4) ◽ pp. 416

Author(s): Yingying Ma ◽ Youlong Wu ◽ Chengqiang Lu

Keyword(s): Information Theory ◽ Information Integration ◽ Web Search ◽ State Of The Art ◽ Real Life ◽ Representation Learning ◽ Document Retrieval ◽ Name Disambiguation ◽ Author Name Disambiguation ◽ And Training

Name ambiguity, due to the fact that many people share an identical name, often deteriorates the performance of information integration, document retrieval and web search. In academic data analysis, author name ambiguity usually decreases the analysis performance. To solve this problem, an author name disambiguation task is designed to divide documents related to an author name reference into several parts and each part is associated with a real-life person. Existing methods usually use either attributes of documents or relationships between documents and co-authors. However, methods of feature extraction using attributes cause inflexibility of models while solutions based on relationship graph network ignore the information contained in the features. In this paper, we propose a novel name disambiguation model based on representation learning which incorporates attributes and relationships. Experiments on a public real dataset demonstrate the effectiveness of our model and experimental results demonstrate that our solution is superior to several state-of-the-art graph-based methods. We also increase the interpretability of our method through information theory and show that the analysis could be helpful for model selection and training progress.

Recursive Lists of Clusters: A Dynamic Data Structure for Range Queries in Metric Spaces

Computer and Information Sciences - ISCIS 2005 - Lecture Notes in Computer Science ◽ 10.1007/11569596_86 ◽ 2005 ◽ pp. 843-853 ◽ Cited By ~ 4

Author(s): Margarida Mamede

Keyword(s): Data Structure ◽ Metric Spaces ◽ Range Queries ◽ Dynamic Data ◽ Dynamic Data Structure

A log log n data structure for three-sided range queries

Information Processing Letters ◽ 10.1016/0020-0190(87)90174-8 ◽ 1987 ◽ Vol 25 (4) ◽ pp. 269-273 ◽ Cited By ~ 16

Author(s): O. Fries ◽ K. Mehlhorn ◽ S. Näher ◽ A. Tsakalidis

Keyword(s): Data Structure ◽ Range Queries

Productive corecursion in logic programming

Theory and Practice of Logic Programming ◽ 10.1017/s147106841700028x ◽ 2017 ◽ Vol 17 (5-6) ◽ pp. 906-923 ◽ Cited By ~ 5

Author(s): Ekaterina Komendantskaya ◽ Yue Li

Keyword(s): Data Structure ◽ Logic Programming ◽ State Of The Art ◽ Logic Program ◽ Stream Processing ◽ The Internet ◽ Logic Programs ◽ Algorithmic Solution ◽ Undecidable Property ◽ Turing Complete

Logic Programming is a Turing complete language. As a consequence, designing algorithms that decide termination and non-termination of programs or decide inductive/coinductive soundness of formulae is a challenging task. For example, the existing state-of-the-art algorithms can only semi-decide coinductive soundness of queries in logic programming for regular formulae. Another, less famous, but equally fundamental and important undecidable property is productivity. If a derivation is infinite and coinductively sound, we may ask whether the computed answer it determines actually computes an infinite formula. If it does, the infinite computation is productive. This intuition was first expressed under the name of computations at infinity in the 1980s. In modern days of the Internet and stream processing, its importance lies in connection to infinite data structure processing. Recently, an algorithm was presented that semi-decides a weaker property: productivity of logic programs. A logic program is productive if it can give rise to productive derivations. In this paper, we strengthen these recent results. We propose a method that semi-decides productivity of individual derivations for regular formulae. Thus, we at last give an algorithmic counterpart to the notion of productivity of derivations in logic programming. This is the first algorithmic solution to the problem since it was raised more than 30 years ago. We also present an implementation of this algorithm.

Voxelisation Algorithms and Data Structures: A Review

Sensors ◽ 10.3390/s21248241 ◽ 2021 ◽ Vol 21 (24) ◽ pp. 8241

Author(s): Mitko Aleksandrov ◽ Sisi Zlatanova ◽ David J. Heslop

Keyword(s): Data Structure ◽ Data Structures ◽ State Of The Art ◽ Algorithms And Data Structures ◽ 3D Objects ◽ Digital Representations ◽ Memory Footprint ◽ Voxel Data ◽ 2D And 3D ◽ Geometric Primitives

Voxel-based data structures, algorithms, frameworks, and interfaces have been used in computer graphics and many other applications for decades. There is a general necessity to seek adequate digital representations, such as voxels, that would secure unified data structures, multi-resolution options, robust validation procedures and flexible algorithms for different 3D tasks. In this review, we evaluate the most common properties and algorithms for voxelisation of 2D and 3D objects. Thus, many voxelisation algorithms and their characteristics are presented targeting points, lines, triangles, surfaces and solids as geometric primitives. For lines, we identify three groups of algorithms, where the first two achieve different voxelisation connectivity, while the third one presents voxelisation of curves. We can say that surface voxelisation is a more desired voxelisation type compared to solid voxelisation, as it can be achieved faster and requires less memory if voxels are stored in a sparse way. At the same time, we evaluate in the paper the available voxel data structures. We split all data structures into static and dynamic grids considering the frequency to update a data structure. Static grids are dominated by SVO-based data structures focusing on memory footprint reduction and attributes preservation, where SVDAG and SSVDAG are the most advanced methods. The state-of-the-art dynamic voxel data structure is NanoVDB which is superior to the rest in terms of speed as well as support for out-of-core processing and data management, which is the key to handling large dynamically changing scenes. Overall, we can say that this is the first review evaluating the available voxelisation algorithms for different geometric primitives as well as voxel data structures.
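
The simplest case covered by the review, voxelising points into a sparsely stored grid, can be sketched in a few lines. The sketch assumes an axis-aligned grid anchored at the origin with cubic voxels, and stores only occupied voxels in a dictionary; it is unrelated to SVO, SVDAG, SSVDAG, or NanoVDB, and the function name `voxelise_points` is hypothetical.

```python
import math


def voxelise_points(points, voxel_size):
    """Map 3D points to integer voxel indices, counting points per voxel.

    Only occupied voxels appear in the returned dictionary, which is the
    sparse storage idea in its most basic form.
    """
    grid = {}
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        grid[key] = grid.get(key, 0) + 1
    return grid


cloud = [(0.1, 0.2, 0.3), (0.4, 0.4, 0.4), (1.2, 0.0, -0.1)]
print(voxelise_points(cloud, voxel_size=0.5))
```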
