Succinct range filters

2021 ◽  
Vol 64 (4) ◽  
pp. 166-173
Author(s):  
Huanchen Zhang ◽  
Hyeontaek Lim ◽  
Viktor Leis ◽  
David G. Andersen ◽  
Michael Kaminsky ◽  
...  

We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries, such as range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node---a space close to the minimum required by information theory. Our experiments show that SuRF speeds up range queries in a widely used database storage engine by up to 5×.

Algorithmica ◽  
2021 ◽  
Author(s):  
José Fuentes-Sepúlveda ◽  
Diego Seco ◽  
Raquel Viaña

AbstractWe consider the problem of designing a succinct data structure for representing the connectivity of planar triangulations. The main result is a new succinct encoding achieving the information-theory optimal bound of 3.24 bits per vertex, while allowing efficient navigation. Our representation is based on the bijection of Poulalhon and Schaeffer (Algorithmica, 46(3):505–527, 2006) that defines a mapping between planar triangulations and a special class of spanning trees, called PS-trees. The proposed solution differs from previous approaches in that operations in planar triangulations are reduced to operations in particular parentheses sequences encoding PS-trees. Existing methods to handle balanced parentheses sequences have to be combined and extended to operate on such specific sequences, essentially for retrieving matching elements. The new encoding supports extracting the d neighbors of a query vertex in O(d) time and testing adjacency between two vertices in O(1) time. Additionally, we provide an implementation of our proposed data structure. In the experimental evaluation, our representation reaches up to 7.35 bits per vertex, improving the space usage of state-of-the-art implementations for planar embeddings.


2021 ◽  
Vol 25 (2) ◽  
pp. 283-303
Author(s):  
Na Liu ◽  
Fei Xie ◽  
Xindong Wu

Approximate multi-pattern matching is an important issue that is widely and frequently utilized, when the pattern contains variable-length wildcards. In this paper, two suffix array-based algorithms have been proposed to solve this problem. Suffix array is an efficient data structure for exact string matching in existing studies, as well as for approximate pattern matching and multi-pattern matching. An algorithm called MMSA-S is for the short exact characters in a pattern by dynamic programming, while another algorithm called MMSA-L deals with the long exact characters by the edit distance method. Experimental results of Pizza & Chili corpus demonstrate that these two newly proposed algorithms, in most cases, are more time-efficient than the state-of-the-art comparison algorithms.


2021 ◽  
Author(s):  
Danila Piatov ◽  
Sven Helmer ◽  
Anton Dignös ◽  
Fabio Persia

AbstractWe develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.


2015 ◽  
Vol 2015 ◽  
pp. 1-8 ◽  
Author(s):  
Inanç Birol ◽  
Justin Chu ◽  
Hamid Mohamadi ◽  
Shaun D. Jackman ◽  
Karthika Raghavan ◽  
...  

De novoassembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.


Entropy ◽  
2020 ◽  
Vol 22 (4) ◽  
pp. 416
Author(s):  
Yingying Ma ◽  
Youlong Wu ◽  
Chengqiang Lu

Name ambiguity, due to the fact that many people share an identical name, often deteriorates the performance of information integration, document retrieval and web search. In academic data analysis, author name ambiguity usually decreases the analysis performance. To solve this problem, an author name disambiguation task is designed to divide documents related to an author name reference into several parts and each part is associated with a real-life person. Existing methods usually use either attributes of documents or relationships between documents and co-authors. However, methods of feature extraction using attributes cause inflexibility of models while solutions based on relationship graph network ignore the information contained in the features. In this paper, we propose a novel name disambiguation model based on representation learning which incorporates attributes and relationships. Experiments on a public real dataset demonstrate the effectiveness of our model and experimental results demonstrate that our solution is superior to several state-of-the-art graph-based methods. We also increase the interpretability of our method through information theory and show that the analysis could be helpful for model selection and training progress.


1987 ◽  
Vol 25 (4) ◽  
pp. 269-273 ◽  
Author(s):  
O. Fries ◽  
K. Mehlhorn ◽  
S. Näher ◽  
A. Tsakalidis
Keyword(s):  

2017 ◽  
Vol 17 (5-6) ◽  
pp. 906-923 ◽  
Author(s):  
EKATERINA KOMENDANTSKAYA ◽  
YUE LI

AbstractLogic Programming is a Turing complete language. As a consequence, designing algorithms that decide termination and non-termination of programs or decide inductive/coinductive soundness of formulae is a challenging task. For example, the existing state-of-the-art algorithms can only semi-decide coinductive soundness of queries in logic programming for regular formulae. Another, less famous, but equally fundamental and important undecidable property is productivity. If a derivation is infinite and coinductively sound, we may ask whether the computed answer it determines actually computes an infinite formula. If it does, the infinite computation is productive. This intuition was first expressed under the name of computations at infinity in the 80s. In modern days of the Internet and stream processing, its importance lies in connection to infinite data structure processing. Recently, an algorithm was presented that semi-decides a weaker property – of productivity of logic programs. A logic program is productive if it can give rise to productive derivations. In this paper, we strengthen these recent results. We propose a method that semi-decides productivity of individual derivations for regular formulae. Thus, we at last give an algorithmic counterpart to the notion of productivity of derivations in logic programming. This is the first algorithmic solution to the problem since it was raised more than 30 years ago. We also present an implementation of this algorithm.


Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8241
Author(s):  
Mitko Aleksandrov ◽  
Sisi Zlatanova ◽  
David J. Heslop

Voxel-based data structures, algorithms, frameworks, and interfaces have been used in computer graphics and many other applications for decades. There is a general necessity to seek adequate digital representations, such as voxels, that would secure unified data structures, multi-resolution options, robust validation procedures and flexible algorithms for different 3D tasks. In this review, we evaluate the most common properties and algorithms for voxelisation of 2D and 3D objects. Thus, many voxelisation algorithms and their characteristics are presented targeting points, lines, triangles, surfaces and solids as geometric primitives. For lines, we identify three groups of algorithms, where the first two achieve different voxelisation connectivity, while the third one presents voxelisation of curves. We can say that surface voxelisation is a more desired voxelisation type compared to solid voxelisation, as it can be achieved faster and requires less memory if voxels are stored in a sparse way. At the same time, we evaluate in the paper the available voxel data structures. We split all data structures into static and dynamic grids considering the frequency to update a data structure. Static grids are dominated by SVO-based data structures focusing on memory footprint reduction and attributes preservation, where SVDAG and SSVDAG are the most advanced methods. The state-of-the-art dynamic voxel data structure is NanoVDB which is superior to the rest in terms of speed as well as support for out-of-core processing and data management, which is the key to handling large dynamically changing scenes. Overall, we can say that this is the first review evaluating the available voxelisation algorithms for different geometric primitives as well as voxel data structures.


Sign in / Sign up

Export Citation Format

Share Document