similarity join
Recently Published Documents


Total documents: 182 (five years: 8)
H-index: 17 (five years: 0)

2021
Author(s): Jianhua Wang, Jianye Yang, Wenjie Zhang

2021, Vol 12 (3)
Author(s): Leonardo Andrade Ribeiro, Felipe Ferreira Borges, Diego Oliveira

We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and, thus, miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple, yet effective data structure, for which exact and probabilistic implementations are evaluated. In this context, we devise a cost model to identify the best attribute ordering to reduce processing time. Moreover, alternative approaches are also investigated and a new algorithm combining key ideas from previous work is introduced. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.
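To make the idea of filtering on several attributes concrete, here is a minimal Python sketch. It is not the framework or the data structure from the article: it assumes records whose attributes are token sets, a Jaccard threshold per attribute, and a plain size filter as a stand-in for the lightweight filter the abstract mentions. The names `jaccard`, `passes_size_filter`, and `multi_attribute_join` are hypothetical.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def passes_size_filter(x, y, threshold):
    """Cheap necessary condition for jaccard(x, y) >= threshold.

    The intersection is at most the smaller size and the union is at least
    the larger size, so the similarity can never exceed small / large.
    """
    small, large = sorted((len(x), len(y)))
    if large == 0:
        return True  # both sets are empty
    return small / large >= threshold


def multi_attribute_join(records, thresholds):
    """Naive self-join: a pair qualifies only if every attribute is similar enough.

    `records` is a list of dicts mapping attribute name -> set of tokens.
    `thresholds` maps attribute name -> Jaccard threshold (assumed layout).
    """
    result = []
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        # Apply the cheap filter on all attributes before any set operation.
        if not all(passes_size_filter(r[a], s[a], t) for a, t in thresholds.items()):
            continue
        if all(jaccard(r[a], s[a]) >= t for a, t in thresholds.items()):
            result.append((i, j))
    return result


records = [
    {"title": {"set", "similarity", "join"}, "authors": {"ribeiro", "borges"}},
    {"title": {"set", "similarity", "joins"}, "authors": {"ribeiro", "borges", "oliveira"}},
    {"title": {"gpu", "join"}, "authors": {"gallet"}},
]
thresholds = {"title": 0.5, "authors": 0.6}
matches = multi_attribute_join(records, thresholds)
```

In this toy input the third record is pruned by the size filter on `authors` alone, so no intersection is ever computed for it. The order in which attributes are checked matters for cost, which is the kind of decision the cost model described in the abstract is meant to make.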


2021, pp. 100267
Author(s): Lining Yu, Tiezheng Nie, Derong Shen, Yue Kou

Author(s): Yanchuan Chang, Jianzhong Qi, Egemen Tanin, Xingjun Ma, Hanan Samet

Author(s): Chengyuan Zhang, Fangxin Xie, Hao Yu, Jianfeng Zhang, Lei Zhu, ...

Author(s): Chengcheng Yang, Lisi Chen, Hao Wang, Shuo Shang, Rui Mao, ...

Author(s): Benoit Gallet, Michael Gowanlock

Given two datasets (or tables) $A$ and $B$ and a search distance $\epsilon$, the distance similarity join, denoted as $A \ltimes_\epsilon B$, finds the pairs of points $(p_a, p_b)$, where $p_a \in A$ and $p_b \in B$, such that the distance between $p_a$ and $p_b$ is $\le \epsilon$. If $A = B$, then the similarity join is equivalent to a similarity self-join, denoted as $A \bowtie_\epsilon A$. In this paper we propose Heterogeneous Epsilon Grid Joins (HEGJoin), a heterogeneous CPU-GPU distance similarity join algorithm. Efficiently partitioning the work between the CPU and the GPU is a challenge: the work partitioning strategy needs to consider the different characteristics and computational throughput of the two processors, as well as the data-dependent nature of the similarity join, which affects the overall execution time (e.g., the number of queries, their distribution, and the dimensionality). In addition to HEGJoin, we design one dynamic and two static work partitioning strategies. We also propose a performance model for each static partitioning strategy to distribute the work between the processors. We evaluate the performance of all three partitioning methods using the execution time and the load imbalance between the CPU and GPU as performance metrics. HEGJoin achieves a speedup of up to $5.46\times$ ($3.97\times$) over the GPU-only (CPU-only) algorithms on our first test platform and up to $1.97\times$ ($12.07\times$) over the GPU-only (CPU-only) algorithms on our second test platform.
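The distance similarity join defined in this abstract can be illustrated with a small grid-based sketch in Python. It is a single-threaded illustration only, not HEGJoin itself: it does not split work between CPU and GPU, and the function names `cell_of` and `epsilon_grid_join`, the use of Euclidean distance, and the tuple point layout are assumptions made for the example. It assumes $\epsilon > 0$.

```python
import math
from collections import defaultdict
from itertools import product


def cell_of(point, eps):
    """Map a point to the integer coordinates of its grid cell (cell side = eps)."""
    return tuple(math.floor(c / eps) for c in point)


def epsilon_grid_join(A, B, eps):
    """Return all pairs (p_a, p_b) with Euclidean distance <= eps.

    Points are tuples of floats with the same dimensionality. B is indexed
    into a grid; each point of A is then compared only against points of B
    in its own cell and the adjacent cells.
    """
    grid = defaultdict(list)
    for p_b in B:
        grid[cell_of(p_b, eps)].append(p_b)

    dims = len(A[0]) if A else 0
    # Two points within eps differ by at most eps per coordinate, so their
    # cell indices differ by at most 1 in every dimension.
    offsets = list(product((-1, 0, 1), repeat=dims))

    pairs = []
    for p_a in A:
        home = cell_of(p_a, eps)
        for off in offsets:
            neighbor = tuple(h + o for h, o in zip(home, off))
            for p_b in grid.get(neighbor, ()):
                if math.dist(p_a, p_b) <= eps:
                    pairs.append((p_a, p_b))
    return pairs


A = [(0.0, 0.0), (5.0, 5.0)]
B = [(0.5, 0.5), (0.0, 2.0), (5.0, 5.9)]
pairs = epsilon_grid_join(A, B, eps=1.0)
```

Calling `epsilon_grid_join(A, A, eps)` gives the self-join case, which in this sketch also reports each point paired with itself and each pair in both orders. The number of neighbor cells grows as $3^d$ with the dimensionality $d$, one reason the abstract lists dimensionality among the data-dependent factors.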

