DREAM-Yara: An exact read mapper for very large databases with short update time

Mapping Intimacies ◽

10.1101/256354 ◽

2018 ◽

Cited By ~ 1

Author(s):

Temesgen Hailemariam Dadi ◽

Enrico Siragusa ◽

Vitor C. Piro ◽

Andreas Andrusch ◽

Enrico Seiler ◽

...

Keyword(s):

Computing Time ◽

Search Time ◽

Bloom Filter ◽

Bloom Filters ◽

Fast Search ◽

Approximate Search ◽

Large Sets ◽

Large Databases ◽

Very Large Databases ◽

Compressed Index

Abstract

Motivation: Mapping-based approaches have become limited in their application to very large sets of references, since computing an FM-index for very large databases (e.g. > 10 GB) has become a bottleneck. This affects many analyses that need such an index as an essential step for approximate matching of NGS reads to reference databases. For instance, in typical metagenomics analyses, the size of the reference sequences has become prohibitive for computing a single full-text index on standard machines. Even on large-memory machines, computing such an index takes about one day of computing time. As a result, indices are rarely updated. Hence, it is desirable to create an alternative way of indexing while preserving fast search times.

Results: To solve the index construction and update problem, we propose the DREAM (Dynamic seaRchablE pArallel coMpressed index) framework and provide an implementation. The first main contribution is the introduction of approximate search distributor directories via a novel use of Bloom filters. We combine several Bloom filters to form an interleaved Bloom filter and use this new data structure to quickly exclude reads for parts of the databases where they cannot match. This allows us to keep the databases in several indices, which can be easily rebuilt if parts are updated, while maintaining a fast search time. The second main contribution is an implementation of DREAM-Yara, a distributed version of a fully sensitive read mapper under the DREAM framework.

Availability: https://gitlab.com/pirovc/dream_yara/
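The interleaved Bloom filter idea can be illustrated with a minimal Python sketch. It is not the DREAM-Yara implementation: the class and function names, the hash construction, and the `min_hits` threshold are all assumptions made for illustration. Each row stores one bit per database bin, so a single lookup per hash function answers the membership question for every bin at once.

```python
import hashlib


class InterleavedBloomFilter:
    """Toy interleaved Bloom filter: one Bloom filter per bin, stored row-wise."""

    def __init__(self, bins, bits_per_bin, num_hashes=3):
        self.bins = bins
        self.bits_per_bin = bits_per_bin
        self.num_hashes = num_hashes
        # rows[i] is an integer bitmask; bit b is set if bin b set position i.
        self.rows = [0] * bits_per_bin

    def _positions(self, kmer):
        # Hypothetical hash family: BLAKE2b with a different salt per hash.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.bits_per_bin

    def add(self, kmer, bin_id):
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_id

    def query(self, kmer):
        """Return a bitmask of the bins that may contain the k-mer."""
        mask = (1 << self.bins) - 1
        for pos in self._positions(kmer):
            mask &= self.rows[pos]
        return mask


def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def candidate_bins(ibf, read, k, min_hits):
    """Bins in which at least min_hits k-mers of the read may occur."""
    counts = [0] * ibf.bins
    for kmer in kmers(read, k):
        mask = ibf.query(kmer)
        for b in range(ibf.bins):
            if mask >> b & 1:
                counts[b] += 1
    return [b for b, c in enumerate(counts) if c >= min_hits]


def build_ibf(references, k, bits_per_bin=1024):
    """Index each reference sequence as its own bin."""
    ibf = InterleavedBloomFilter(len(references), bits_per_bin)
    for bin_id, ref in enumerate(references):
        for kmer in kmers(ref, k):
            ibf.add(kmer, bin_id)
    return ibf
```

A read would then be mapped only against the indices of the bins returned by `candidate_bins`. Because Bloom filters have no false negatives, a bin that truly contains all k-mers of a read is never excluded; false positives only cost extra search work.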


Secure Privacy Preserving Record Linkage of Large Databases by Modified Bloom Filter Encodings

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.29 ◽

2017 ◽

Vol 1 (1) ◽

Cited By ~ 2

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Large Scale ◽

Bloom Filter ◽

Privacy Preserving ◽

Error Rates ◽

Bloom Filters ◽

Data Sets ◽

Research Subjects ◽

Practical Applications ◽

Large Databases

Abstract

Objective: In most European settings, record linkage across different institutions has to be based on personal identifiers such as names, birthday or place of birth. To protect the privacy of research subjects, the identifiers have to be encrypted. In practice, these identifiers show error rates of up to 20% per identifier; therefore, linking on encrypted identifiers usually implies the loss of large subsets of the databases. In many applications, this loss of cases is related to variables of interest for the subject matter of the study. Therefore, this kind of record linkage will generate biased estimates. These problems gave rise to techniques of Privacy Preserving Record Linkage (PPRL). Many different PPRL techniques have been suggested within the last 10 years, but very few of them are suitable for practical applications with large databases containing millions of records, as is typical for administrative or medical databases. One proven technique for PPRL in large-scale applications is PPRL based on Bloom filters.

Method: Using appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 million records. By the application of suitable blocking strategies, linking can be done in reasonable time.

Result: However, Bloom filters have been the subject of cryptographic attacks. Previous research has shown that the straightforward application of Bloom filters has a nonzero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers by different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance concerning precision, recall and re-identification risk on large databases.
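As a rough illustration of how Bloom filter encodings tolerate errors in identifiers, the following Python sketch encodes the character bigrams of a name with keyed hashes and compares two encodings with the Dice coefficient. This is only the basic, unhardened encoding, which the abstract describes as carrying a re-identification risk; the modified encodings it announces are not specified there and are not shown. The function names, the filter size, the number of hash functions and the secret key are illustrative assumptions.

```python
import hashlib
import hmac


def bigrams(value):
    """Character bigrams of a padded, lower-cased identifier."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(value, secret, size=1000, num_hashes=10):
    """Return the set of bit positions set in the Bloom filter for value."""
    bits = set()
    for gram in bigrams(value):
        for i in range(num_hashes):
            # Keyed hash, so the mapping cannot be recomputed without the secret.
            mac = hmac.new(secret, f"{i}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(mac[:8], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient of two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"shared-secret"  # hypothetical key agreed between data holders
    print(dice(encode("Smith", key), encode("Smyth", key)))
    print(dice(encode("Smith", key), encode("Jones", key)))
```

Names that differ by a typing error share most of their bigrams and therefore most of their set bits, so their similarity stays high even though the identifiers are never compared in clear text.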


Comparison of Keyword Search Techniques with Respect to Electronic Health Records

Asia Pacific Journal of Health Management ◽

10.24083/apjhm.v16i4.587 ◽

2021 ◽

Vol 16 (4) ◽

pp. 30-35

Author(s):

Prachi Gurav ◽

Sanjeev Panandikar

Keyword(s):

Keyword Search ◽

Search Time ◽

Health Data ◽

Health Records ◽

Search Techniques ◽

Approximate Search ◽

Manual Search ◽

Careful Scrutiny ◽

Large Databases ◽

Better Than

As the world progresses towards automation, manual search for data from large databases also needs to keep pace. When the database includes health data, even minute aspects need careful scrutiny. Keyword search techniques are helpful in extracting data from large databases. There are two keyword search techniques: exact and approximate. When the user wants to search through electronic health records (EHR), a short search time is expected. To this end, this work investigates the Metaphone (exact search) and Similar_Text (approximate search) techniques. We have applied keyword search to data that includes the symptoms and names of medicines. Our results indicate that the search time for Similar_Text is better than for Metaphone.


Very Large Databases

Wiley Encyclopedia of Electrical and Electronics Engineering ◽

10.1002/047134608x.w4308 ◽

1999 ◽

Author(s):

Minos N. Garofalakis ◽

Renée J. Miller

Keyword(s):

Large Databases ◽

Very Large Databases


Sampling Methods in Approximate Query Answering Systems

Encyclopedia of Data Warehousing and Mining ◽

10.4018/978-1-59140-557-3.ch186 ◽

2011 ◽

pp. 990-994 ◽

Cited By ~ 2

Author(s):

Gautam Das

Keyword(s):

Data Analysis ◽

Large Data ◽

Massive Datasets ◽

Data Repositories ◽

Large Databases ◽

Approximate Query Answering ◽

Very Large Databases ◽

Approximate Query ◽

And Storage ◽

Collection And Management

In recent years, advances in data collection and management technologies have led to a proliferation of very large databases. These large data repositories typically are created in the hope that, through analysis such as data mining and decision support, they will yield new insights into the data and the real-world processes that created them. In practice, however, while the collection and storage of massive datasets has become relatively straightforward, effective data analysis has proven more difficult to achieve. One reason that data analysis successes have proven elusive is that most analysis queries, by their nature, require aggregation or summarization of large portions of the data being analyzed. For multi-gigabyte data repositories, this means that processing even a single analysis query involves accessing enormous amounts of data, leading to prohibitively expensive running times. This severely limits the feasibility of many types of analysis applications, especially those that depend on timeliness or interactivity.
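One common way to avoid scanning an entire repository for an aggregate, in line with the sampling theme of the entry's title, is to answer the query on a uniform random sample and scale the result. The Python sketch below is a generic illustration under that assumption, not a method taken from the article; the function name and parameters are hypothetical.

```python
import random
import statistics


def approximate_sum(values, sample_size, seed=0):
    """Estimate sum(values) from a uniform sample drawn without replacement.

    Returns the estimate and an approximate standard error (the finite
    population correction is ignored, so the error is slightly overstated).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    scale = len(values) / sample_size
    estimate = scale * sum(sample)
    std_error = len(values) * statistics.stdev(sample) / sample_size ** 0.5
    return estimate, std_error


if __name__ == "__main__":
    data = list(range(100_000))
    estimate, std_error = approximate_sum(data, sample_size=5_000)
    print(f"estimate={estimate:.0f} +/- {std_error:.0f}, exact={sum(data)}")
```

The query touches only 5% of the rows here, trading a bounded estimation error for a much shorter running time.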


Scalable Blocking for Very Large Databases

ECML PKDD 2020 Workshops - Communications in Computer and Information Science ◽

10.1007/978-3-030-65965-3_20 ◽

2020 ◽

pp. 303-319

Author(s):

Andrew Borthwick ◽

Stephen Ash ◽

Bin Pang ◽

Shehzad Qureshi ◽

Timothy Jones

Keyword(s):

Large Databases ◽

Very Large Databases


A Novel Hardware Security Architecture: PD-CRP(PUF Database & Challenge-Response Pair) Bloom Filter on Memristor Based PUF

10.20944/preprints202008.0598.v1 ◽

2020 ◽

Author(s):

Jungwon Lee ◽

Seoyeon Choi ◽

Dayoung Kim ◽

Yunyoung Choi ◽

Wookyung Sun

Keyword(s):

Data Transmission ◽

Hardware Security ◽

Bloom Filter ◽

Transmission Error ◽

Bloom Filters ◽

Search Performance ◽

Security Technology ◽

Security Environment ◽

Filter Size ◽

Simulation Results

Because the development of the Internet of Things (IoT) requires technology that transfers information between objects without human intervention, the core of IoT security will be secure authentication between devices or between devices and servers. Software-based authentication may be a security vulnerability in IoT, but hardware-based security technology can provide a strong security environment. Physical unclonable functions (PUFs) are hardware security elements suitable for lightweight applications. PUFs can generate challenge-response pairs (CRPs) that cannot be controlled or predicted by utilizing inherent physical variations that occur in the manufacturing process. In particular, the pulse width memristive PUF (PWM-PUF) improves security performance by applying different write pulse widths and bank structures. Bloom filters (BFs) are probabilistic data structures that answer membership queries using little memory. Bloom filters can improve search performance and reduce memory usage, and are used in areas such as networking, security, big data, and IoT. In this paper, we propose a structure that applies Bloom filters based on the PWM-PUF to reduce PUF data transmission errors. The proposed structure uses two different Bloom filter types that store different information and that are located in front of and behind the PWM-PUF, improving security by removing challenges from attacker access. Simulation results show that the proposed structure decreases the data transmission error rate and reuse rate as the Bloom filter size increases. They also show that the proposed structure improves PWM-PUF security with a very small Bloom filter memory.


A User-Profile-Oriented Mediation Architecture for Very Large DataBases in a Dynamic Inter-Grid Context

2008 International Conference on Complex, Intelligent and Software Intensive Systems ◽

10.1109/cisis.2008.79 ◽

2008 ◽

Author(s):

Nadia Bennani ◽

Julien Gossa ◽

Ny Haingo Andrianarisoa

Keyword(s):

User Profile ◽

Large Databases ◽

Very Large Databases


Knowledge Combination vs. Meta-Learning

Encyclopedia of Information Science and Technology, Second Edition ◽

10.4018/978-1-60566-026-4.ch368 ◽

2011 ◽

pp. 2325-2331

Author(s):

Ivan Bruha

Keyword(s):

Knowledge Integration ◽

Learning Strategy ◽

Combination Strategy ◽

General Belief ◽

Useful Knowledge ◽

Large Databases ◽

Meta Learning ◽

Very Large Databases ◽

Intelligent Information ◽

High Level

Research in intelligent information systems investigates the possibilities of enhancing their overall performance, particularly their prediction accuracy and time complexity. One such discipline, data mining (DM), usually processes very large databases in a profound and robust way (Fayyad et al., 1996). DM refers to the overall process of determining useful knowledge from databases, that is, extracting high-level knowledge from low-level data in the context of large databases. This article discusses two newer directions in this field, namely knowledge combination and meta-learning (Vilalta & Drissi, 2002). There exist approaches that combine various paradigms into one robust (hybrid, multistrategy) system which utilizes the advantages of each subsystem and tries to eliminate their drawbacks. There is a general belief that integrating results obtained from multiple lower-level decision-making systems, each usually (but not necessarily) based on a different paradigm, produces better performance. Such multi-level knowledge-based systems are usually referred to as knowledge integration systems. One subset of these systems is called knowledge combination (Fan et al., 1996). We focus on a common topology of the knowledge combination strategy with base learners and base classifiers (Bruha, 2004). Meta-learning investigates how learning systems may improve their performance through experience in order to become flexible. Its goal is to search dynamically for the best learning strategy. We define the fundamental characteristics of meta-learning, such as bias and hypothesis space. Section 2 surveys the various directions in algorithms and topologies utilized in knowledge combination and meta-learning. Section 3 represents the main focus of this article: a description of knowledge combination techniques, meta-learning, and a particular application, including the corresponding flow charts. The last section presents the future trends in these topics.


A better way for finding the optimal number of nodes in a distributed database management system

Daffodil International University Journal of Science and Technology ◽

10.3329/diujst.v4i2.4362 ◽

2010 ◽

Vol 4 (2) ◽

pp. 19-22

Author(s):

Rashed Mustafa ◽

Md Javed Hossain ◽

Thomas Chowdhury

Keyword(s):

Management System ◽

Database Management ◽

Database Systems ◽

Distributed Database ◽

Optimal Number ◽

Database Management System ◽

Data Fragmentation ◽

Large Databases ◽

Very Large Databases ◽

Distributed Database Management

The Distributed Database Management System (DDBMS) is one of the prime concerns in distributed computing. The driving force behind the development of DDBMS is the demand of applications that need to query very large databases (on the order of terabytes). Traditional client-server database systems are too slow to handle such applications. This paper presents a better way to find the optimal number of nodes in a distributed database management system. Keywords: DDBMS, Data Fragmentation, Linear Search, RMI. DOI: 10.3329/diujst.v4i2.4362 Daffodil International University Journal of Science and Technology Vol.4(2) 2009 pp.19-22


Route Prefix Caching Using Bloom Filters in Named Data Networking

Applied Sciences ◽

10.3390/app10072226 ◽

2020 ◽

Vol 10 (7) ◽

pp. 2226

Author(s):

Junghwan Kim ◽

Myeong-Cheol Ko ◽

Jinsoo Kim ◽

Moon Sun Shin

Keyword(s):

Bloom Filter ◽

Bloom Filters ◽

Experimental Result ◽

Named Data Networking ◽

Next Generation Internet ◽

Complex Process ◽

Caching Scheme ◽

Name Lookup ◽

Cache Miss ◽

Data Networking

This paper proposes an elaborate route prefix caching scheme for fast packet forwarding in named data networking (NDN), which is a next-generation Internet structure. The name lookup is a crucial function of the NDN router, which delivers a packet based on its name rather than its IP address. It carries out a complex process to find the longest matching prefix for the content name. Moreover, the size of a name prefix is variable and unbounded, which makes the name lookup more complicated and time-consuming. The name lookup can be sped up by using route prefix caching, but it may cause a problem when non-leaf prefixes are cached. The proposed prefix caching scheme can cache non-leaf prefixes, as well as leaf prefixes, without incurring any problem. For this purpose, a Bloom filter is kept for each prefix. The Bloom filter, which is widely used for checking membership, is utilized to indicate the branch information of a non-leaf prefix. The experimental result shows that the proposed caching scheme achieves a much higher hit ratio than other caching schemes. Furthermore, how much the parameters of the Bloom filter affect the cache miss count is quantitatively evaluated. The best performance can be achieved with merely 8-bit Bloom filters and two hash functions.
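The role of a per-prefix Bloom filter can be sketched in Python as follows. This is one possible reading of the abstract, not the paper's algorithm: it assumes the filter of a cached prefix records the next name components of its longer prefixes in the full table, and that a positive test is treated as a cache miss. All function names and the hash construction are illustrative; the 8-bit filter and two hash functions follow the figures quoted in the abstract.

```python
import hashlib

BLOOM_BITS = 8   # filter size quoted in the abstract
NUM_HASHES = 2   # number of hash functions quoted in the abstract


def bloom_positions(component):
    digest = hashlib.sha256(component.encode()).digest()
    return [digest[i] % BLOOM_BITS for i in range(NUM_HASHES)]


def bloom_add(bloom, component):
    for pos in bloom_positions(component):
        bloom |= 1 << pos
    return bloom


def bloom_maybe_contains(bloom, component):
    return all(bloom >> pos & 1 for pos in bloom_positions(component))


def longest_prefix(fib, name):
    """Reference longest-prefix match against the full table."""
    parts = name.strip("/").split("/")
    for length in range(len(parts), 0, -1):
        prefix = "/" + "/".join(parts[:length])
        if prefix in fib:
            return prefix
    return None


def build_cache_entry(fib, prefix):
    """Bloom filter over the next components of all longer prefixes."""
    bloom = 0
    depth = prefix.count("/")
    for other in fib:
        if other.startswith(prefix + "/"):
            bloom = bloom_add(bloom, other.strip("/").split("/")[depth])
    return bloom


def cached_lookup(cache, name):
    """Return the matching prefix on a safe cache hit, else None (miss)."""
    parts = name.strip("/").split("/")
    for length in range(len(parts), 0, -1):
        prefix = "/" + "/".join(parts[:length])
        if prefix in cache:
            if length < len(parts) and bloom_maybe_contains(cache[prefix], parts[length]):
                return None  # a longer prefix may exist in the full table
            return prefix
    return None
```

Under these assumptions a hit is always the true longest match, because a Bloom filter never reports a stored component as absent; a false positive only turns a possible hit into a miss, which is why the filter parameters affect the cache miss count.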
