scholarly journals DREAM-Yara: An exact read mapper for very large databases with short update time

2018 ◽  
Author(s):  
Temesgen Hailemariam Dadi ◽  
Enrico Siragusa ◽  
Vitor C. Piro ◽  
Andreas Andrusch ◽  
Enrico Seiler ◽  
...  

AbstractMotivationMapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. > 10 GB) has become a bottleneck. This affects many analyses that need such index as an essential step for approximate matching of the NGS reads to reference databases. For instance, in typical metagenomics analysis, the size of the reference sequences has become prohibitive to compute a single full-text index on standard machines. Even on large memory machines, computing such index takes about one day of computing time. As a result, updates of indices are rarely performed. Hence, it is desirable to create an alternative way of indexing while preserving fast search times.ResultsTo solve the index construction and update problem we propose the DREAM (Dynamic seaRchablE pArallel coMpressed index) framework and provide an implementation. The main contributions are the introduction of an approximate search distributor directories via a novel use of Bloom filters. We combine several Bloom filters to form an interleaved Bloom filter and use this new data structure to quickly exclude reads for parts of the databases where they cannot match. This allows us to keep the databases in several indices which can be easily rebuilt if parts are updated while maintaining a fast search time. The second main contribution is an implementation of DREAM-Yara a distributed version of a fully sensitive read mapper under the DREAM [email protected]://gitlab.com/pirovc/dream_yara/

Author(s):  
Rainer Schnell ◽  
Christian Borgs

ABSTRACTObjectiveIn most European settings, record linkage across different institutions has to be based on personal identifiers such as names, birthday or place of birth. To protect the privacy of research subjects, the identifiers have to be encrypted. In practice, these identifiers show error rates up to 20% per identifier, therefore linking on encrypted identifiers usually implies the loss of large subsets of the databases. In many applications, this loss of cases is related to variables of interest for the subject matter of the study. Therefore, this kind of record-linkage will generate biased estimates. These problems gave rise to techniques of Privacy Preserving Record Linkage (PPRL). Many different PPRL techniques have been suggested within the last 10 years, very few of them are suitable for practical applications with large database containing millions of records as they are typical for administrative or medical databases. One proven technique for PPRL for large scale applications is PPRL based on Bloom filters.MethodUsing appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 Million records. By the application of suitable blocking strategies, linking can be done in reasonable time.ResultHowever, Bloom filters have been subject of cryptographic attacks. Previous research has shown that the straight application of Bloom filters has a nonzero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers by different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance concerning precision, recall and re-identification risk on large databases.


2021 ◽  
Vol 16 (4) ◽  
pp. 30-35
Author(s):  
Prachi Gurav ◽  
Sanjeev Panandikar

As the world progresses towards automation, manual search for data from large databases also needs to keep pace. When the database includes health data, even minute aspects need careful scrutiny. Keyword search techniques are helpful in extracting data from large databases. There are two keyword search techniques: Exact and Approximate. When the user wants to search through EHR, a short search time is expected. To this end, this work investigates Metaphone (Exact search) and Similar_Text (approximate search) Techniques. We have applied keyword search to the data, which includes the symptoms and names of medicines. Our results indicate that the search time for Similar_text is better than for Metaphone.


Author(s):  
Gautam Das

In recent years, advances in data collection and management technologies have led to a proliferation of very large databases. These large data repositories typically are created in the hope that, through analysis such as data mining and decision support, they will yield new insights into the data and the real-world processes that created them. In practice, however, while the collection and storage of massive datasets has become relatively straightforward, effective data analysis has proven more difficult to achieve. One reason that data analysis successes have proven elusive is that most analysis queries, by their nature, require aggregation or summarization of large portions of the data being analyzed. For multi-gigabyte data repositories, this means that processing even a single analysis query involves accessing enormous amounts of data, leading to prohibitively expensive running times. This severely limits the feasibility of many types of analysis applications, especially those that depend on timeliness or interactivity.


Author(s):  
Andrew Borthwick ◽  
Stephen Ash ◽  
Bin Pang ◽  
Shehzad Qureshi ◽  
Timothy Jones

Author(s):  
Jungwon Lee ◽  
Seoyeon Choi ◽  
Dayoung Kim ◽  
Yunyoung Choi ◽  
Wookyung Sun

Because the development of the internet of things (IoT) requires technology that transfers information between objects without human intervention, the core of IoT security will be secure authentication between devices or between devices and servers. Software-based authentication may be a security vulnerability in IoT, but hardware-based security technology can provide a strong security environment. A physical unclonable functions (PUFs) are a hardware security element suitable for lightweight applications. PUFs can generate challenge-response pairs(CRPs) that cannot be controlled or predicted by utilizing inherent physical variations that occur in the manufacturing process. In particular, pulse width memristive PUF (PWM-PUF) improves security performance by applying different write pulse widths and bank structures. Bloom filter (BF) is probabilistic data structures that answer membership queries using small memories. Bloom filter can improve search performance and reduce memory usage and are used in areas such as networking, security, big data, and IoT. In this paper, we propose a structure that applies Bloom filters based on the PWM-PUF to reduce PUF data transmission errors. The proposed structure uses two different Bloom filter types that store different information and that are located in front of and behind the PWM-PUF, improving security by removing challenges from attacker access. Simulation results show that the proposed structure decreases the data transmission error rate and reuse rate as the Bloom filter size increases, the simulation results also show that the proposed structure improves PWM-PUF security with a very small Bloom filter memory.


Author(s):  
Ivan Bruha

Research in intelligent information systems investigates the possibilities of enhancing their over-all performance, particularly their prediction accuracy and time complexity. One such discipline, data mining (DM), processes usually very large databases in a profound and robust way (Fayyad et al., 1996). DM points to the overall process of determining a useful knowledge from databases, that is, extracting high-level knowledge from low-level data in the context of large databases. This article discusses two newer directions in this field, namely knowledge combination and meta-learning (Vilalta & Drissi, 2002). There exist approaches to combine various paradigms into one robust (hybrid, multistrategy) system which utilizes the advantages of each subsystem and tries to eliminate their drawbacks. There is a general belief that integrating results obtained from multiple lower-level decision-making systems, each usually (but not required) based on a different paradigm, produce better performance. Such multi-level knowledgebased systems are usually referred to as knowledge integration systems. One subset of these systems is called knowledge combination (Fan et al., 1996). We focus on a common topology of the knowledge combination strategy with base learners and base classifiers (Bruha, 2004). Meta-learning investigates how learning systems may improve their performance through experience in order to become flexible. Its goal is to search dynamically for the best learning strategy. We define the fundamental characteristics of the meta-learning such as bias, and hypothesis space. Section 2 surveys the various directions in algorithms and topologies utilized in knowledge combination and meta-learning. Section 3 represents the main focus of this article: description of knowledge combination techniques, meta-learning, and a particular application including the corresponding flow charts. The last section presents the future trends in these topics.


Author(s):  
Rashed Mustafa ◽  
Md Javed Hossain ◽  
Thomas Chowdhury

Distributed Database Management System (DDBMS) is one of the prime concerns in distributed computing. The driving force of development of DDBMS is the demand of the applications that need to query very large databases (order of terabytes). Traditional Client- Server database systems are too slower to handle such applications. This paper presents a better way to find the optimal number of nodes in a distributed database management systems. Keywords: DDBMS, Data Fragmentation, Linear Search, RMI.   DOI: 10.3329/diujst.v4i2.4362 Daffodil International University Journal of Science and Technology Vol.4(2) 2009 pp.19-22


2020 ◽  
Vol 10 (7) ◽  
pp. 2226
Author(s):  
Junghwan Kim ◽  
Myeong-Cheol Ko ◽  
Jinsoo Kim ◽  
Moon Sun Shin

This paper proposes an elaborate route prefix caching scheme for fast packet forwarding in named data networking (NDN) which is a next-generation Internet structure. The name lookup is a crucial function of the NDN router, which delivers a packet based on its name rather than IP address. It carries out a complex process to find the longest matching prefix for the content name. Even the size of a name prefix is variable and unbounded; thus, the name lookup is to be more complicated and time-consuming. The name lookup can be sped up by using route prefix caching, but it may cause a problem when non-leaf prefixes are cached. The proposed prefix caching scheme can cache non-leaf prefixes, as well as leaf prefixes, without incurring any problem. For this purpose, a Bloom filter is kept for each prefix. The Bloom filter, which is widely used for checking membership, is utilized to indicate the branch information of a non-leaf prefix. The experimental result shows that the proposed caching scheme achieves a much higher hit ratio than other caching schemes. Furthermore, how much the parameters of the Bloom filter affect the cache miss count is quantitatively evaluated. The best performance can be achieved with merely 8-bit Bloom filters and two hash functions.


Sign in / Sign up

Export Citation Format

Share Document