Lighter: fast and memory-efficient error correction without counting

2014 ◽  
Author(s):  
Li Song ◽  
Liliana Florea ◽  
Ben Langmead

Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
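The sampling idea in this abstract can be illustrated with a small sketch. It is not Lighter's implementation: the `BloomFilter` class, the `sample_kmers` and `sampling_fraction` helpers, and the reference constants are all hypothetical names chosen for illustration. The sketch covers only the first filter (the sampled k-mers); the rule for filling the second filter is not described in the abstract, so it is left out.

```python
import hashlib
import random


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(read, k):
    """Yield every length-k substring of a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def sampling_fraction(depth, base_fraction, base_depth):
    """Scale the sampling fraction in inverse proportion to sequencing depth.

    base_fraction and base_depth are assumed reference values, not constants
    taken from the paper.
    """
    return min(1.0, base_fraction * base_depth / depth)


def sample_kmers(reads, k, fraction, bloom, rng):
    """Insert each k-mer into the filter with probability `fraction`."""
    for read in reads:
        for kmer in kmers(read, k):
            if rng.random() < fraction:
                bloom.add(kmer)
```

Because the expected number of inserted k-mers is roughly the sampling fraction times the total k-mer count, doubling the depth while halving the fraction keeps the filter's load, and hence its size requirement, about the same.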

Author(s):  
Jungwon Lee ◽  
Seoyeon Choi ◽  
Dayoung Kim ◽  
Yunyoung Choi ◽  
Wookyung Sun

Because the development of the internet of things (IoT) requires technology that transfers information between objects without human intervention, the core of IoT security will be secure authentication between devices or between devices and servers. Software-based authentication may be a security vulnerability in IoT, but hardware-based security technology can provide a strong security environment. Physical unclonable functions (PUFs) are hardware security elements suitable for lightweight applications. PUFs can generate challenge-response pairs (CRPs) that cannot be controlled or predicted by utilizing inherent physical variations that occur in the manufacturing process. In particular, the pulse width memristive PUF (PWM-PUF) improves security performance by applying different write pulse widths and bank structures. Bloom filters (BFs) are probabilistic data structures that answer membership queries using small memories. Bloom filters can improve search performance and reduce memory usage and are used in areas such as networking, security, big data, and IoT. In this paper, we propose a structure that applies Bloom filters based on the PWM-PUF to reduce PUF data transmission errors. The proposed structure uses two different Bloom filter types that store different information and that are located in front of and behind the PWM-PUF, improving security by removing challenges from attacker access. Simulation results show that the proposed structure decreases the data transmission error rate and reuse rate as the Bloom filter size increases; the simulation results also show that the proposed structure improves PWM-PUF security with a very small Bloom filter memory.


2020 ◽  
Vol 10 (20) ◽  
pp. 7198
Author(s):  
Junghwan Kim ◽  
Myeong-Cheol Ko ◽  
Moon Sun Shin ◽  
Jinsoo Kim

Prefix caching is a notable technique for improving IP address lookup performance, which is crucial in packet forwarding. A cached prefix can match a range of IP addresses, so prefix caching leads to a higher cache hit ratio than IP address caching. However, prefix caching has an issue to be resolved. When a prefix matches in the cache, it cannot be returned as the result unless it is certain that the matching prefix has no longer descendant prefix that is not yet cached. This is because IP address lookup seeks the longest matching prefix. Some prefix expansion techniques avoid the problem, but the expanded prefixes occupy more entries and cover a smaller range of IP addresses. This paper proposes a novel prefix caching scheme in which the original prefix can be cached without expansion. In this scheme, a Bloom filter is constructed for each prefix and used to test whether there is any matchable descendant. The false positive ratio of a Bloom filter generally grows as the number of elements contained in the filter increases. We devise an elaborate two-level Bloom filter scheme that adjusts the filter size at each level according to the number of contained elements, in order to reduce the false positive ratio. The experimental results show that the proposed scheme achieves a very low cache miss ratio without increasing the number of prefixes. In addition, most of the filter assertions are negative, which means the proposed prefix cache effectively hits the matching prefix using the filter.
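The remark that the false positive ratio grows with the number of stored elements can be made concrete with the standard textbook approximation for a Bloom filter with m bits, n elements, and k hash functions, p ≈ (1 − e^(−kn/m))^k. The sketch below is not the paper's two-level scheme; the function names are hypothetical and it only shows why sizing a filter to its element count keeps the ratio low.

```python
import math


def false_positive_rate(num_bits, num_items, num_hashes):
    """Standard approximation of a Bloom filter's false positive ratio."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def bits_for_target(num_items, target_rate):
    """Bits needed to reach target_rate, assuming the optimal number of hashes."""
    return math.ceil(-num_items * math.log(target_rate) / (math.log(2) ** 2))


if __name__ == "__main__":
    # Same filter size, more elements: the false positive ratio rises.
    for n in (50, 100, 200, 400):
        print(n, round(false_positive_rate(1024, n, 3), 4))
```

Under this approximation, a prefix with many descendants needs a proportionally larger filter than a prefix with few, which is the motivation for adjusting filter size to the number of contained elements.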


2020 ◽  
Vol 117 (29) ◽  
pp. 16961-16968
Author(s):  
Justin Chu ◽  
Hamid Mohamadi ◽  
Emre Erhan ◽  
Jeffery Tse ◽  
Readman Chiu ◽  
...  

Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.
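How a spaced seed tolerates mismatches can be shown with a short sketch. A spaced seed is commonly written as a binary mask in which `1` marks positions that must match and `0` marks positions that are ignored. The mask `110011` and the helper name `apply_spaced_seed` below are hypothetical; this is not the miBF data structure or its hashing scheme.

```python
def apply_spaced_seed(window, seed):
    """Keep bases at '1' positions of the seed; replace the rest with '-'."""
    if len(window) != len(seed):
        raise ValueError("window and seed must have the same length")
    return "".join(
        base if flag == "1" else "-" for base, flag in zip(window, seed)
    )


if __name__ == "__main__":
    seed = "110011"  # assumed example pattern
    reference = "ACGTAC"
    read_with_error = "ACTTAC"  # differs from the reference at position 2
    # Both windows reduce to the same key because the mismatch falls on a '0'.
    print(apply_spaced_seed(reference, seed))
    print(apply_spaced_seed(read_with_error, seed))
```

A plain k-mer lookup would treat the two windows as different keys, whereas the masked keys are identical, so a lookup keyed on the masked string still hits despite the mismatch.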


Author(s):  
Antony Stevens

ABSTRACT Objective: Bloom Filters have been used in a number of studies conducted for the Ministry of Health. They are usually recommended because of the possibility that they may participate in secure protocols for the exchange of data. In our case the speed of the program, once the filters have been prepared, is so high that this alone is sufficient motive for their adoption. Nevertheless, if two calendar dates differ by one character, this may merit more attention than a similar difference in personal names. This became evident in a large linkage between mortality records and hospital separations where the patient had died. Higher scores were obtained when the date fields differed by only one character, but when that character represented a year there would be no reason to notice the pair. When the character difference was compatible with a difference of a few days this would be more interesting, because in studies like the one just cited it would be reasonable to admit differences of a few days or even, perhaps, weeks between the events (the recording of the death of the patient). Approach: How then to represent the difference between dates in a Bloom Filter? A date can be represented as a Boolean vector where the day (or week) is set to '1'. It may be represented by several contiguous '1's to allow for admissible uncertainty in comparisons. The similarity between two dates can then just be the Dice Coefficient of the corresponding vectors. Result: But a vector representing a date may then be very large. It could be as much as 365 bits per year, far more than is usually used for the other fields. The number of logical word comparisons would go up and the program would become slower. Knowing that the admissible range is represented by contiguous '1's means that we can obtain the effect of constructing the Bloom Filter and calculating the Dice Coefficient more directly. Starting with the two dates we can obtain the number of bits that are shared, which will depend on the admissible range. The Dice Coefficient can then be calculated directly without the need to construct the Filter. Conclusion: We are then left with the decision on how to add the result to the value obtained from the other variables, and this will depend on what importance it is felt the date should have.
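The direct calculation described in this abstract can be sketched as follows. The sketch assumes one bit per day and the same symmetric window of `tolerance_days` on either side of each date, which is an illustrative choice rather than the author's stated parameters; the function names are hypothetical. Each date then sets 2·tolerance_days + 1 contiguous bits, and the number of shared bits follows from the gap between the dates.

```python
from datetime import date


def shared_days(d1, d2, tolerance_days):
    """Number of '1' bits two contiguous date windows have in common."""
    window = 2 * tolerance_days + 1
    gap = abs(d1.toordinal() - d2.toordinal())
    return max(0, window - gap)


def date_dice(d1, d2, tolerance_days):
    """Dice coefficient of the two windows, computed without building vectors."""
    window = 2 * tolerance_days + 1
    return 2 * shared_days(d1, d2, tolerance_days) / (window + window)


if __name__ == "__main__":
    death = date(2020, 1, 1)
    print(date_dice(death, date(2020, 1, 3), 3))  # two days apart
    print(date_dice(death, date(2021, 1, 1), 3))  # one year apart
```

With a three-day tolerance, dates two days apart share 5 of 7 bits and score 5/7, while dates a year apart share none and score 0, which matches the intent that a one-character difference in the year should not look like a near match.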


2020 ◽  
Vol 10 (19) ◽  
pp. 6692
Author(s):  
Jungwon Lee ◽  
Seoyeon Choi ◽  
Dayoung Kim ◽  
Yunyoung Choi ◽  
Wookyung Sun

Because the development of the Internet of Things (IoT) requires technology that transfers information between objects without human intervention, the core of IoT security will be secure authentication between devices or between devices and servers. Software-based authentication may be a security vulnerability in IoT, but hardware-based security technology can provide a strong security environment. Physical unclonable functions (PUFs) are hardware security elements suitable for lightweight applications. PUFs can generate challenge–response pairs (CRPs) that cannot be controlled or predicted by utilizing inherent physical variations that occur in the manufacturing process. In particular, the pulsewidth-based memristive PUF (pm-PUF) improves security performance by applying different write pulse widths and bank structures. Bloom filters (BFs) are probabilistic data structures that answer membership queries using small memories. Bloom filters can improve search performance and reduce memory usage and are used in areas such as networking, security, big data, and IoT. In this paper, we propose a structure that applies Bloom filters based on the pm-PUF to reduce PUF data transmission errors. The proposed structure uses two different Bloom filter types that store different information and that are located in front of and behind the pm-PUF, reducing unnecessary access by removing challenges from attacker access. Simulation results show that the proposed structure decreases the data transmission error rate and reuse rate as the Bloom filter size increases; the simulation results also show that the proposed structure improves pm-PUF security with a very small Bloom filter memory.


2018 ◽  
Author(s):  
Justin Chu ◽  
Hamid Mohamadi ◽  
Emre Erhan ◽  
Jeffery Tse ◽  
Readman Chiu ◽  
...  

ABSTRACT Alignment-free classification of sequences against collections of sequences has enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines. Originally hash-table based, much work has been done to improve and reduce the memory requirement of indexing of k-mer sequences with probabilistic indexing strategies. These efforts have led to lower-memory, highly efficient indexes, but such indexes often lack sensitivity in the face of sequencing errors or polymorphism because they are k-mer based. To address this, we designed a new memory-efficient data structure that can tolerate mismatches using multiple spaced seeds, called a multi-index Bloom Filter. Implemented as part of BioBloom Tools, we demonstrate our algorithm in two applications, read binning for targeted assembly and taxonomic read assignment. Our tool shows a higher sensitivity and specificity for read-binning than BWA MEM at an order of magnitude less time. For taxonomic classification, we show higher sensitivity than CLARK-S at an order of magnitude less time while using half the memory.


2008 ◽  
Vol 205 (2) ◽  
pp. 841-848
Author(s):  
Fei Wang ◽  
Zejian Yuan ◽  
Nanning Zheng ◽  
Yuehu Liu

2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

In this study we implemented four different versions of Apriori, namely, basic and basic multi-threaded, bloom filter, trie, and count-min sketch, and proposed a new algorithm – NCLAT (Near Candidate-Less Apriori with Tidlists). We compared the runtimes and max memory usages of our implementations among each other as well as with the runtime of Borgelt’s Apriori implementation in some of the cases. NCLAT implementation is more efficient than the other Apriori implementations that we know of in terms of the number of times the database is scanned, and the number of candidates generated. Unlike the original Apriori algorithm which scans the database for every level and creates all of the candidates in advance for each level, NCLAT scans the database only once and creates candidate itemsets only for level one but not afterwards. Thus the number of candidates created is equal to the number of unique items in the database.


2015 ◽  
Vol 2015 ◽  
pp. 1-8 ◽  
Author(s):  
Inanç Birol ◽  
Justin Chu ◽  
Hamid Mohamadi ◽  
Shaun D. Jackman ◽  
Karthika Raghavan ◽  
...  

De novo assembly of the genome of a species is essential in the absence of a reference genome sequence. Many scalable assembly algorithms use the de Bruijn graph (DBG) paradigm to reconstruct genomes, where a table of subsequences of a certain length is derived from the reads, and their overlaps are analyzed to assemble sequences. Despite longer subsequences unlocking longer genomic features for assembly, the associated increase in compute resources limits the practicability of DBG over other assembly archetypes already designed for longer reads. Here, we revisit the DBG paradigm to adapt it to the changing sequencing technology landscape and introduce three data structure designs for spaced seeds in the form of paired subsequences. These data structures address memory and run time constraints imposed by longer reads. We observe that when a fixed distance separates seed pairs, it provides increased sequence specificity with increased gap length. Further, we note that Bloom filters would be suitable to implicitly store spaced seeds and be tolerant to sequencing errors. Building on this concept, we describe a data structure for tracking the frequencies of observed spaced seeds. These data structure designs will have applications in genome, transcriptome and metagenome assemblies, and read error correction.
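The notion of spaced seeds as paired subsequences separated by a fixed distance can be sketched in a few lines. This is an illustration under assumed parameters, not one of the three data structures the abstract refers to: the helper name `kmer_pairs` and the values of `k` and `gap` are hypothetical.

```python
def kmer_pairs(sequence, k, gap):
    """Yield (left, right) k-mer pairs whose members are separated by `gap` bases."""
    span = 2 * k + gap  # total stretch of sequence covered by one pair
    for start in range(len(sequence) - span + 1):
        left = sequence[start:start + k]
        right = sequence[start + k + gap:start + span]
        yield left, right


if __name__ == "__main__":
    for left, right in kmer_pairs("ACGTTGCAAC", k=3, gap=2):
        print(left, right)
```

Only 2k bases are stored per pair, yet each pair constrains a stretch of 2k + gap bases, so widening the gap extends the covered span without enlarging the stored key. Bases inside the gap are never read, so an error that falls there does not change the pair.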

