HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

2007 ◽  
Vol DMTCS Proceedings vol. AH,... (Proceedings) ◽  
Author(s):  
Philippe Flajolet ◽  
Éric Fusy ◽  
Olivier Gandouet ◽  
Frédéric Meunier

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of \emph{distinct} elements (the cardinality) of very large data ensembles. Using an auxiliary memory of $m$ units (typically, "short bytes"), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about $1.04/\sqrt{m}$. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond $10^9$ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
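
To make the register-based scheme concrete, here is a minimal Python sketch of the HyperLogLog idea described above. It is illustrative only, not the authors' reference implementation: the class name, the parameter $b$ (so that $m = 2^b$), and the use of SHA-1 as a stand-in hash are assumptions, and the paper's small- and large-range corrections are omitted.

```python
import hashlib

class HyperLogLog:
    """Minimal HyperLogLog sketch with m = 2**b registers (corrections omitted)."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        # Bias-correction constant alpha_m from the paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash; SHA-1 is an arbitrary stand-in for a good hash function.
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = x & (self.m - 1)        # low b bits select a register
        w = x >> self.b             # remaining 64 - b bits
        rho = (64 - self.b) - w.bit_length() + 1  # 1-based position of leftmost 1-bit
        self.registers[j] = max(self.registers[j], rho)

    def estimate(self):
        # Raw estimate E = alpha_m * m^2 / sum(2^-M_j).
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z
```

With $b = 10$ (so $m = 1024$, about 1.5 KB of registers), the predicted standard error $1.04/\sqrt{1024}$ is roughly 3.3%; feeding the sketch one million distinct items should therefore yield an estimate typically within a few percent of $10^6$.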

2010 ◽  
Vol DMTCS Proceedings vol. AM,... (Proceedings) ◽  
Author(s):  
Jérémie Lumbroso

Building on the ideas of Flajolet and Martin (1985), Alon et al. (1987), Bar-Yossef et al. (2002), and Giroire (2005), we develop a new algorithm for cardinality estimation, based on order statistics, which, according to Chassaing and Gerin (2006), is optimal among similar algorithms. This algorithm has a remarkably simple analysis that allows us to take its $\textit{fine-tuning}$ and the $\textit{characterization of its properties}$ further than has been done until now. We prove that, asymptotically, it is $\textit{strictly unbiased}$ (in contrast to Probabilistic Counting, LogLog, and HyperLogLog), we verify that its relative precision is about $1/\sqrt{m-2}$ when $m$ words of storage are used, and we fully characterize the limit law of the estimates it provides in terms of a gamma distribution; this is the first such algorithm for which the limit law has been established. We also develop a Poisson analysis for the pre-asymptotic regime. In this way, we are able to devise a complete algorithm covering all cardinality ranges, from $0$ to very large.
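
The following hedged Python sketch shows the order-statistics idea in this spirit: hash each element to a uniform value, keep the minimum per bucket, and estimate from the sum of minima. The function name, the parameter $b$, and SHA-1 as the hash are assumptions for illustration; the estimator $m(m-1)/S$ reflects the gamma-distribution analysis summarized above, and the pre-asymptotic regime (many empty buckets) is deliberately not handled.

```python
import hashlib

def min_count_estimate(stream, b=10):
    m = 1 << b
    minima = [1.0] * m                      # per-bucket minimum of uniform hashes
    for item in stream:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = h & (m - 1)                     # stochastic averaging: bucket index
        u = (h >> b) / float(1 << (64 - b)) # remaining bits as a uniform in [0, 1)
        if u < minima[j]:
            minima[j] = u
    # The sum S of the m bucket minima is approximately Gamma-distributed with
    # shape m and scale m/n, so m*(m-1)/S is asymptotically unbiased for n.
    s = sum(minima)
    return m * (m - 1) / s
```

For large cardinalities $n \gg m$, the relative error of this estimator concentrates around $1/\sqrt{m-2}$, matching the precision claimed in the abstract.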


2017 ◽  
Vol 2017 ◽  
pp. 1-8
Author(s):  
Cem Bozkus ◽  
Basilio B. Fraguela

In recent years, vast amounts of data of many kinds are being generated, from pictures and videos from our cameras to software logs from sensor networks and Internet routers operating day and night. This has led to new big data problems, which require new algorithms that can handle these large volumes of data and which are, as a result, very computationally demanding. In this paper, we parallelize one of these new algorithms, namely the HyperLogLog algorithm, which estimates the number of distinct items in a large data set with minimal memory usage, lowering the typical memory usage of this type of calculation from O(n) to O(1). We have implemented parallelizations based on OpenMP and OpenCL and evaluated them on a standard multicore system, an Intel Xeon Phi, and two GPUs from different vendors. The results obtained in our experiments, in which we reach a speedup of 88.6 with respect to an optimized sequential implementation, are very positive, particularly taking into account the need to run this kind of algorithm on large amounts of data.
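
The property that makes HyperLogLog so amenable to the parallelization studied here is that sketches built independently over disjoint (or even overlapping) chunks of the data merge by element-wise register maximum. The Python sketch below, reusing the illustrative HyperLogLog class from the first example, shows this merge step; it is a software analogue, not the paper's OpenMP/OpenCL code, and the chunking scheme is an assumption.

```python
def sketch_chunk(chunk, b=10):
    hll = HyperLogLog(b)              # illustrative class from the first example
    for item in chunk:
        hll.add(item)
    return hll

def merge(sketches, b=10):
    merged = HyperLogLog(b)
    for s in sketches:
        # Register-wise max: duplicates across chunks are never double-counted.
        merged.registers = [max(a, c) for a, c in zip(merged.registers, s.registers)]
    return merged

# Each chunk could be handled by an OpenMP thread or a GPU work-group;
# the final merge costs only m comparisons per partial sketch.
chunks = [range(0, 500_000), range(250_000, 750_000)]   # overlapping on purpose
print(merge([sketch_chunk(c) for c in chunks]).estimate())  # ~750,000 distinct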


2021 ◽  
Vol 14 (11) ◽  
pp. 2369-2382
Author(s):  
Monica Chiosa ◽  
Thomas B. Preußer ◽  
Gustavo Alonso

Data analysts often need to characterize a data stream as a first step to its further processing. Some of the initial insights to be gained include, e.g., the cardinality of the data set and its frequency distribution. Such information is typically extracted by using sketch algorithms, now widely employed to process very large data sets in manageable space and in a single pass over the data. Often, analysts need more than one parameter to characterize the stream. However, computing multiple sketches becomes expensive even when using high-end CPUs. Exploiting the increasing adoption of hardware accelerators, this paper proposes SKT, an FPGA-based accelerator that can compute several sketches along with basic statistics (average, max, min, etc.) in a single pass over the data. SKT has been designed to characterize a data set by calculating its cardinality, its second frequency moment, and its frequency distribution. The design processes data streams coming either from PCIe or TCP/IP, and it is built to fit emerging cloud service architectures, such as Microsoft's Catapult or Amazon's AQUA. The paper explores the trade-offs of designing sketch algorithms on a spatial architecture and how to combine several sketch algorithms into a single design. The empirical evaluation shows how SKT on an FPGA offers a significant performance gain over high-end, server-class CPUs.
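
A software analogue of SKT's single-pass idea is sketched below, not the FPGA design itself: one loop over the stream simultaneously updates basic statistics, the illustrative HyperLogLog cardinality sketch from the first example, and an AMS-style sketch for the second frequency moment. The function names, the number of AMS counters, and the use of SHA-1-derived signs are assumptions.

```python
import hashlib

def ams_sign(item, k):
    # Pseudo-random +/-1 sign per (counter k, item); SHA-1 is a stand-in hash.
    return 1 if hashlib.sha1(f"{k}:{item}".encode()).digest()[0] & 1 else -1

def single_pass(stream, num_ams=64, b=10):
    hll = HyperLogLog(b)                  # cardinality sketch from the first example
    counters = [0] * num_ams              # AMS-style counters for F2
    count, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    for x in stream:                      # one pass updates everything at once
        count += 1
        total += x
        lo, hi = min(lo, x), max(hi, x)
        hll.add(x)
        for k in range(num_ams):
            counters[k] += ams_sign(x, k)
    f2 = sum(c * c for c in counters) / num_ams   # second frequency moment estimate
    return {"count": count, "avg": total / count, "min": lo, "max": hi,
            "cardinality": hll.estimate(), "F2": f2}
```

On an FPGA, each of these updates maps to an independent pipeline stage fed by the same stream, which is what lets SKT compute all the statistics in a single pass without the CPU cost of running each sketch separately.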


2005 ◽  
Vol DMTCS Proceedings vol. AD,... (Proceedings) ◽  
Author(s):  
Frédéric Giroire

We introduce a new class of algorithms to estimate the cardinality of very large multisets using constant memory and making only one pass over the data. It is based on order statistics rather than on bit patterns in binary representations of numbers. We analyse three families of estimators. They attain a standard error of $\frac{1}{\sqrt{M}}$ using $M$ units of storage, which places them in the same class as the best known algorithms so far. They have a very simple internal loop, which gives them an advantage in terms of processing speed. The algorithms are validated on Internet traffic traces.
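
As a quick empirical check of the $\frac{1}{\sqrt{M}}$ standard-error claim for this family of order-statistics estimators, the assumed setup below reuses the hedged min_count_estimate function sketched under the Lumbroso entry above and measures the spread of the estimate-to-truth ratio over repeated synthetic streams; the trial sizes and repeat count are arbitrary choices.

```python
import random
import statistics

def trial(n=20_000, b=10):
    # A stream of n effectively distinct random 64-bit items.
    stream = (random.getrandbits(64) for _ in range(n))
    return min_count_estimate(stream, b) / n    # estimate / true cardinality

ratios = [trial() for _ in range(40)]
print(statistics.stdev(ratios))   # observed relative standard error
print((1 << 10) ** -0.5)          # predicted ~ 1/sqrt(M) = 0.03125 for M = 1024
```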


BUANA SAINS ◽  
2019 ◽  
Vol 18 (2) ◽  
pp. 97
Author(s):  
Widya Pintaka Bayu Putra ◽  
Hendra Saumar

Body measurements are among the livestock selection criteria used for breed standardization. This research was carried out to select the best sire at BPTU-HPT Sapi Aceh Indrapuri based on body measurements at 550 days of age. Livestock records from 2010-2013 were used in this study, consisting of chest girth (CG), withers height (WH) and body length (BL). The averages of the body measurements were 105.22±6.06 cm (CG), 88.42±4.37 cm (WH) and 83.03±4.74 cm (BL). The heritability (h²) estimates ranged from medium (CG and WH) to high (BL). Standard errors (SE) larger than the h² values indicated that the data in this study were limited and the h² estimates were not accurate. The cumulative breeding values (BV) for body measurements ranged from -3.69 (Sire: A.004) to +3.89 (Sire: P.0751). Ranking of sires based on BL was not accurate because the relative accuracy (RA) value was lower than 1.00 in one of the tested sires.


Zootaxa ◽  
2011 ◽  
Vol 2946 (1) ◽  
pp. 45 ◽  
Author(s):  
ROBERT H. CRUICKSHANK

Mooi & Gill (2010) have made a number of criticisms of statistical approaches to the phylogenetic analysis of molecular data as it is currently practiced. There are many different uses for molecular phylogenies, and for most of them statistical methods are entirely appropriate, but for taxonomic purposes the way that these methods have been used is questionable. In these cases it is necessary to introduce an extra step into the analysis – exploration of character conflict. Existing methods for exploring character conflict in molecular data such as spectral analysis, phylogenetic networks, likelihood mapping and sliding window analyses are briefly reviewed, but there is also a need for development of new tools to facilitate the analysis of large data sets. Incorporation of previous phylogenies as priors in Bayesian analyses could help to provide taxonomic stability, while still leaving room for new data to alter these conclusions if they contain sufficiently strong phylogenetic signal. Molecular phylogeneticists should make a clearer distinction between the different uses to which their phylogenies are put; methods suitable in one context may not be appropriate in others.

