HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

2007 ◽  
Vol DMTCS Proceedings vol. AH,... (Proceedings) ◽  
Author(s):  
Philippe Flajolet ◽  
Éric Fusy ◽  
Olivier Gandouet ◽  
Frédéric Meunier

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of \emph{distinct} elements (the cardinality) of very large data ensembles. Using an auxiliary memory of $m$ units (typically, "short bytes"), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about $1.04/\sqrt{m}$. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond $10^9$ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
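
To make the register-based scheme concrete, here is a minimal Python sketch of the HyperLogLog idea described above. It is illustrative only, not the authors' reference implementation: the class name, the parameter $b$ (so that $m = 2^b$), and the use of SHA-1 as a stand-in hash are assumptions, and the paper's small- and large-range corrections are omitted.

```python
import hashlib

class HyperLogLog:
    """Minimal HyperLogLog sketch with m = 2**b registers (corrections omitted)."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        # Bias-correction constant alpha_m from the paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash; SHA-1 is an arbitrary stand-in for a good hash function.
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = x & (self.m - 1)        # low b bits select a register
        w = x >> self.b             # remaining 64 - b bits
        rho = (64 - self.b) - w.bit_length() + 1  # 1-based position of leftmost 1-bit
        self.registers[j] = max(self.registers[j], rho)

    def estimate(self):
        # Raw estimate E = alpha_m * m^2 / sum(2^-M_j).
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z
```

With $b = 10$ (so $m = 1024$, about 1.5 KB of registers), the predicted standard error $1.04/\sqrt{1024}$ is roughly 3.3%; feeding the sketch one million distinct items should therefore yield an estimate typically within a few percent of $10^6$.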

2010 ◽  
Vol DMTCS Proceedings vol. AM,... (Proceedings) ◽  
Author(s):  
Jérémie Lumbroso

Building on the ideas of Flajolet and Martin (1985), Alon et al. (1987), Bar-Yossef et al. (2002), and Giroire (2005), we develop a new algorithm for cardinality estimation, based on order statistics, which, according to Chassaing and Gerin (2006), is optimal among similar algorithms. This algorithm has a remarkably simple analysis that allows us to take its $\textit{fine-tuning}$ and the $\textit{characterization of its properties}$ further than has been done until now. We prove that, asymptotically, it is $\textit{strictly unbiased}$ (in contrast to Probabilistic Counting, LogLog, and HyperLogLog), we verify that its relative precision is about $1/\sqrt{m-2}$ when $m$ words of storage are used, and we fully characterize the limit law of the estimates it provides in terms of a gamma distribution; this is the first such algorithm for which the limit law has been established. We also develop a Poisson analysis for the pre-asymptotic regime. In this way, we are able to devise a complete algorithm covering all cardinality ranges, from $0$ to very large.
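
The following hedged Python sketch shows the order-statistics idea in this spirit: hash each element to a uniform value, keep the minimum per bucket, and estimate from the sum of minima. The function name, the parameter $b$, and SHA-1 as the hash are assumptions for illustration; the estimator $m(m-1)/S$ reflects the gamma-distribution analysis summarized above, and the pre-asymptotic regime (many empty buckets) is deliberately not handled.

```python
import hashlib

def min_count_estimate(stream, b=10):
    m = 1 << b
    minima = [1.0] * m                      # per-bucket minimum of uniform hashes
    for item in stream:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = h & (m - 1)                     # stochastic averaging: bucket index
        u = (h >> b) / float(1 << (64 - b)) # remaining bits as a uniform in [0, 1)
        if u < minima[j]:
            minima[j] = u
    # The sum S of the m bucket minima is approximately Gamma-distributed with
    # shape m and scale m/n, so m*(m-1)/S is asymptotically unbiased for n.
    s = sum(minima)
    return m * (m - 1) / s
```

For large cardinalities $n \gg m$, the relative error of this estimator concentrates around $1/\sqrt{m-2}$, matching the precision claimed in the abstract.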


2017 ◽  
Vol 2017 ◽  
pp. 1-8
Author(s):  
Cem Bozkus ◽  
Basilio B. Fraguela

In recent years, vast amounts of data of many kinds are being generated, from pictures and videos from our cameras to software logs from sensor networks and Internet routers operating day and night. This has led to new big data problems, which require new algorithms that can handle these large volumes of data and which are, as a result, very computationally demanding. In this paper, we parallelize one of these new algorithms, namely the HyperLogLog algorithm, which estimates the number of distinct items in a large data set with minimal memory usage, lowering the typical memory usage of this type of calculation from O(n) to O(1). We have implemented parallelizations based on OpenMP and OpenCL and evaluated them on a standard multicore system, an Intel Xeon Phi, and two GPUs from different vendors. The results obtained in our experiments, in which we reach a speedup of 88.6 with respect to an optimized sequential implementation, are very positive, particularly taking into account the need to run this kind of algorithm on large amounts of data.
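
The property that makes HyperLogLog so amenable to the parallelization studied here is that sketches built independently over disjoint (or even overlapping) chunks of the data merge by element-wise register maximum. The Python sketch below, reusing the illustrative HyperLogLog class from the first example, shows this merge step; it is a software analogue, not the paper's OpenMP/OpenCL code, and the chunking scheme is an assumption.

```python
def sketch_chunk(chunk, b=10):
    hll = HyperLogLog(b)              # illustrative class from the first example
    for item in chunk:
        hll.add(item)
    return hll

def merge(sketches, b=10):
    merged = HyperLogLog(b)
    for s in sketches:
        # Register-wise max: duplicates across chunks are never double-counted.
        merged.registers = [max(a, c) for a, c in zip(merged.registers, s.registers)]
    return merged

# Each chunk could be handled by an OpenMP thread or a GPU work-group;
# the final merge costs only m comparisons per partial sketch.
chunks = [range(0, 500_000), range(250_000, 750_000)]   # overlapping on purpose
print(merge([sketch_chunk(c) for c in chunks]).estimate())  # ~750,000 distinct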


2021 ◽  
Vol 14 (11) ◽  
pp. 2369-2382
Author(s):  
Monica Chiosa ◽  
Thomas B. Preußer ◽  
Gustavo Alonso

Data analysts often need to characterize a data stream as a first step to its further processing. Some of the initial insights to be gained include, e.g., the cardinality of the data set and its frequency distribution. Such information is typically extracted by using sketch algorithms, now widely employed to process very large data sets in manageable space and in a single pass over the data. Often, analysts need more than one parameter to characterize the stream. However, computing multiple sketches becomes expensive even when using high-end CPUs. Exploiting the increasing adoption of hardware accelerators, this paper proposes SKT, an FPGA-based accelerator that can compute several sketches along with basic statistics (average, max, min, etc.) in a single pass over the data. SKT has been designed to characterize a data set by calculating its cardinality, its second frequency moment, and its frequency distribution. The design processes data streams coming either from PCIe or TCP/IP, and it is built to fit emerging cloud service architectures, such as Microsoft's Catapult or Amazon's AQUA. The paper explores the trade-offs of designing sketch algorithms on a spatial architecture and how to combine several sketch algorithms into a single design. The empirical evaluation shows how SKT on an FPGA offers a significant performance gain over high-end, server-class CPUs.
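
A software analogue of SKT's single-pass idea is sketched below, not the FPGA design itself: one loop over the stream simultaneously updates basic statistics, the illustrative HyperLogLog cardinality sketch from the first example, and an AMS-style sketch for the second frequency moment. The function names, the number of AMS counters, and the use of SHA-1-derived signs are assumptions.

```python
import hashlib

def ams_sign(item, k):
    # Pseudo-random +/-1 sign per (counter k, item); SHA-1 is a stand-in hash.
    return 1 if hashlib.sha1(f"{k}:{item}".encode()).digest()[0] & 1 else -1

def single_pass(stream, num_ams=64, b=10):
    hll = HyperLogLog(b)                  # cardinality sketch from the first example
    counters = [0] * num_ams              # AMS-style counters for F2
    count, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    for x in stream:                      # one pass updates everything at once
        count += 1
        total += x
        lo, hi = min(lo, x), max(hi, x)
        hll.add(x)
        for k in range(num_ams):
            counters[k] += ams_sign(x, k)
    f2 = sum(c * c for c in counters) / num_ams   # second frequency moment estimate
    return {"count": count, "avg": total / count, "min": lo, "max": hi,
            "cardinality": hll.estimate(), "F2": f2}
```

On an FPGA, each of these updates maps to an independent pipeline stage fed by the same stream, which is what lets SKT compute all the statistics in a single pass without the CPU cost of running each sketch separately.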


2005 ◽  
Vol DMTCS Proceedings vol. AD,... (Proceedings) ◽  
Author(s):  
Frédéric Giroire

We introduce a new class of algorithms to estimate the cardinality of very large multisets using constant memory and making only one pass over the data. It is based on order statistics rather than on bit patterns in binary representations of numbers. We analyse three families of estimators. They attain a standard error of $\frac{1}{\sqrt{M}}$ using $M$ units of storage, which places them in the same class as the best known algorithms so far. They have a very simple internal loop, which gives them an advantage in terms of processing speed. The algorithms are validated on Internet traffic traces.
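
As a quick empirical check of the $\frac{1}{\sqrt{M}}$ standard-error claim for this family of order-statistics estimators, the assumed setup below reuses the hedged min_count_estimate function sketched under the Lumbroso entry above and measures the spread of the estimate-to-truth ratio over repeated synthetic streams; the trial sizes and repeat count are arbitrary choices.

```python
import random
import statistics

def trial(n=20_000, b=10):
    # A stream of n effectively distinct random 64-bit items.
    stream = (random.getrandbits(64) for _ in range(n))
    return min_count_estimate(stream, b) / n    # estimate / true cardinality

ratios = [trial() for _ in range(40)]
print(statistics.stdev(ratios))   # observed relative standard error
print((1 << 10) ** -0.5)          # predicted ~ 1/sqrt(M) = 0.03125 for M = 1024
```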


BUANA SAINS ◽  
2019 ◽  
Vol 18 (2) ◽  
pp. 97
Author(s):  
Widya Pintaka Bayu Putra ◽  
Hendra Saumar

Body measurements are among the livestock selection criteria used for breed standardization. This research was carried out to select the best sire at BPTU-HPT Sapi Aceh Indrapuri based on body measurements at 550 days of age. Livestock records from 2010-2013 were used in this study, consisting of chest girth (CG), withers height (WH) and body length (BL). The averages of the body measurements were 105.22±6.06 cm (CG), 88.42±4.37 cm (WH) and 83.03±4.74 cm (BL). The heritability (h²) estimates ranged from medium (CG and WH) to high (BL). Standard errors (SE) larger than the h² values indicated that the data in this study were limited and the h² estimates were not accurate. The cumulative breeding values (BV) for body measurements ranged from -3.69 (Sire: A.004) to +3.89 (Sire: P.0751). Ranking of sires based on BL was not accurate because the relative accuracy (RA) value was lower than 1.00 in one of the tested sires.


Zootaxa ◽  
2011 ◽  
Vol 2946 (1) ◽  
pp. 45 ◽  
Author(s):  
ROBERT H. CRUICKSHANK

Mooi & Gill (2010) have made a number of criticisms of statistical approaches to the phylogenetic analysis of molecular data as it is currently practiced. There are many different uses for molecular phylogenies, and for most of them statistical methods are entirely appropriate, but for taxonomic purposes the way that these methods have been used is questionable. In these cases it is necessary to introduce an extra step into the analysis – exploration of character conflict. Existing methods for exploring character conflict in molecular data such as spectral analysis, phylogenetic networks, likelihood mapping and sliding window analyses are briefly reviewed, but there is also a need for development of new tools to facilitate the analysis of large data sets. Incorporation of previous phylogenies as priors in Bayesian analyses could help to provide taxonomic stability, while still leaving room for new data to alter these conclusions if they contain sufficiently strong phylogenetic signal. Molecular phylogeneticists should make a clearer distinction between the different uses to which their phylogenies are put; methods suitable in one context may not be appropriate in others.

