How does the HyperLogLog algorithm work?


Question

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.

This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which, until recently, I always assumed was calculated, not estimated).

So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:

function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}
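For comparison, the same exact count can be written with the built-in Set, which does the same O(n)-time, O(n)-memory bookkeeping internally (a sketch; `countUniqueSet` is my name, not from the question):

```javascript
// Equivalent exact count using a Set: still O(n) time and,
// crucially, still O(n) memory - the same problem as above.
function countUniqueSet(arr) {
    return new Set(arr).size;
}
```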

But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).

I've been reading this paper about how to count duplicates in a list in O(n) time and using minimal memory.

It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.

I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.

Answer

The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).

That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.
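This observation can be sketched in a few lines of JavaScript (my own function names, not from the paper): `rank(x)` is the 1-based position of the first 1-bit of a 32-bit value, and seeing a maximum rank of r in a random stream suggests a cardinality of roughly 2^r.

```javascript
// 1-based position of the most significant 1-bit in a 32-bit value.
// A prefix of "001" gives rank 3, suggesting ~2^3 = 8 distinct values.
function rank(x) {
    return Math.clz32(x >>> 0) + 1; // leading zeros + 1 (rank(0) = 33)
}

// Crude single-stream estimate: 2^(largest rank seen so far).
function naiveEstimate(randomValues) {
    let maxRank = 0;
    for (const v of randomValues) {
        maxRank = Math.max(maxRank, rank(v));
    }
    return Math.pow(2, maxRank);
}
```

As the next paragraph explains, this single-observation estimate has very high variance, which is exactly what the substream trick fixes.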

(The prefix "00..1" has no special meaning; it's there just because it's easy to find the most significant bit in a binary number on most processors.)

Of course, if you observe just one integer, the chance this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps the maximum length of a seen "00...1" prefix for each substream. Then it estimates the final value by taking the mean value across the substreams.
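The substream idea can be sketched as follows. This is illustrative only: `fnv1a` plus a murmur-style finalizer stands in for the paper's hash function, the constant 0.709 is the paper's bias-correction alpha for m = 64 buckets, the full algorithm averages with a harmonic mean as below, and the paper's small- and large-range corrections are omitted here.

```javascript
// 32-bit FNV-1a hash followed by a murmur3-style finalizer,
// standing in for a proper hash of each item (assumption).
function hash32(str) {
    let h = 0x811c9dc5;
    for (let i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 0x01000193);
    }
    h ^= h >>> 16; h = Math.imul(h, 0x85ebca6b);
    h ^= h >>> 13; h = Math.imul(h, 0xc2b2ae35);
    h ^= h >>> 16;
    return h >>> 0;
}

// Minimal HyperLogLog-style estimator with m = 2^b substreams.
function hllEstimate(items, b = 6) {
    const m = 1 << b;                  // number of buckets (64)
    const M = new Array(m).fill(0);    // max rank seen per bucket
    for (const item of items) {
        const h = hash32(String(item));
        const bucket = h >>> (32 - b);       // top b bits pick a bucket
        const rest = (h << b) >>> 0;         // remaining bits give the rank
        const r = rest === 0 ? (32 - b + 1) : Math.clz32(rest) + 1;
        M[bucket] = Math.max(M[bucket], r);
    }
    // Harmonic mean of 2^M[j], scaled by the alpha correction for m = 64.
    let sum = 0;
    for (const r of M) sum += Math.pow(2, -r);
    return 0.709 * m * m / sum;
}
```

Note the key property: because only per-bucket maxima are stored, feeding the same items in twice changes nothing, and memory is O(m) regardless of how many items are streamed.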

That's the main idea of this algorithm. There are some missing details (the correction for low estimate values, for example), but it's all well written in the paper. Sorry for the terrible English.

