How does the HyperLogLog algorithm work?

Problem description

I've been learning about different algorithms in my spare time recently, and one that I came across that looks very interesting is the HyperLogLog algorithm, which estimates how many unique items are in a list.

This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which until recently I had always assumed was calculated, not estimated).

So I know how to write an O(n) algorithm that calculates how many unique items are in an array. I wrote this in JavaScript:

function countUniqueAlgo1(arr) {
    var Table = {};          // lookup table of values already seen
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;        // already counted this value, skip it
        }
        Table[val] = 1;      // mark the value as seen
        numUnique++;
    }
    return numUnique;
}
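As a quick sanity check, a hypothetical call (with made-up sample data) might look like this:

var sample = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5];
console.log(countUniqueAlgo1(sample)); // 7 (the unique values are 3, 1, 4, 5, 9, 2, 6)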

But the problem is that my algorithm, while O(n), uses a lot of memory, because it stores every value in Table.

I've been reading this paper about how to count duplicates in a list in O(n) time while using minimal memory.

It explains that by hashing and counting bits (or something along those lines) one can estimate, within a certain probability (assuming the list is evenly distributed), the number of unique items in a list.

I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.

Recommended answer

The main trick behind this algorithm is that if you are observing a stream of random integers and see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).

That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a number starting with "001", there is a higher chance that this stream has a cardinality of 8.
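To check those fractions, here is a small, hypothetical simulation (using Math.random as a stand-in for a uniform hash; nothing here is from the original answer):

// Fraction of random 32-bit values whose binary representation
// (padded to 32 bits) starts with "1", "01", and "001".
function prefixFractions(trials) {
    var counts = { "1": 0, "01": 0, "001": 0 };
    for (var i = 0; i < trials; i++) {
        var x = Math.floor(Math.random() * 4294967296); // uniform 32-bit value
        var bits = x.toString(2).padStart(32, "0");
        if (bits[0] === "1") counts["1"]++;
        else if (bits[1] === "1") counts["01"]++;
        else if (bits[2] === "1") counts["001"]++;
    }
    console.log(counts["1"] / trials, counts["01"] / trials, counts["001"] / trials);
    // Prints roughly 0.5, 0.25, 0.125
}
prefixFractions(1000000);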

(The prefix "00..1" has no special meaning. It is used only because the most significant bit of a binary number is easy to find on most processors.)
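In JavaScript, for example, the position of that leading 1-bit can be read off with the built-in Math.clz32, which counts the leading zero bits of a 32-bit integer (the inputs below are just illustrative):

// rank(x) = number of leading zeros + 1 = position of the first 1-bit
// in the 32-bit binary representation of x (x is assumed to be a hash value).
function rank(x) {
    return Math.clz32(x) + 1; // note: Math.clz32(0) is 32
}
console.log(rank(0x80000000)); // 1 (binary starts with "1")
console.log(rank(0x20000000)); // 3 (binary starts with "001")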

Of course, if you observe just one integer, the chance that this estimate is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps, for each substream, the maximum length of the "00...1" prefix seen. It then estimates the final value by taking the mean of those per-substream values.
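A minimal sketch of that bucketing idea (not the full algorithm) might look like this in JavaScript. Here hash32 is a hypothetical placeholder for a real 32-bit hash function, and a plain arithmetic mean is used for simplicity, whereas the actual HyperLogLog estimator uses a harmonic mean and a bias-correction constant, as described in the paper:

// Simplified sketch: the first b bits of the hash pick a bucket (substream),
// the remaining bits act as the "random stream"; per bucket we keep the
// maximum rank (leading-zero count + 1) seen, then estimate the cardinality
// from the average of those maxima.
function estimateCardinality(items, b, hash32) {
    var m = 1 << b;                       // number of substreams/buckets
    var maxRank = new Array(m).fill(0);
    for (var i = 0; i < items.length; i++) {
        var h = hash32(items[i]) >>> 0;   // 32-bit unsigned hash
        var bucket = h >>> (32 - b);      // first b bits select the bucket
        var rest = (h << b) >>> 0;        // remaining bits, shifted to the top
        var rank = Math.clz32(rest) + 1;  // position of the first 1-bit
        if (rank > maxRank[bucket]) maxRank[bucket] = rank;
    }
    var avg = maxRank.reduce(function (a, r) { return a + r; }, 0) / m;
    return m * Math.pow(2, avg);          // crude estimate; real HLL corrects this
}

With b = 10, for instance, this keeps only 2^10 = 1024 small counters no matter how many items the stream contains, which is why the memory use stays tiny compared with the hash-table approach in the question.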

That's the main idea of this algorithm. Some details are missing here (the correction for low estimates, for example), but it's all well described in the paper.
