How the LogLog algorithm works with a single hash function


Problem description

I have found dozens of explanations of the basic idea of LogLog algorithms, but they all lack details about how the splitting of the hash function's result works. I mean, using a single hash function is not precise, while using many hash functions is too expensive. How do they overcome the problem with a single hash function?

This answer is the best explanation I've found, but it still doesn't make sense to me:

They used one hash but divided it into two parts. One is called a bucket (the total number of buckets is 2^x) and the other is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements and your hash function, which gives values from 0 to 2^10, produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:

0101 011000  bucket 5 will store 1
0110 000011  bucket 6 will store 4
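
For reference, the arithmetic in this quoted example can be reproduced with a short Python sketch. This is an illustration only: the convention that each bucket stores the count of leading zeros of the 6-bit remainder is an assumption inferred from the stored values 1 and 4 shown above.

    # A minimal sketch reproducing the quoted example: a 10-bit hash is
    # split into a 4-bit bucket index (the "header") and a 6-bit remainder.
    # Assumption: the stored value is the number of leading zeros of the
    # remainder, which matches the values 1 and 4 above.
    def split_and_rank(h, total_bits=10, bucket_bits=4):
        rest_bits = total_bits - bucket_bits        # 6 bits remain
        bucket = h >> rest_bits                     # top 4 bits pick the bucket
        rest = h & ((1 << rest_bits) - 1)           # low 6 bits form the value part
        leading_zeros = rest_bits - rest.bit_length()
        return bucket, leading_zeros

    for h in (344, 387):
        bucket, lz = split_and_rank(h)
        print(f"{h} = {h:010b} -> bucket {bucket} stores {lz}")
    # prints: 344 = 0101011000 -> bucket 5 stores 1
    #         387 = 0110000011 -> bucket 6 stores 4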

Could you explain the quoted example, please? You should have 16 buckets because you have a header of length 4, right? So how can you have 16 buckets with only two hashes? Do we estimate only the buckets? So the first bucket is of size 1 and the second of size 4, right? How do you merge the results?

Solution

Hash function splitting: our goal is to use many hyperloglog structures (as an example, let's say 16 hyperloglog structures, each of them using a 64-bit hash function) instead of one, in order to reduce the estimation error. An intuitive approach might be to process each of the inputs in each of these hyperloglog structures. However, in that case we would need to make sure that the hyperloglogs are independent of each other, meaning we would need a set of 16 hash functions that are independent of each other - and that's hard to find!

So we use an alternative approach. Instead of using a family of 64-bit hash functions, we will use 16 separate hyperloglog structures, each using just a 60-bit hash function. How do we do that? Easy: we take our 64-bit hash function and just ignore the first 4 bits, producing a 60-bit hash function. What do we do with the first 4 bits? We use them to choose one of 16 "buckets" (each "bucket" is just a hyperloglog structure; note that 2^4 = 16 buckets). Now each of the inputs is assigned to exactly one of the 16 buckets, where the 60-bit hash function is used to calculate the hyperloglog value. So we have 16 hyperloglog structures, each using a 60-bit hash function.

Assuming that we chose a decent hash function (meaning that the first 4 bits are uniformly distributed and not correlated with the remaining 60 bits), we now have 16 independent hyperloglog structures. We take a harmonic average of their 16 estimates to get a much less error-prone estimate of the cardinality.
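To make the whole pipeline concrete, here is a minimal Python sketch of the scheme described above. It is not the answer's own code: the choice of the first 8 bytes of SHA-1 as a stand-in 64-bit hash, and the bias-correction constant 0.673 (the published HyperLogLog constant for 16 registers), are assumptions for illustration.

    import hashlib

    BUCKET_BITS = 4
    M = 1 << BUCKET_BITS           # 16 buckets, one register each
    VALUE_BITS = 64 - BUCKET_BITS  # a 60-bit hash remains for each bucket
    ALPHA = 0.673                  # bias correction for m = 16 (from the HLL paper)

    def hash64(item):
        # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash.
        return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

    def estimate(items):
        registers = [0] * M
        for item in items:
            h = hash64(item)
            bucket = h >> VALUE_BITS            # first 4 bits choose the bucket
            rest = h & ((1 << VALUE_BITS) - 1)  # remaining 60 bits
            rank = VALUE_BITS - rest.bit_length() + 1  # position of leftmost 1-bit
            registers[bucket] = max(registers[bucket], rank)
        # Harmonic mean of 2^register over the buckets, scaled by alpha * m^2.
        return ALPHA * M * M / sum(2.0 ** -r for r in registers)

    # Estimates the true count of 10000; with only 16 buckets the error can be sizeable.
    print(round(estimate(str(i) for i in range(10000))))

Note that with only 16 buckets the standard error is large (about 1.04/sqrt(16), roughly 26%); real implementations use far more buckets, which is exactly why the splitting trick matters: every extra bucket bit multiplies the number of independent estimators without requiring any new hash functions.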

Hope that clears it up!
