在Bloom过滤器中使用哪个哈希函数 [英] Which hash functions to use in a Bloom filter

查看:204
本文介绍了在Bloom过滤器中使用哪个哈希函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述




  • 要使用哪个函数?



  • $几乎每篇文档都可以读取 Bloom过滤器中使用的散列函数应该是独立的,均匀分布的。

    我知道这个(独立的,均匀分布的)是什么意思,但是我很难找到论证或讨论,哈希函数满足这些要求,因此是合适的。在很多文章中,我已经读过关于使用 FNV Murmur哈希函数的建议,但不是为什么(或者至少没有证据) 。

    在此先感谢!

    解决方案

    在构建Java布隆过滤器库时出现问题。请参阅 Github自述文件,详细了解我对Bloom的哈希函数的分析过滤器。

    我从两个角度看问题:


    • 速度有多快是计算吗?

    • 输出分配是多么统一?


      速度可以很容易地测量通过随机输入的基准。一致性有点困难,需要一些统计数据。使用卡方拟合优度检验,我测量了散列值分布与统一分布的相似程度。

      结果是:




      • 使用 Murmur3 可以在速度和均匀性之间找到最佳平衡点。 不使用Murmur2,因为它对于以小增量变化的输入是不统一的。
      • 使用加密散列函数 256为最好的一致性。

      • 应用 Kirsch-Mitzenmacher-Optimization 仅计算2个而不是k个散列函数( hash_i = hash1 + ix hash2 )。 >


      如果您的实现正在使用Java,我会推荐使用我们的Bloom过滤器哈希库。这是有据可查的和彻底的测试。有关详细信息,包括根据卡方检验的不同散列函数的基准结果及其不一致性,请参阅 Github的repo自述

      I've got the following question about choosing hash functions for Bloom filters:

      • Which functions to use?

      In nearly every document/paper you can read that the hash functions used in a Bloom filter should be independent and uniformly distributed.

      I know what is meant by this (independent and uniformly distributed), but I'm having trouble to find a argumentation or a discussion, which hash functions fulfill those requirements and are therefore suitable. In a lot of posts I've read about suggestions for the usage of the FNV or Murmur hash function, but not why (or at least without a proof) they are suitable.

      Thanks in advance!

      解决方案

      I asked myself the same question when building a Java Bloom filter library. See the Github readme for a detailed treatment of my analysis of hash functions for Bloom filters.

      I looked at the problem from two perspectives:

      • How fast is the computation?
      • How uniform is the output distribution?

      Speed can easily be measured by benchmarks on random input. Uniformity is a bit harder and requires some statistics. Using Chi-Square goodness of fit tests I measured how similar the distribution of hash values is to a uniform distribution.

      The result is:

      • Use Murmur3 for the best trade-off between speed and uniformity. Do not use Murmur2 as it is not uniform for inputs that change in small increments.
      • Use a cryptographic hash function like SHA-256 for the best uniformity.
      • Apply the Kirsch-Mitzenmacher-Optimization to only compute 2 instead of k hash functions (hash_i = hash1 + i x hash2).

      If your implementation is using Java I would recommend using our Bloom filter hash library. It is well documented and thoroughly tested. For the details, including the benchmark results for different hash function and their unformity according to Chi-Square test, see the Github readme of the repo.

      这篇关于在Bloom过滤器中使用哪个哈希函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆