使用布隆过滤器有什么好处? [英] What is the advantage to using bloom filters?

查看:24
本文介绍了使用布隆过滤器有什么好处?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在阅读布隆过滤器,但它们看起来很傻.您可以使用布隆过滤器完成的任何事情,您都可以使用单个哈希函数而不是多个哈希函数,以更少的空间、更高效地完成,或者看起来就是这样.为什么要使用布隆过滤器,它有什么用处?

I am reading up on bloom filters and they just seem silly. Anything you can accomplish with a bloom filter, you could accomplish in less space, more efficiently, using a single hash function rather than multiple, or that's what it seems. Why would you use a bloom filter and how is it useful?

推荐答案

来自维基百科:

布隆过滤器有很强的空间优于其他数据结构用于表示集合,例如自平衡二叉搜索树,尝试、哈希表或简单数组或条目的链接列表.最多其中至少需要存储数据项本身,可以需要一个小数字位,对于小整数,到任意数量的位,例如字符串(尝试是一个例外,因为他们可以共享存储具有相同前缀的元素).已链接结构会产生额外的线性指针的空间开销.一朵花过滤器具有 1% 的错误和最佳另一方面,k 的值,每只需要大约 9.6 位元素 — 无论大小要素.这个优势来了部分是因为它的紧凑性,继承了来自数组,部分来自其概率性质.如果有 1% 的错误阳性率似乎太高了,每个我们为每个元素添加大约 4.8 位的时间我们将其减少了十倍.

Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.

对我来说很清楚.

布隆过滤器本身不存储元素,这是关键点.您不使用布隆过滤器来测试元素是否存在,而是使用它来测试它是否肯定 存在,因为它保证没有漏报.这让您无需为集合中不存在的元素做额外的工作(例如磁盘 IO 来查找它们).

A bloom filter doesn't store the elements themselves, this is the crucial point. You don't use a bloom filter to test if an element is present, you use it to test whether it's certainly not present, since it guarantees no false negatives. This lets you not do extra work for elements that don't exist in a set (such as disk IO to look them up).

而且所有的空间都比哈希表之类的东西少得多(对于大型数据集,它可能部分位于磁盘上).尽管您可以将布隆过滤器结合与哈希表等结构一起使用,但一旦您确定该元素有机会出现.

And all in significantly less space than something like a hash table (which is likely going to be partially on disk for large data sets). Though you may use a bloom filter in conjunction with a structure like a hash table, once you're certain the element has a chance of being present.

因此示例使用模式可能是:

So an example usage pattern might be:

您在磁盘上有大量数据——您决定所需的误差范围(例如 1%),这规定了 m 的值.然后确定最优的k(根据文章中给出的公式).您可以使用此磁盘绑定数据填充过滤器一次.

You've got a lot of data, on disk -- you decide on what error bound you want (e.g. 1%), that prescribes the value of m. Then the optimal k is determined (from the formula given in the article). You populate your filter from this disk-bound data once.

现在您在 RAM 中拥有过滤器.当您需要处理某个元素时,您可以查询过滤器以查看它是否有可能存在于您的数据集中.如果没有,则不会进行额外的工作.没有磁盘读取等(如果它是散列或树等,您将不得不这样做).

Now you have the filter in RAM. When you need to process some element, you query your filter to see if it stands a chance of existing in your data set. If it doesn't, no extra work is done. No disk reads, etc. (Which you would have to do if it were a hash or tree, etc).

否则,如果过滤器显示是的,它就在那里",则有 1% 的可能性是错误的,因此您需要进行必要的工作以找出答案.99% 的情况下,它确实在那里,所以这项工作并非徒劳.

Otherwise, if the filter says "Yes, it's in there", there's a 1% chance that it's wrong, so you do the necessary work to find out. 99% of the time, it really will be there, so the work was not for naught.

这篇关于使用布隆过滤器有什么好处?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆