使用布隆过滤器有什么优势? [英] What is the advantage to using bloom filters?

查看:179
本文介绍了使用布隆过滤器有什么优势?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在阅读绽放过滤器,他们似乎很傻。任何你可以用绽放过滤器完成的任务,你可以用更少的空间,更有效地完成,使用单个散列函数而不是多个,或者这就是它的样子。为什么要使用一个bloom过滤器,它是如何有用的?

解决方案

维基百科


布鲁姆过滤器有一个强大的空间
优于其他数据结构
表示集合,例如
自平衡二叉搜索树,
尝试,哈希表或简单数组
或条目的链表。大多数
需要至少存储
数据项本身,这可以
需要从小数
的位,小的整数到
任意的任意位置比特数,例如
字符串(尝试是一个例外,因为
他们可以在
元素之间共享相同前缀的存储)。链接
结构会为指针带来额外的线性
空间开销。一个Bloom
过滤器,1%的错误和最优的
值为k,另一方面,
每个
元素只需要大约9.6位元,而不管$ b的大小$ b元素。这个优点来自于
部分来自于其紧凑性,从数组继承
,部分来自其
概率性质。如果1%的虚假
积极率似乎太高,每个
时间,我们每个元素
添加约4.8位元素,我们将其减少十倍。




对我来说很清楚。



绽放过滤器不存储元素本身,这是关键点。您不使用绽放过滤器来测试元素是否存在,您可以使用它来测试它是否确实存在,因为它不保证没有错误的否定。这样就可以不为一组中不存在的元素(例如磁盘IO查找它们)执行额外的工作。



而且所有的空间都比像哈希表(很可能部分在大型数据集的磁盘上)。虽然你可以使用一个结合像哈希表一起使用一个绽放过滤器,一旦你确定元素有机会出现。



所以一个示例使用模式可能是:



你有很多数据,在磁盘上 - 你决定你想要什么错误限制(例如1 %),它规定了 m 的值。然后确定最优的 (从文章中给出的公式)。您从此磁盘绑定数据中填充过滤器一次。



现在,您将内存中的过滤器。当您需要处理某些元素时,您可以查询过滤器,看看它是否存在于数据集中。如果没有,没有额外的工作。没有磁盘读取等(如果它是一个哈希或树等,你需要做的)。



否则,如果过滤器说是的,它在在那里,有1%的机会是错误的,所以你做必要的工作找出来。 99%的时间,它真的会在那里,所以工作不是没有的。


I am reading up on bloom filters and they just seem silly. Anything you can accomplish with a bloom filter, you could accomplish in less space, more efficiently, using a single hash function rather than multiple, or that's what it seems. Why would you use a bloom filter and how is it useful?

解决方案

From Wikipedia:

Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.

Pretty clear to me.

A bloom filter doesn't store the elements themselves, this is the crucial point. You don't use a bloom filter to test if an element is present, you use it to test whether it's certainly not present, since it guarantees no false negatives. This lets you not do extra work for elements that don't exist in a set (such as disk IO to look them up).

And all in significantly less space than something like a hash table (which is likely going to be partially on disk for large data sets). Though you may use a bloom filter in conjunction with a structure like a hash table, once you're certain the element has a chance of being present.

So an example usage pattern might be:

You've got a lot of data, on disk -- you decide on what error bound you want (e.g. 1%), that prescribes the value of m. Then the optimal k is determined (from the formula given in the article). You populate your filter from this disk-bound data once.

Now you have the filter in RAM. When you need to process some element, you query your filter to see if it stands a chance of existing in your data set. If it doesn't, no extra work is done. No disk reads, etc. (Which you would have to do if it were a hash or tree, etc).

Otherwise, if the filter says "Yes, it's in there", there's a 1% chance that it's wrong, so you do the necessary work to find out. 99% of the time, it really will be there, so the work was not for naught.

这篇关于使用布隆过滤器有什么优势?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆