布鲁姆过滤器的使用 [英] Bloom filter usage

查看:272
本文介绍了布鲁姆过滤器的使用的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在努力了解布隆过滤器的有效性。我得到它的基本逻辑,空间压缩,快速查找,误报等我只是不能把这个概念变成现实生活中的情况作为是有益的。一个常见的​​应用是在Web缓存使用布隆过滤器。我们使用布隆过滤器来确定一个给定的URL是否是在高速缓存中或没有。为什么我们不只是访问缓存来确定?如果我们得到一个肯定,我们仍然需要到高速缓存中检索网页(这可能不是在那里),但如果一个没有,我们可以得到使用缓存(这可能是用于快速查找优化了相同的答案呢?)。

I am struggling to understand the usefulness of the bloom filter. I get its underlying logic, space compaction, fast lookups, false positives etc. I just cannot put that concept into a real-life situation as being beneficial. One frequent application is use of bloom filters in web caching. We use bloom filter to determine whether a given URL is in the cache or not. Why don't we simply access the cache to determine that? If we get a yes, we still need to go to cache to retrieve the webpage (which might not be there), but in case of a no, we could have got the same answer using the cache (which is probably optimized for fast lookups anyway?).

推荐答案

布鲁姆过滤器是专为情况下,假阴性是一个非常糟糕的事情和误报是可以接受的。

Bloom filters are designed for situations where a false negative is a Very Bad Thing and a false positive is acceptable.

例如,假设你是一个网页浏览器,并有诈骗网站的一个已知的黑名单。黑名单是巨大的 - 在数百GB的 - 所以你不能与浏览器的出货。但是,您可以将其存储在自己的服务器上。在这种情况下,你可以随保存所有的URL适当大小的布隆过滤器的浏览器。访问一个网站之前,你看它在过滤器。然后,如果你得到一个不的答案,你保证,该网址是不是列入黑名单,可以直接访问该网站。如果你得到一个回答是,该网站可能是邪恶的,所以你可以在浏览器调用你的主服务器,以获得真正的答案。你可以节省巨大的通话这里的服务器的数量没有牺牲精度的事实是非常重要的。

For example, suppose that you are making a web browser and have a known blacklist of scam websites. Your blacklist is massive - in the hundreds of gigabytes - so you can't ship it with the browser. However, you can store it on your own servers. In that case, you could ship the browser with a Bloom filter of an appropriate size that holds all the URLs. Before visiting a site, you look it up in the filter. Then, if you get a "no" answer, you're guaranteed that the URL is not blacklisted and can just visit the site. If you get a "yes" answer, the site might be evil, so you can have the browser call up your main server to get the real answer. The fact that you can save a huge number of calls to the server here without ever sacrificing accuracy is important.

缓存思想类似于此设置。可以查询过滤器,以查看是否在页是在高速缓存中。如果你得到一个不的答案,你保证它不缓存,可以做一个昂贵的操作从主源提取数据。否则,你可以检查缓存,看看它是否真的是存在的。在极少数情况下,您可能需要检查缓存,看看它是不是有,然后从主源拉,但你绝不会不小心错过一些真正在缓存中。

The cache idea is similar to this setup. You can query the filter to see if the page is in the cache. If you get a "no" answer, you're guaranteed it's not cached and can do an expensive operation to pull the data from the main source. Otherwise, you can then check the cache to see if it really is there. In rare instances you might need to check the cache, see that it isn't there, then pull from the main source, but you will never accidentally miss something really in cache.

希望这有助于!

这篇关于布鲁姆过滤器的使用的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆