什么是异或过滤器? [英] What is an XOR filter?

查看:81
本文介绍了什么是异或过滤器?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有一种相对较新的数据结构 (2020),称为 XOR 过滤器 被用作布隆过滤器的替代品.

There's a relatively new data structure (2020) called the XOR filter that's being used as a replacement for a Bloom filter.

什么是异或过滤器?与布隆过滤器相比,它有哪些优势?它是如何工作的?

What is an XOR filter? What advantages does it offer over the Bloom filter? And how does it work?

推荐答案

XOR 过滤器被设计为在预先知道要存储在过滤器中的所有项目的情况下替代布隆过滤器.与布隆过滤器一样,它代表一个集合的近似值,其中不允许出现假阴性,但允许出现假阳性.

An XOR filter is designed as a drop-in replacement for a Bloom filter in the case where all the items to store in the filter are known in advance. Like the Bloom filter, it represents an approximation of a set where false negatives are not allowed, but false positives are.

与布隆过滤器一样,异或过滤器存储大量位.然而,与布隆过滤器不同,我们将每个位视为自己的数组槽,在 XOR 过滤器中,位被组合成 L 位序列,对于我们稍后将选择的某些参数 L.例如,XOR 过滤器可能如下所示:

Like a Bloom filter, an XOR filter stores a large array of bits. Unlike a Bloom filter, though, where we think of each bit as being its own array slot, in an XOR filter the bits are grouped together into L-bit sequences, for some parameter L we'll pick later. For example, an XOR filter might look like this:

 +-------+-------+-------+-------+-------+-------+-------+-------+-------+
 | 11011 | 10010 | 11101 | 11100 | 01001 | 10101 | 01011 | 11001 | 11011 |
 +-------+-------+-------+-------+-------+-------+-------+-------+-------+

接下来,我们选择三个散列函数h1h2h3 就像布隆过滤器一样,将项目散列到数组中的插槽中.这些散列函数让我们获取一个项 x 并计算它的 表代码,我们通过将点 h 中的项进行异或来完成1(x)、h2(x) 和 h3(x).此处显示了一个示例:

Next, we pick three hash functions h1, h2, and h3 that, like a Bloom filter, hash items to slots in the array. Those hash functions let us take an item x and compute its table code, which we do by XORing together the items in spots h1(x), h2(x), and h3(x). An example of this is shown here:

 +-------+-------+-------+-------+-------+-------+-------+-------+-------+
 | 11011 | 10010 | 11101 | 11100 | 01001 | 10101 | 01011 | 11001 | 11011 |
 +-------+-------+-------+-------+-------+-------+-------+-------+-------+
             ^                       ^       ^
             |                       |       |
            h3(x)                   h1(x)   h2(x)
              
            Table code for x:   10010 xor 01001 xor 10101
                              = 01110

为了完成这幅图,我们还需要一个称为指纹识别函数的哈希函数,记为f(x).指纹识别函数将一个值作为输入并输出一个 L 位数字,称为 x 的指纹.要查看 x 是否存储在表中,我们检查 x 的表代码是否与 x 的指纹匹配.如果是这样,我们说 x 是(可能)在表中.如果不是,我们就说 x(肯定)不在表中.

To complete the picture, we need one more hash function called the fingerprinting function, denoted f(x). The fingerprinting function takes a value as input and outputs an L-bit number called the fingerprint of x. To see whether x is stored in the table, we check whether the table code for x matches the fingerprint for x. If so, we say that x is (probably) in the table. If not, we say that x is (definitely) not in the table.

将这个想法与布隆过滤器进行比较是有帮助的.使用布隆过滤器,我们将 x 散列到多个位置,然后通过将所有内容进行 AND 运算从这些位置推导出一个值,最后检查我们得到的值是否等于 1.使用 XOR 过滤器,我们将 x 散列到三个位置,从这些位置通过异或运算得出一个值,最后检查我们得到的值是否等于f(x).

It's helpful to compare this idea against a Bloom filter. With a Bloom filter, we hash x to a number of positions, then derive a value from those positions by AND-ing everything together, and finally check if the value we got was equal to 1. With an XOR filter, we hash x to three positions, derive a value from those positions by XOR-ing them together, and finally check if the value we got is equal to f(x).

要更改 XOR 过滤器的误报率,我们只需更改 L 的值.具体来说,f(x) 巧合的可能性与以下给出的三个位置的 XOR 匹配h1(x)、h2(x) 和 h3(x) 是 2-L,因为这是随机 L 位值匹配另一个的概率.因此,为了得到 ε 的误报率,我们只需设置 L = log2 ε-1.

To change the false positive rate for the XOR filter, we simply change the value of L. Specifically, the likelihood that f(x) coincidentally happens to match the XOR of the three locations given by h1(x), h2(x), and h3(x) is 2-L, since that's the probability that a random L-bit value matches another. Therefore, to get a false positive rate of ε, we simply set L = log2 ε-1.

具有挑战性的部分是填表.事实证明,这样做有一个非常简单的策略.要存储 n 个元素的列表,请创建一个大小为 1.23n 的表.然后,使用这个递归过程:

The challenging part is filling in the table. It turns out that there's a really simple strategy for doing so. To store a list of n elements, create a table of size 1.23n. Then, use this recursive procedure:

  1. 如果没有可放置的物品,您就大功告成了.
  2. 选择具有以下属性的项目 x:x 散列到没有其他项目散列到的表槽(例如槽 k).
  3. 从要放置的项目列表中删除 x 并递归放置剩余的项目.
  4. 将槽 k 的值设置为一个数字,使得表槽 x 的 XOR 散列等于 f(x).(这总是可能的:简单地将其他两个表槽和 f(x) 的内容异或在一起,然后将其存储在槽 k 中.)
  1. If there are no items left to place, you're done.
  2. Pick an item x with the following property: x hashes to a table slot (say, slot k) that no other items hash to.
  3. Remove x from the list of items to place and recursively place the remaining items.
  4. Set the value of slot k to a number such that the XOR of the table slots x hashes to equals f(x). (This is always possible: simply XOR together the contents of the other two table slots and f(x), then store that in slot k.)

如果每个 table slot 至少有两个项目散列到它,这个过程有很小的机会卡在步骤 (2) 中,但可以证明,只要你使用至少 1.23n 个 table slot,概率这种情况发生的非常少.如果发生这种情况,只需选择新的哈希函数并重试.

This procedure has a slight chance of getting stuck in step (2) if every table slot has at least two items hashing to it, but it can be shown that as long as you use at least 1.23n table slots the probability that this occurs is extremely small. If that happens, simply choose new hash functions and try again.

XOR 过滤器与常规布隆过滤器相比有几个优点.

XOR filters have several advantages over regular Bloom filters.

  • 为了降低布隆过滤器的误报率,我们必须添加更多函数.具体来说,对于 ε 的错误率,我们需要使用 log2 ε-1 哈希函数.另一方面,XOR 过滤器总是恰好使用三个哈希函数.
  • 因此,Bloom 过滤器中的查找通常比 XOR 过滤器中的查找慢,因为探测到的每个表槽本质上都位于随机位置,并且可能会导致缓存未命中.使用布隆过滤器,我们每个项目有 log2 ε-1 缓存未命中.使用 XOR 过滤器,我们每个项目有 3 次缓存未命中.
  • 布隆过滤器使用更多空间.错误率为 ε 的布隆过滤器需要一个大小为 1.44n log2 ε-1 的表.XOR 过滤器具有 1.23n 个项目的数组,每个项目的长度为 log2 ε-1 位,总空间使用量为 1.23n log2 ε-1.
  • To decrease the false positive rate of a Bloom filter, we have to add in more functions. Specifically, for an error rate of ε, we need to use log2 ε-1 hash functions. On the other hand, an XOR filters always use exactly three hash functions.
  • As a consequence of this, lookups in a Bloom filter typically are slower than in an XOR filter, since each table slot probed is essentially in a random location and probably causes a cache miss. With Bloom filters, we have log2 ε-1 cache misses per item. With XOR filters, we have three cache misses per item.
  • Bloom filters use more space. A Bloom filter with error rate ε needs a table of size 1.44n log2 ε-1. An XOR filter has an array of 1.23n items, each of which is log2 ε-1 bits long, for a total space usage of 1.23n log2 ε-1.

XOR 过滤器相对于 Bloom 过滤器有一个主要缺点,那就是在构建过滤器之前必须提前知道要存储在 XOR 过滤器中的所有项目.这与布隆过滤器形成对比,布隆过滤器可以在很长一段时间内逐步添加项目.除此之外,XOR 过滤器还提供了更好的性能和内存使用.

XOR filters have one main disadvantage relative to Bloom filters, and that's that all the items to store in the XOR filter must be known in advance before the filter is constructed. This contrasts with Bloom filters, where items can be added incrementally over a long period of time. Aside from this, though, the XOR filter offers better performance and memory usage.

有关 XOR 过滤器的更多信息,以及它们在时间和空间方面与布隆过滤器和布谷鸟过滤器的比较,请查看 这套讲座幻灯片,解释了它们的工作原理,以及1.23 常数的来源以及为什么我们总是使用三个哈希函数.

For more information about XOR filters, along with how they compare in terms of time and space to Bloom filters and cuckoo filters, check out this set of lecture slides, which explains how they work, along with where the 1.23 constant comes from and why we always use three hash functions.

这篇关于什么是异或过滤器?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆