如何散列函数的输出映射到bloomfilter指数? [英] How to map hashfunction output to bloomfilter indices?

查看:252
本文介绍了如何散列函数的输出映射到bloomfilter指数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

谁能帮我提供的散列函数的输出是如何映射到布隆过滤指数的轮廓?这里是 bloomfilters

概述
解决方案
  

在散列函数的输出是如何映射到布隆过滤器索引大纲

对于每个 K 的散列函数的使用,它们映射到在布隆过滤器有点就像哈希映射到散列桶中的哈希表。所以,很常见的,你可能有发言权的哈希函数生成的32位整数,然后用模运营商变得有点指数 0℃;&LT ; I< ñ,其中 N 是在你的布隆过滤器的位数。

为了使这个非常具体的,比方说,一个散列函数生成的数字从0到2 ^ 32-1,并有在布隆过滤器1000位:

  INT bit_index = hash_function(将input_value)%1000;
 

重要的是要在这里指出,2 ^ 32-1的大规模比1000说更大的哈希函数,而不是产生pretty的均匀分布的数字,但只有0到1023之间包容很重要,那么模操作之后它会是两倍那bit_index将在0..23范围相比24..999(因为例如,输入2和1002两个结果以2:1的后模量值,但是只为25的输入产生的25的输出)。出于这个原因,如果有一个散列函数产生32位,则可能希望使用尺寸以比特数这是二的幂bloom滤波器,然后切片出的散列值的部分作为好像独立哈希函数为使用 - 您链接维基百科的文章中的所有解释。这需要一个质量好的散列函数,虽然,作为哈希函数的任何集群的缺陷将通过不折不扣的输出通过;具有比特的素数是减轻这种差散列的一种方式。不过,具有良好的散列函数,两个大国也可以很容易地使用按位与操作,并提取位指数 - 如果需要的话 - 位移,可以比整数模快,虽然散列函数很可能会在侏儒的考虑整体性能概况

编辑 - 处理意见......

假设你的MD5函数返回一个无符号字符* P的数据 MD5_DIGEST_LENGTH 字节,我建议你尝试:

  BOOST_STATIC_ASSERT(MD5_DIGEST_LENGTH> = sizeof的(INT));
INT bit_index = * reinter pret_cast<无符号整型*&GT(P)%num_of_bloom_filter_bits;
 

这实际上的一个特别坏主意的 - 对不起 - 我会解释为什么在某一时刻的两个原因。首先,要回答你的问题它做什么: BOOST_STATIC_ASSERT()的目的是给你一个编译错误,如果它是通过EX pression已评估,以<​​code>假。在这里,它基本上记录要求的一种方式, MD5_DIGEST_LENGTH - 这是在MD5哈希的文本再presentation字符的大小 - 至少长达作为字节数系统使用了一个 INT 整数类型。 (该尺寸大概是4个字节,但可能是8),这一要求是为了保证在下一行 reinter pret_cast 是安全的。什么是不被读取字节的值在MD5哈希的文本再presentation开始,就好像这些字节包含了一个 INT 。所以,说你的 INT 尺寸的的4,MD5哈希值是0cc175b9c0f1b6a831c399e269772661在您的评论:前4个字节包含0cc1。 ASCII码codeS该文本是48,99,99,49小数。当他们读入 INT ,根据CPU的价值可能会有所不同的存储方式,但基本上你会得到一个这些数字的时候256 ^ 3另加一倍256 ^ 2加第三次256加上最后的数字。

我说,这是一个特别坏主意的原因是:

  • 在MD5字符串中的每个字符是数字(ASCII codeS 48-57),或从A到F(97-102)的信。这16个值ONY,一个字节可以有变化的16日,和而 INT 值生成占32位,你真的只得到2 ^ 16个不同的值。
  • 在某些计算机, INT 取值必须在内存地址是2的倍数,4,8等。 reinter $对齐p $ pt_cast - 如果文本恰好开始在一个不兼容的地址,可能会崩溃您的计算机。注:英特尔和放大器; AMD的没有这样的对齐要求,虽然它可能是更快它们正确对齐数据进行操作。

所以,另一项建议:

  //创建合适大小的缓冲区持有有效的无符号长六角重新presentation ...
CHAR数据[的sizeof(无符号长)* 2 + 1];

//复制尽可能多的MD5文字与适合的缓冲,NUL终止它...
sprintf的(数据,sizeof的数据 -  1,MD5%* S);

//转换为unsigned long ...
米= strtoul将(数据,/ * endptr * / NULL,/ *基* / 16);
 

下面,如果MD5重新presentation比数据缓冲器短,只是它的初始部分会被安全地复制,所以不需要BOOST_STATIC_ASSERT

这是很容易使用非加密散列函数,因为他们会通常只返回你一个数字,而不是数量的readble文本缓冲区重新presentation,这样就可以避免了这么多废话。

Can anyone help me by providing an outline on how the hash function output is mapped to bloom filter indices? Here is an overview on bloomfilters.

解决方案

an outline on how the hash function output is mapped to a bloom filter indices

For each of the k hash functions in use, they map onto a bit in the bloom filter just as hashes map onto hash buckets in a hash table. So, very commonly you might have say a hash function generating 32 bit integers, then use the modulus % operator to get a bit index 0 << i < n where n is the number of bits in your bloom filter.

To make this very concrete, let's say a hash function generates numbers from 0 to 2^32-1, and there are 1000 bits in your bloom filter:

int bit_index = hash_function(input_value) % 1000;

It's important to note here that 2^32-1 is massively greater than 1000. Say the hash function instead generated pretty evenly distributed numbers but only between 0 and 1023 inclusive, then after the modulus operation it'd be twice as likely that bit_index would be in the 0..23 range as compared to 24..999 (because e.g. inputs 2 and 1002 both result in a post-modulus value of 2, but only an input of 25 produces an output of 25). For that reason, if you have a hash function generating 32 bits, you might want to use a bloom filter sized to a number of bits that's a power of two, then slice out sections of the hash value to use as as if independent hash functions - all explained in the wikipedia article you link. That requires a good quality hash function though, as any "clustering" flaws in the hash function will be passed through unmitigated to the output; having a prime number of bits is one way to mitigate such poor hashing. Still, with good hash functions, powers of two also make it easy to extract bit indices using bitwise AND operations and - if needed - bit shifting, which can be faster than integer modulus, though the hash functions are probably going to dwarf that consideration in the overall performance profile.

Edit - addressing comments...

Assuming your MD5 function's returning an unsigned char* "p" to MD5_DIGEST_LENGTH bytes of data, I suggested you try:

BOOST_STATIC_ASSERT(MD5_DIGEST_LENGTH >= sizeof(int));
int bit_index = *reinterpret_cast<unsigned int*>(p) % num_of_bloom_filter_bits;

That was actually a particularly bad idea - sorry - I'll explain the two reasons why in a moment. First, to answer your question about what it does: BOOST_STATIC_ASSERT() is designed to give you a compile error if the expression it's passed has evaluated to false. Here, it's basically a way of documenting the requirement that MD5_DIGEST_LENGTH - which is the size in characters of the textual representation of the MD5 hash - be at least as long as the number of bytes your system uses for an int integer type. (That size is probably 4 bytes, but might be 8.) That requirement is intended to ensure that the reinterpret_cast in the next line is safe. What that does is read a value from the bytes at the start of the textual representation of the MD5 hash as if those bytes contained an int. So, say your int size is 4, MD5 hash is "0cc175b9c0f1b6a831c399e269772661" as in your comment: the first 4 bytes contain "0cc1". The ASCII codes for that text are 48, 99, 99, 49 decimal. When they're read into an int, depending on the endianness of the CPU the value could differ, but basically you'll get one of those numbers times 256^3 plus another one times 256^2 plus a third times 256 plus the final number.

The reasons I said this was a particularly bad idea are:

  • each character in the MD5 string is either a digit (ASCII codes 48-57) or a letter from "a" through "f" (97-102). Those 16 values are ony a 16th of the variation that a byte can have, and while the int value you generate occupies 32 bits you really only get 2^16 distinct values.
  • on some computers, ints must be aligned at a memory address that's a multiple of 2, 4, 8 etc.. The reinterpret_cast - if the text happens to start at an incompatible address, could crash your computer. Note: Intel & AMDs have no such alignment requirement, though it may be faster for them to operate on properly aligned data.

So, another suggestion:

// create a buffer of the right size to hold a valid unsigned long in hex representation...
char data[sizeof(unsigned long) * 2 + 1];

// copy as much of the md5 text as will fit into the buffer, NUL terminating it...
sprintf(data, "%.*s", sizeof data - 1, md5);

// convert to an unsigned long...
m = strtoul(data, /*endptr*/ NULL, /*base*/ 16);

Here, if the md5 representation was shorter than the data buffer, just the initial part of it would be safely copied, so the BOOST_STATIC_ASSERT isn't required.

It's much easier to use a non-cryptographic hash function, as they'll generally just return you a number rather than a readble text buffer representation of the number, so you can avoid all this nonsense.

这篇关于如何散列函数的输出映射到bloomfilter指数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆