How does the MAD (Multiply, Add, Divide) Hashing function work?
Problem description
For a university project I have been assigned the task of creating data structures (such as a min-heap, hash table, etc.) from scratch. However, the hash table, or more specifically the hash function, has given me quite some trouble. I have come across the MAD (Multiply, Add, Divide) function, which is basically h(x) = [(a*x + b) % p] % N, where a and b are random integers, p is a large prime number, and N is the number of elements in the hash table.
My question is how (and why) exactly this function distributes the values evenly in the hash table.
h(x) = [(a*x + b) % p] % N
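For readers who want to experiment, here is a minimal Python sketch of the formula as written. The prime `p = 1_000_003`, the helper name `make_mad_hash`, and the use of a seeded random generator are illustrative assumptions, not part of the question; keys are assumed to be non-negative integers.

```python
import random

def make_mad_hash(n_buckets, p=1_000_003, seed=None):
    """Return h(x) = ((a*x + b) % p) % n_buckets with randomly chosen a and b.

    p = 1_000_003 is an assumed prime, picked only for illustration.
    """
    rng = random.Random(seed)
    a = rng.randrange(1, p)   # a == 0 would send every key to the same bucket
    b = rng.randrange(0, p)

    def h(x):
        return ((a * x + b) % p) % n_buckets

    return h

h = make_mad_hash(n_buckets=16, seed=1)
print([h(x) for x in range(8)])   # bucket indices, each in 0..15
```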
Let's look at a*x + b in isolation first. If you imagine a broken down into a sum of powers of two, a*x is then the sum of copies of x bit-shifted left by those powers of two, such that each bit in x impacts the bit positions selected by the bits set in a, and some further bits when the summation produces carries. Adding b mixes in another set of random bits: much like XORing would, but with some extra complexity from the carries. If, say, x is a value between 0 and 255, with bits abcdefgh (each being 0 or 1), then so far we've got:
(a&1 ? abcdefgh : 0) +
(a&2 ? abcdefgh0 : 0) +
(a&4 ? abcdefgh00 : 0) +
(a&8 ? abcdefgh000 : 0) +
... + // continues for a&16, a&32 etc.
ABCDEFGHIJKLMNOP // however many random bits in "b"
So, in the "1s" column we're summing h and P, which might carry into the "2s" column with g, h and O, and on it goes.
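The column-by-column picture can be checked with a few lines of Python: adding one left-shifted copy of x for every bit set in a, then adding b, gives the same number as a*x + b. The function name `shift_add` and the sample values are made up for this illustration.

```python
def shift_add(a, x, b):
    """Compute a*x + b by adding x << i for every bit i that is set in a."""
    total = b
    i = 0
    while a >> i:              # stop once no set bits remain at or above position i
        if (a >> i) & 1:
            total += x << i    # one shifted copy of x per set bit of a
        i += 1
    return total

# The shifted-sum view agrees with ordinary multiplication and addition.
for a, x, b in [(37, 0b10110101, 0xBEEF), (8, 255, 0), (1, 0, 7)]:
    assert shift_add(a, x, b) == a * x + b
```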
If a is, say, 37, which is 32+4+1, then we're adding x itself, x << 2, and x << 5: each bit in x thereby impacts more bits in the hash value. This is good: indeed, with a cryptographic-strength hash function, changing any bits in the key (whether a single bit, half of them, or all of them) should pretty much randomly flip about half the bits in the hash value.
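To see the effect of a = 37 concretely, the sketch below flips a single bit of x and counts how many bits of a*x + b change. The helper name `flipped_output_bits` and the sample values of x and b are assumptions for illustration only; this is nowhere near a cryptographic avalanche test.

```python
def flipped_output_bits(a, b, x, bit):
    """Count how many bits of a*x + b change when one bit of x is flipped."""
    before = a * x + b
    after = a * (x ^ (1 << bit)) + b
    return bin(before ^ after).count("1")

x = 0b10110101
assert 37 * x == x + (x << 2) + (x << 5)   # 37 = 32 + 4 + 1

# With no carries involved (x = 0, b = 0), one input bit touches one output
# bit per set bit of a.
print(flipped_output_bits(a=1, b=0, x=0, bit=0))    # 1
print(flipped_output_bits(a=37, b=0, x=0, bit=0))   # 3

# With a non-zero x and b, carries can change the count in either direction.
for bit in range(8):
    print(bit, flipped_output_bits(a=37, b=0xBEEF, x=x, bit=bit))
```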
Returning to the full formula, let's imagine we skipped the % p and had just % N, but the current table size is a power of two: % N is then equivalent to a bitwise-AND that keeps only some number of the less-significant bits. Put another way, it's throwing away a lot of the randomness we've built up in the more significant bits of our a * x + b calculation. So, to make the hash function safe to use with any number of buckets, we can introduce % p first, which means that if there are patterns in the hash value related to power-of-two positions from the summation step, they're effectively scattered across random positions in the 0..p range.
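A small experiment makes the power-of-two problem visible. The keys below are all multiples of the table size, so without % p their low bits never vary; the constants a = 5003, b = 11 and the prime p = 8191 are arbitrary choices for this sketch, and `buckets_used` is a hypothetical helper.

```python
def buckets_used(keys, a, b, n, p=None):
    """Return the set of bucket indices hit by the keys, with or without % p."""
    used = set()
    for x in keys:
        v = a * x + b
        if p is not None:
            v %= p
        used.add(v % n)
    return used

N = 16
keys = [16 * k for k in range(64)]   # every key is a multiple of the table size

# For a power-of-two N, % N is the same as masking off the low bits.
assert all(v % N == v & (N - 1) for v in range(1000))

without_p = buckets_used(keys, a=5003, b=11, n=N)
with_p = buckets_used(keys, a=5003, b=11, n=N, p=8191)

print(without_p)     # {11}: every key lands in bucket b % N
print(len(with_p))   # more than one bucket once % p is applied first
```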
Consider, say, a hash between 0 and 255: if N was 200, we'd be twice as likely to hash to a bucket in the 0..55 range as to any other bucket. To make this effect less significant, we want the hash value to have many more bits than the MOD value, and this principle applies in a layered way to the values we should choose for p and N:
- a * x + b values should tend to be significantly larger than p, and be spread across a range much larger than p, so % p separates them more across the buckets, but
- p should be much larger than N, so we don't have low-indexed buckets with significantly higher collision probabilities (which is especially bad if you're using linear probing to resolve collisions).
For example, if we wanted to support values of N up to 2^24, and we're doing these calculations with 32-bit unsigned integers so a and b have random values in that range, we might split the difference and pick a prime around 2^28.
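A rough sketch of that parameter choice in Python follows. The helpers `is_prime`, `prime_at_or_below` and `make_hash32` are hypothetical names, and the `& MASK32` step is an assumption that mimics 32-bit unsigned wraparound as it would occur in a language like C; Python integers do not overflow on their own.

```python
import random

def is_prime(n):
    """Trial division; cheap enough near 2**28, where sqrt(n) is only 16384."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def prime_at_or_below(n):
    """Walk downwards from n until a prime is found."""
    while not is_prime(n):
        n -= 1
    return n

MASK32 = 0xFFFFFFFF                 # assumed 32-bit unsigned arithmetic
P = prime_at_or_below(2**28)        # a prime roughly halfway (in bits) between N and 2**32

def make_hash32(n_buckets, rng=random):
    """Build h(x) = (((a*x + b) mod 2**32) % P) % n_buckets with random 32-bit a, b."""
    assert n_buckets <= 2**24
    a = rng.randrange(1, 2**32)
    b = rng.randrange(0, 2**32)

    def h(x):
        return (((a * x + b) & MASK32) % P) % n_buckets

    return h

h32 = make_hash32(1000)
print(P, [h32(x) for x in range(5)])
```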