How does the MAD (Multiply, Add, Divide) Hashing function work?

Question

I have been assigned, as a university project, the task of creating data structures (such as a min-heap, a hash table, etc.) from scratch. However, the hash table, or more specifically the hash function, has given me quite some trouble. I have come across the MAD (Multiply, Add, Divide) function, which is basically h(x) = [(a*x + b) % p] % N, where a and b are random integers, p is a large prime number, and N is the size of the hash table (the number of buckets).

My question is how (and why) exactly this function distributes the values evenly across the hash table.

Answer

h(x) = [(a*x + b) % p] % N

Let's look at a*x + b in isolation first. If you imagine a broken down into a sum of powers of two, a*x is then the sum of copies of x bit-shifted left by those powers of two, such that each bit in x affects the bit positions corresponding to the bits set in a, plus some further bits when the summation produces carries. Adding b mixes in another set of random bits: much like XORing would, but with some extra complexity from the carries. If, say, x is a value between 0 and 255, with bits abcdefgh (each being 0 or 1), then so far we've got:

         (a&1 ? abcdefgh : 0) +
        (a&2 ? abcdefgh0 : 0) +
       (a&4 ? abcdefgh00 : 0) +
      (a&8 ? abcdefgh000 : 0) +
                     ...      +  // continues for a&16, a&32 etc.
        ABCDEFGHIJKLMNOP         // however many random bits in "b"

So, in the "1s" column we're summing h and P, which might carry into the "2s" column with g, h and O, and on it goes.

If a is say 37, which is 32+4+1, then we're adding x itself, x << 2, and x << 5: each bit in x thereby impacts more bits in the hash value (this is good, indeed with a cryptographic-strength hash function, changing any bits in the key - whether a single bit, half or all of them - should pretty much randomly flip about half the bits in the hash value).
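The shift-and-add view can be checked directly. The following is a minimal sketch in Python (my choice of language, not the original answer's). The helper names multiply_by_shifts and bits_changed are hypothetical, the constant b = 12345 is arbitrary, and a is assumed to be non-negative.

```python
def multiply_by_shifts(a: int, x: int) -> int:
    """Compute a * x by adding x shifted left once for every set bit in a.

    Assumes a is a non-negative integer.
    """
    total = 0
    shift = 0
    while a >> shift:
        if (a >> shift) & 1:
            total += x << shift
        shift += 1
    return total


def bits_changed(u: int, v: int) -> int:
    """Count the bit positions in which u and v differ."""
    return bin(u ^ v).count("1")


a, b = 37, 12345          # 37 = 32 + 4 + 1; b is an arbitrary constant
x = 0b10110010            # an 8-bit key

# a*x really is x + (x << 2) + (x << 5) when a is 37.
assert multiply_by_shifts(a, x) == x + (x << 2) + (x << 5) == a * x

# Flip the lowest bit of x and compare how many bits change on each side.
y = x ^ 1
print(bits_changed(x, y))                  # the keys differ in 1 bit
print(bits_changed(a * x + b, a * y + b))  # the mixed values differ in several
```

A single multiplier like this spreads a one-bit change over several bits, though nowhere near the "about half the bits" behaviour expected of a cryptographic-strength hash.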

Returning to the full formula, let's imagine we skipped the % p and had just % N, but the current table size is a power of two: % N is then equivalent to a bitwise-AND operation for some number of less-significant bits. Put another way, it's throwing away a lot of the randomness we've built up in the more significant bits of our a * x + b calculation. So, to make the hash function safe to use with any number of buckets, we can introduce % p first, which means if there are patterns in the hash value related to power-of-two positions from the summation step, they're effectively scattered across random positions in the 0..p range.
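To see what skipping % p can cost, here is a small Python sketch that compares bucket usage with and without that step. All the constants are illustrative assumptions of mine: N = 16, a = 37, b = 11, p = 1009, and keys that are multiples of 16, so that they differ only in their more significant bits.

```python
N = 16                    # power-of-two table size (illustrative)
a, b, p = 37, 11, 1009    # illustrative constants; 1009 is prime

keys = [16 * k for k in range(100)]   # keys whose low 4 bits are all zero

# For a power-of-two N, "% N" keeps only the low bits of the value.
assert all((a * x + b) % N == (a * x + b) & (N - 1) for x in keys)

without_p = {(a * x + b) % N for x in keys}
with_p = {((a * x + b) % p) % N for x in keys}

print(len(without_p))  # 1: every key lands in the same bucket
print(len(with_p))     # more than 1: the keys are spread over several buckets
```

Without % p, the low four bits of a*x + b are the same for every key in this set, so the variation in the upper bits never reaches the bucket index.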

Consider, say, a hash between 0 and 255: if N were 200, each bucket in the 0..55 range would be twice as likely to be hit as each bucket in the 56..199 range, because hash values 200..255 wrap around onto buckets 0..55. To make this effect less significant, we want the hash value to have many more bits than the modulus, and this principle applies in a layered way to the values we should choose for p and N:

  • a * x + b values should tend to be significantly larger than p, and be spread across a range much larger than p, so % p separates them more across the buckets, but

  • p should be much larger than N, so we don't have low-indexed buckets with significantly higher collision probabilities (which is especially bad if you're using linear probing to resolve collisions).
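The bias described before the list can be confirmed by counting. This Python sketch uses the 0..255 and N = 200 figures from the text. The wider 16-bit range is an extra comparison of my own, added to show how the imbalance shrinks once the hashed range is much larger than N.

```python
from collections import Counter

N = 200

# Feed every hash value 0..255 through "% N" exactly once.
counts = Counter(h % N for h in range(256))

assert all(counts[bucket] == 2 for bucket in range(56))      # buckets 0..55
assert all(counts[bucket] == 1 for bucket in range(56, N))   # buckets 56..199

# With a much wider range of hash values, the fullest and emptiest
# buckets differ by only one hit, so the ratio is close to 1.
wide = Counter(h % N for h in range(2**16))
print(max(wide.values()) / min(wide.values()))
```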

For example, if we wanted to support values of N up to 2^24, and we're doing these calculations with 32-bit unsigned integers so a and b have random values in that range, we might split the difference and pick a prime around 2^28.
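Putting the pieces together, a MAD hash along these lines might look like the Python sketch below. This is an illustration under stated assumptions, not a definitive implementation: the names is_prime, largest_prime_at_most and make_mad_hash are hypothetical, 32-bit unsigned arithmetic is emulated with a mask, p is taken as the largest prime not exceeding 2^28, and forcing a to be odd is my own addition.

```python
import random


def is_prime(n: int) -> bool:
    """Trial division; fast enough for values around 2**28."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def largest_prime_at_most(n: int) -> int:
    """Walk downwards from n (assumed >= 2) until a prime is found."""
    while not is_prime(n):
        n -= 1
    return n


def make_mad_hash(num_buckets: int, seed=None):
    """Return h(x) = [(a*x + b) % p] % num_buckets with random a and b."""
    rng = random.Random(seed)
    p = largest_prime_at_most(2**28)   # a prime "around 2^28"
    # Assumption: force a to be odd so that x -> a*x is one-to-one
    # modulo 2**32 (an even a would discard high bits of x).
    a = rng.randrange(1, 2**32) | 1
    b = rng.randrange(0, 2**32)

    def h(x: int) -> int:
        mixed = (a * x + b) & 0xFFFFFFFF   # emulate 32-bit unsigned wraparound
        return (mixed % p) % num_buckets

    return h


# Usage: hash 10,000 consecutive keys into 1,000 buckets.
h = make_mad_hash(num_buckets=1000, seed=1)
buckets = [h(x) for x in range(10_000)]
print(len(set(buckets)))   # number of distinct buckets that were hit
```

Note that textbook presentations of MAD often draw a and b from the range below p instead. The version here follows the 32-bit layering described in this answer.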
