How does the MAD (Multiply, Add, Divide) Hashing function work?

Views: 129 Published: 2020/11/22 5:34:32 Tags: c++ hashmap hashtable


Problem description

As a university project, I have been assigned the task of creating data structures (such as a min-heap, a hash table, etc.) from scratch. However, the hash table, or more specifically its hash function, has given me quite some trouble. I have come across the MAD (Multiply, Add, Divide) function, which is basically h(x) = [(a*x + b) % p] % N, where a and b are random integers, p is a large prime number, and N is the number of elements in the hash table.

My question is how (and why) exactly this function distributes the values evenly in the hash table.

Solution

h(x) = [(a*x + b) % p] % N

Let's look at a*x + b in isolation first. If you imagine a broken down into a sum of powers of two, a*x is then the sum of copies of x shifted left by those powers of two, so each bit in x affects the bit positions selected by the bits set in a, plus some further bits when the summation produces carries. Adding b mixes in another set of random bits: much like XORing would, but with some extra complexity from the carries. If, say, x is a value between 0 and 255, with bits abcdefgh (each being 0 or 1), then so far we've got:

         (a&1 ? abcdefgh : 0) +
        (a&2 ? abcdefgh0 : 0) +
       (a&4 ? abcdefgh00 : 0) +
      (a&8 ? abcdefgh000 : 0) +
                     ...      +  // continues for a&16, a&32 etc.
        ABCDEFGHIJKLMNOP         // however many random bits in "b"

So, in the "1s" column we're summing h and P, which might carry into the "2s" column with g, h and O, and on it goes.

If a is, say, 37, which is 32+4+1, then we're adding x itself, x << 2, and x << 5: each bit in x thereby affects more bits in the hash value. This is good: with a cryptographic-strength hash function, changing any bits in the key (whether a single bit, half of them, or all of them) should pretty much randomly flip about half the bits in the hash value.
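The a = 37 decomposition can be checked directly. The following is a minimal sketch rather than part of the original answer; the function names `multiply_by_37_with_shifts` and `bits_changed` are hypothetical, and 32-bit unsigned arithmetic is assumed.

```cpp
#include <cassert>
#include <cstdint>

// 37 = 32 + 4 + 1, so 37 * x is x plus two left-shifted copies of x.
// Unsigned arithmetic wraps modulo 2^32, and both forms wrap identically.
std::uint32_t multiply_by_37_with_shifts(std::uint32_t x) {
    return x + (x << 2) + (x << 5);
}

// Number of bit positions in which two 32-bit values differ.
int bits_changed(std::uint32_t u, std::uint32_t v) {
    std::uint32_t diff = u ^ v;
    int count = 0;
    while (diff != 0) {
        count += static_cast<int>(diff & 1u);
        diff >>= 1;
    }
    return count;
}
```

For instance, going from x = 0 to x = 1 changes three bits of the product, one for each bit set in 37. That is far from the "about half the bits" ideal, which is one reason the later modulo steps matter.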

Returning to the full formula, let's imagine we skipped the % p and had just % N, and the current table size is a power of two: % N is then equivalent to a bitwise AND that keeps only some number of the less significant bits. Put another way, it throws away a lot of the randomness we've built up in the more significant bits of our a * x + b calculation. So, to make the hash function safe to use with any number of buckets, we can introduce % p first, which means that if there are patterns in the hash value related to power-of-two positions from the summation step, they're effectively scattered across random positions in the 0..p range.
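The equivalence between % N and a bitwise AND for power-of-two table sizes can be sketched as follows. The helper names `is_power_of_two` and `bucket_by_mask` are hypothetical and used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// True when n is a power of two (exactly one bit set).
bool is_power_of_two(std::uint32_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// For a power-of-two table size n, value % n keeps only the low log2(n) bits,
// which is the same as masking with n - 1.
std::uint32_t bucket_by_mask(std::uint32_t value, std::uint32_t n) {
    assert(is_power_of_two(n));
    return value & (n - 1);
}
```

With n = 16, the values 0x12345678 and 0xABCDEF08 both land in bucket 8, because they share the same low four bits even though every higher bit group differs.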

Consider, say, a hash value between 0 and 255: if N were 200, we'd be twice as likely to hash to a bucket in the 0..55 range, because the values 200..255 wrap around onto buckets 0..55. To make this effect less significant, we want the hash value to have many more bits than the MOD value, and this principle applies in a layered way to the values we should choose for p and N:

a * x + b values should tend to be significantly larger than p, and be spread across a range much larger than p, so % p separates them more across the buckets, but
p should be much larger than N, so we don't have low-indexed buckets with significantly higher collision probabilities (which is especially bad if you're using linear probing to resolve collisions).
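The uneven-bucket effect can be made concrete by counting. This is a small sketch under the assumption that every hash value in the range occurs exactly once; the function name `bucket_counts` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count how many of the hash values 0..hash_range-1 land in each bucket
// when reduced with % bucket_count.
std::vector<int> bucket_counts(int hash_range, int bucket_count) {
    assert(hash_range > 0 && bucket_count > 0);
    std::vector<int> counts(static_cast<std::size_t>(bucket_count), 0);
    for (int h = 0; h < hash_range; ++h) {
        ++counts[static_cast<std::size_t>(h % bucket_count)];
    }
    return counts;
}
```

With 256 hash values and 200 buckets, buckets 0..55 receive two values each and the rest receive one. With 65536 hash values and the same 200 buckets, every bucket receives either 327 or 328 values, so the imbalance becomes negligible as the hash range grows relative to the bucket count.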

For example, if we wanted to support values of N up to 2²⁴, and we're doing these calculations with 32-bit unsigned integers so that a and b have random values in that range, we might split the difference and pick a prime around 2²⁸.
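Putting the pieces together, the formula might be sketched as below. This is an illustration under stated assumptions, not a definitive implementation: the names `MadHash`, `is_prime`, and `largest_prime_at_most` are hypothetical, a*x + b is allowed to wrap modulo 2³² as 32-bit unsigned arithmetic does, and choosing a and b randomly is left to the caller.

```cpp
#include <cassert>
#include <cstdint>

// Trial-division primality test; adequate for 32-bit values.
bool is_prime(std::uint32_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Largest prime less than or equal to limit (limit must be at least 2).
std::uint32_t largest_prime_at_most(std::uint32_t limit) {
    assert(limit >= 2);
    while (!is_prime(limit)) --limit;
    return limit;
}

// h(x) = ((a*x + b) % p) % n, with a*x + b wrapping modulo 2^32.
struct MadHash {
    std::uint32_t a;  // random multiplier, should be non-zero
    std::uint32_t b;  // random offset
    std::uint32_t p;  // prime, chosen much larger than n
    std::uint32_t n;  // number of buckets

    std::uint32_t operator()(std::uint32_t x) const {
        return ((a * x + b) % p) % n;
    }
};
```

A table supporting up to 2²⁴ buckets could then be set up with something like `MadHash h{a, b, largest_prime_at_most(1u << 28), 1u << 24};`, where `a` and `b` come from a random number generator of your choice. Since 2²⁸ itself is not prime, the helper walks downward to the nearest prime below it.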
