Why should hash functions use a prime number modulus?


Question

A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation for a hashing function said that it should ultimately mod by a prime number because of "the nature of math".

What do you expect from a $1.25 book?

Anyway, I've had years to think about the nature of math, and still can't figure it out.

Is the distribution of numbers truly more even when there are a prime number of buckets? Or is this an old programmer's tale that everyone accepts because everybody else accepts it?

Answer

Usually a simple hash function works by taking the "component parts" of the input (characters in the case of a string), and multiplying them by the powers of some constant, and adding them together in some integer type. So for example a typical (although not especially good) hash of a string might be:

(first char) + k * (second char) + k^2 * (third char) + ...

Then if a bunch of strings all having the same first char are fed in, then the results will all be the same modulo k, at least until the integer type overflows.
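
As a quick sketch, this scheme and its weakness can be shown in Python (the function name `poly_hash` is mine, not from any library; note that Python integers never overflow, unlike the fixed-width integer type the text assumes):

```python
def poly_hash(s, k=31):
    """Polynomial hash: (first char) + k*(second char) + k^2*(third char) + ..."""
    h = 0
    for i, ch in enumerate(s):
        h += ord(ch) * k ** i
    return h

# Strings sharing a first character all collide modulo k,
# because every term past the first is a multiple of k:
for s in ["apple", "avocado", "anchor"]:
    print(s, poly_hash(s) % 31)  # all print 4, i.e. ord('a') % 31
```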

[As an example, Java's string hashCode is eerily similar to this - it processes the characters in reverse order, with k=31. So you get striking relationships modulo 31 between strings that end the same way, and striking relationships modulo 2^32 between strings that are the same except near the end. This doesn't seriously mess up hashtable behaviour.]
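
Java's algorithm can be mimicked in Python as a sketch (the helper name is mine; Java actually wraps in a signed 32-bit int, which the unsigned mask here only approximates, and that wraparound eventually perturbs the mod-31 relationship for long strings):

```python
def java_string_hashcode(s):
    """Sketch of Java's String.hashCode(): s[0]*31^(n-1) + ... + s[n-1]."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, as Java's int does (sign aside)
    return h

# "cat" and "robot" end with the same character, so (before any 32-bit
# wraparound) their hashes agree modulo 31:
print(java_string_hashcode("cat") % 31, java_string_hashcode("robot") % 31)  # 23 23
```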

A hashtable maps each hash to a bucket by taking the hash modulo the number of buckets.
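
That bucket-selection step amounts to a one-liner. An illustrative sketch, using Python's built-in `hash` as a stand-in for whatever hash function the table is handed (the names and table size are arbitrary):

```python
NBUCKETS = 16  # illustrative table size

def bucket_for(key):
    # Map an arbitrary hash value onto a bucket index in [0, NBUCKETS).
    # Python's % always yields a non-negative result for a positive modulus.
    return hash(key) % NBUCKETS
```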

It's important in a hashtable not to produce collisions for likely cases, since collisions reduce the efficiency of the hashtable.

Now, suppose someone puts a whole bunch of values into a hashtable that have some relationship between the items, like all having the same first character. This is a fairly predictable usage pattern, I'd say, so we don't want it to produce too many collisions.

It turns out that "because of the nature of maths", if the constant used in the hash, and the number of buckets, are coprime, then collisions are minimised in some common cases. If they are not coprime, then there are some fairly simple relationships between inputs for which collisions are not minimised. All the hashes come out equal modulo the common factor, which means they'll all fall into the 1/n th of the buckets which have that value modulo the common factor. You get n times as many collisions, where n is the common factor. Since n is at least 2, I'd say it's unacceptable for a fairly simple use case to generate at least twice as many collisions as normal. If some user is going to break our distribution into buckets, we want it to be a freak accident, not some simple predictable usage.
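
The effect of a shared factor can be checked directly. A sketch with constants I chose for illustration: with multiplier k = 30 and 12 buckets, gcd(30, 12) = 6, so related inputs land in only a small fraction of the buckets; with 11 buckets (coprime to 30) the very same inputs spread out:

```python
K = 30  # hash multiplier with small factors (2, 3, 5), chosen to expose the problem

def poly_hash(s, k=K):
    """Same polynomial hash as above: sum of ord(ch) * k^position."""
    h = 0
    for i, ch in enumerate(s):
        h += ord(ch) * k ** i
    return h

# "Related" inputs: 100 three-character strings that all start with 'a'.
words = ["a" + c + d for c in "abcdefghij" for d in "abcdefghij"]

def buckets_used(nbuckets):
    return len({poly_hash(w) % nbuckets for w in words})

# gcd(30, 12) = 6: every hash is congruent to ord('a') mod 6, so only
# buckets whose index is congruent to 97 mod 6 can ever be hit.
print(buckets_used(12))  # 2
# gcd(30, 11) = 1: the same inputs reach every bucket.
print(buckets_used(11))  # 11
```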

Now, hashtable implementations obviously have no control over the items put into them. They can't prevent them being related. So the thing to do is to ensure that the constant and the bucket counts are coprime. That way you aren't relying on the "last" component alone to determine the modulus of the bucket with respect to some small common factor. As far as I know they don't have to be prime to achieve this, just coprime.

But if the hash function and the hashtable are written independently, then the hashtable doesn't know how the hash function works. It might be using a constant with small factors. If you're lucky it might work completely differently and be nonlinear. If the hash is good enough, then any bucket count is just fine. But a paranoid hashtable can't assume a good hash function, so should use a prime number of buckets. Similarly a paranoid hash function should use a largeish prime constant, to reduce the chance that someone uses a number of buckets which happens to have a common factor with the constant.

In practice, I think it's fairly normal to use a power of 2 as the number of buckets. This is convenient and saves having to search around or pre-select a prime number of the right magnitude. So you rely on the hash function not to use even multipliers, which is generally a safe assumption. But you can still get occasional bad hashing behaviours with hash functions like the one above, and a prime bucket count could help further.
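
That assumption is easy to confirm with a small check (my own, not from the original answer): an odd multiplier such as 31 is coprime to every power-of-two table size, while an even multiplier never is:

```python
from math import gcd

# An odd multiplier like 31 shares no factor with any power of two:
assert all(gcd(31, 2 ** i) == 1 for i in range(1, 31))

# An even multiplier shares the factor 2 with every power of two:
assert all(gcd(30, 2 ** i) == 2 for i in range(1, 31))

print("ok")
```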

Putting about the principle that "everything has to be prime" is, as far as I know, a sufficient but not a necessary condition for good distribution over hashtables. It allows everybody to interoperate without needing to assume that the others have followed the same rule.
