How to pick prime numbers to calculate the hash code?

Question

This question follows on the answer given by Jon Skeet on the question: "What is the best algorithm for an overridden System.Object.GetHashCode?". To calculate the hash code the following algorithm is used:

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}

I don't understand why the numbers 17 and 23 are chosen. Why don't we pick 3 and 5? Those are prime numbers as well. Can somebody explain what the best prime numbers to pick are and why?

Solution

The comments on the answer you link to already briefly try to explain why 17 and 23 are not good primes to use here.

A lot of .NET classes that make use of hash codes store elements in buckets. Suppose there are three buckets. Then all objects with hash code 0, 3, 6, 9, ... get stored in bucket 0. All objects with hash code 1, 4, 7, 10, ... get stored in bucket 1. All objects with hash code 2, 5, 8, 11, ... get stored in bucket 2.

Now suppose that your GetHashCode() uses hash = hash * 3 + field3.GetHashCode();. This would mean that unless hash is large enough for the multiplication to wrap around, in a hash set with three buckets, which bucket an object would end up in depends only on field3.

With an uneven distribution of objects across buckets, HashSet<T> cannot give good performance.
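The effect can be reproduced with a small sketch. The `BucketDemo` class and its `Combine` and `Bucket` helpers are hypothetical names introduced for illustration, and `Bucket` is only a simplified model of how a hash table might map a hash code to a bucket, not .NET's actual implementation. The fields are plain ints, so each field's hash code is the value itself.

```csharp
using System;

public static class BucketDemo
{
    // Combines three int fields the way the question's GetHashCode does,
    // but with a configurable multiplier (hypothetical helper).
    public static int Combine(int factor, int f1, int f2, int f3)
    {
        unchecked // overflow simply wraps around
        {
            int hash = 17;
            hash = hash * factor + f1;
            hash = hash * factor + f2;
            hash = hash * factor + f3;
            return hash;
        }
    }

    // Simplified model of bucket selection (an assumption, not .NET's code):
    // clear the sign bit, then take the remainder modulo the bucket count.
    public static int Bucket(int hash, int bucketCount)
    {
        return (hash & 0x7FFFFFFF) % bucketCount;
    }

    public static void Main()
    {
        // field3 is fixed at 1; field1 and field2 vary.
        for (int f1 = 0; f1 < 3; f1++)
        {
            for (int f2 = 0; f2 < 3; f2++)
            {
                int bucket = Bucket(Combine(3, f1, f2, 1), 3);
                Console.WriteLine($"field1={f1} field2={f2} field3=1 -> bucket {bucket}");
            }
        }
    }
}
```

With factor 3 and three buckets, every combination above lands in the same bucket, because the contributions of `field1` and `field2` are multiples of 3 and vanish modulo the bucket count.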

You want a factor that is coprime to every possible number of buckets. The number of buckets will itself be prime, for the same reasons, so if your factor is prime, the only risk is that it equals the number of buckets.

.NET uses a fixed list of allowed numbers of buckets:

public static readonly int[] primes = {
    3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197, 239, 293, 353, 431, 521, 631, 761, 919,
    1103, 1327, 1597, 1931, 2333, 2801, 3371, 4049, 4861, 5839, 7013, 8419, 10103, 12143, 14591,
    17519, 21023, 25229, 30293, 36353, 43627, 52361, 62851, 75431, 90523, 108631, 130363, 156437,
    187751, 225307, 270371, 324449, 389357, 467237, 560689, 672827, 807403, 968897, 1162687, 1395263,
    1674319, 2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369};

Your factor should be one that .NET doesn't use, and that other custom implementations are equally unlikely to use. This means 23 is a bad factor. 31 could be okay with .NET's own containers, but could be equally bad with custom implementations.

At the same time, it should not be so low that it gives lots of collisions for common uses. This is a risk with 3 and 5: suppose you have a custom Tuple<int, int> implementation with lots of small integers. Keep in mind that int.GetHashCode() just returns that int itself. Suppose your multiplication factor is 3. That means that (0, 9), (1, 6), (2, 3) and (3, 0) all give the same hash code.
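A short sketch makes the collision concrete. `SmallFactorCollisions` and `Combine` are hypothetical names; the combining scheme is the one from the question, applied to a pair of ints, with the seed 17 assumed unchanged.

```csharp
using System;

public static class SmallFactorCollisions
{
    // Same scheme as the question's GetHashCode, for a pair of ints.
    // int.GetHashCode() returns the int itself, so the values are used directly.
    public static int Combine(int factor, int a, int b)
    {
        unchecked // overflow simply wraps around
        {
            int hash = 17;
            hash = hash * factor + a;
            hash = hash * factor + b;
            return hash;
        }
    }

    public static void Main()
    {
        var pairs = new[] { (0, 9), (1, 6), (2, 3), (3, 0) };
        foreach (var (a, b) in pairs)
        {
            Console.WriteLine(
                $"({a}, {b}): factor 3 -> {Combine(3, a, b)}, " +
                $"factor 486187739 -> {Combine(486187739, a, b)}");
        }
    }
}
```

With factor 3 the hash is 17 * 9 + 3a + b, which is 162 for all four pairs. With the large prime, the four pairs produce four different values.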

Both problems can be avoided by using sufficiently large primes, as pointed out in a comment that Jon Skeet incorporated into his answer:

EDIT: As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good...
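As a minimal sketch of what that suggestion might look like, here is the question's algorithm with the multiplier swapped for 486187739. The `Sample` class, its field types, and keeping 17 as the seed are all assumptions made for illustration; the source only says to multiply by a large prime.

```csharp
using System;

// Hypothetical type with three fields, used only to show the pattern.
public sealed class Sample
{
    private readonly string field1;
    private readonly int field2;
    private readonly string field3;

    public Sample(string field1, int field2, string field3)
    {
        this.field1 = field1;
        this.field2 = field2;
        this.field3 = field3;
    }

    public override bool Equals(object obj)
    {
        return obj is Sample other
            && field1 == other.field1
            && field2 == other.field2
            && field3 == other.field3;
    }

    public override int GetHashCode()
    {
        unchecked // overflow is fine, just wrap
        {
            int hash = 17; // seed kept from the question (assumption)
            // Null fields contribute 0 instead of throwing.
            hash = hash * 486187739 + (field1?.GetHashCode() ?? 0);
            hash = hash * 486187739 + field2.GetHashCode();
            hash = hash * 486187739 + (field3?.GetHashCode() ?? 0);
            return hash;
        }
    }
}
```

Equal instances still produce equal hash codes, which is the only hard requirement; the larger multiplier only changes how the values spread out.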

Once upon a time, large primes for multiplication may have been bad because multiplication by large integers was slow enough that the performance difference was noticeable. Multiplication by 31 would be good in that case because it can be implemented as x * 31 => x * 32 - x => (x << 5) - x. Nowadays, though, the multiplication is far less likely to cause any performance problems, so, generally speaking, the bigger the better.
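The shift identity can be checked directly. `ShiftMultiply` and `TimesThirtyOne` are hypothetical names; the sketch only confirms that the two forms agree under wrapping (unchecked) int arithmetic and says nothing about which one is faster on a given machine.

```csharp
using System;

public static class ShiftMultiply
{
    // x * 31 rewritten as x * 32 - x, with the multiplication by 32
    // expressed as a left shift by 5 bits.
    public static int TimesThirtyOne(int x)
    {
        unchecked
        {
            return (x << 5) - x;
        }
    }

    public static void Main()
    {
        int[] samples = { 0, 1, -1, 12345, int.MaxValue, int.MinValue };
        foreach (int x in samples)
        {
            int viaMultiply = unchecked(x * 31);
            bool same = viaMultiply == TimesThirtyOne(x);
            Console.WriteLine($"{x}: x * 31 == (x << 5) - x is {same}");
        }
    }
}
```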
