Using a larger prime as a multiplier when overriding hashCode()


Question

I have been reading about hash code functions for the past couple of hours and have accumulated a couple of questions about using prime numbers as multipliers in custom hashCode() implementations. I would be grateful for some insight into the following questions:


  • In a comment to @mattb's answer here, @hstoerr advocates the use of larger primes (such as 524287) instead of the common prime 31. My question is, given the following implementation of a hashCode() function for a pair of elements:

@Override
public int hashCode() {
    final int prime = 31;
    int hash1 = (pg1 == null) ? 0 : pg1.hashCode();
    int hash2 = (pg2 == null) ? 0 : pg2.hashCode();
    return prime * (hash1 ^ hash2);
}


doesn't this lead to an overflow on the returned int if prime is a large number?


  • Assuming that the overflow is not a problem (the JVM doing an automatic cast), is it better to do a bitshift instead of a cast?

  • I imagine the performance of the hashCode() function varies significantly with the complexity of the hash code. Does the size of the prime multiplier not affect the performance?

  • Is it better/smarter/faster to use multiple primes in a custom hashCode() function instead of a single multiplier? If not, is there some other advantage? See the example below from @jinguy's answer to a relevant question:

public int hashCode() {
    return a * 13 + b.hashCode() * 23 + (c? 31: 7);
}


where a is an int, b is a String and c is a boolean.


  • How about something like long lhash = prime * (hash1 ^ hash2); then using (int)((lhash >> 32) ^ lhash)? That's something I saw on another question here on SO, but it wasn't really explained why it was a good idea to do it like that.

Recommended answer

Apologies in advance for the novel. Feel free to make suggestions or edit directly. --Chet

There is an overflow, but no exception is thrown.
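
A minimal sketch of that behaviour, assuming the prime 524287 from the question and a hypothetical worst-case input of Integer.MAX_VALUE. The class and method names are made up for illustration. Plain int multiplication wraps silently, while Math.multiplyExact reports the same overflow as an ArithmeticException.

```java
class OverflowDemo {
    // Plain int multiplication: the result silently wraps modulo 2^32.
    static int wrappingProduct(int prime, int value) {
        return prime * value;
    }

    // Returns true if the exact product does not fit in an int.
    static boolean overflows(int prime, int value) {
        try {
            Math.multiplyExact(prime, value); // throws on overflow
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int prime = 524287;             // larger prime mentioned in the question
        int value = Integer.MAX_VALUE;  // hypothetical (hash1 ^ hash2) value
        System.out.println("wrapped product: " + wrappingProduct(prime, value));
        System.out.println("overflowed: " + overflows(prime, value));
    }
}
```

The wrapped product is simply the low 32 bits of the exact 64-bit product, which is why hashCode() can return it without any special handling.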

The danger doesn't come from losing accuracy, but from losing range. Let's use a ridiculous example, where "prime" is a large power of 2, and 8-bit unsigned numbers for brevity. Assume that (hash1 ^ hash2) is 255:

        "prime": 1000 0000
(hash1 ^ hash2): 1111 1111

Showing the truncated digits in brackets, our result is:

        product: [0111 1111] 1000 0000

But multiplying by 128 is the same as shifting left by 7 places. So we know that whatever the value of (hash1 ^ hash2), the seven least-significant bits of the product will be zeros. If (hash1 ^ hash2) is odd (least significant bit = 1), the result of multiplying by 128 will always be 128 (after truncating the higher digits). And if (hash1 ^ hash2) is even (LSB is 0), the product will always be zero.
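
The 8-bit example can be checked with a short sketch. It assumes unsigned 8-bit arithmetic is simulated by masking an int with 0xFF; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class PowerOfTwoMultiplier {
    // Simulates unsigned 8-bit multiplication by keeping only the low 8 bits.
    static int mul8(int multiplier, int value) {
        return (multiplier * value) & 0xFF;
    }

    // Counts the distinct 8-bit results produced for every input 0..255.
    static int distinctResults(int multiplier) {
        Set<Integer> seen = new HashSet<>();
        for (int v = 0; v <= 255; v++) {
            seen.add(mul8(multiplier, v));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("128 * 255 truncated: " + mul8(128, 255));
        System.out.println("distinct results for 128: " + distinctResults(128));
        System.out.println("distinct results for 59:  " + distinctResults(59));
    }
}
```

With the multiplier 128 only two outputs are possible, whereas an odd multiplier such as 59 maps the 256 inputs to 256 different outputs.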

This extends to larger bit sizes. The general point is that if the lower bits of "prime" are zeros, you're doing a shift (or multiple shift + sum) operation that will give you zeros in the lower bits, and the range of the product will suffer.

But let's try making "prime" odd, so that its least significant bit is always 1. Think about decomposing the multiplication into shift / add operations. The unshifted value of (hash1 ^ hash2) will always be one of the summands. The least significant bits that were shifted into guaranteed uselessness by an even "prime" multiplier will now be set based on, at minimum, the bits from the original (hash1 ^ hash2) value.

Now, let's consider a value of prime which is actually prime. If it's more than 2, then we know it's odd, so the lower bits haven't been shifted into uselessness. And by choosing a sufficiently large prime, you get better distribution across the range of output values than you'd get with a smaller prime.

Try some exercises with 16-bit multiplication using 8443 (0010 0000 1111 1011) and 59 (0000 0000 0011 1011). They're both prime, and the lower bits of 59 match the lower bits of 8443. For example, if hash1 and hash2 are both ASCII character values (0 .. 255), then all of the results of (hash1 ^ hash2) * 59 will be <= 15045. This means that only roughly 1/4 of the range of hash values (0..65535) for a 16-bit number is used, and the rest goes unused.

But (hash1 ^ hash2) * 8443 is all over the map. It overflows if (hash1 ^ hash2) is as low as 8. It uses all 16 bits even for very small input numbers. There's much less clustering of hash values across the overall range, even if the input numbers are in a relatively small range.
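
One way to try the exercise is the sketch below. It assumes 16-bit arithmetic is simulated by masking with 0xFFFF and measures spread by counting how many of 16 equal-width buckets (4096 values each) are hit for inputs 0..255. The bucket measure and all names are assumptions chosen for illustration, not part of the original answer.

```java
import java.util.HashSet;
import java.util.Set;

class MultiplierRange16 {
    // Simulates unsigned 16-bit multiplication by keeping only the low 16 bits.
    static int mul16(int multiplier, int value) {
        return (multiplier * value) & 0xFFFF;
    }

    // Smallest input in 1..255 whose exact product needs more than 16 bits,
    // or -1 if no such input exists.
    static int firstOverflowingInput(int multiplier) {
        for (int v = 1; v <= 255; v++) {
            if (multiplier * v > 0xFFFF) {
                return v;
            }
        }
        return -1;
    }

    // Number of 4096-wide buckets (out of 16) hit by the truncated products.
    static int bucketsUsed(int multiplier) {
        Set<Integer> buckets = new HashSet<>();
        for (int v = 0; v <= 255; v++) {
            buckets.add(mul16(multiplier, v) >>> 12); // top 4 of the 16 bits
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        for (int m : new int[] {59, 8443}) {
            System.out.println(m + ": first overflowing input = "
                    + firstOverflowingInput(m)
                    + ", buckets used = " + bucketsUsed(m));
        }
    }
}
```

The products of 59 never leave the lowest quarter of the range, while the products of 8443 wrap around almost immediately and land in far more buckets.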


Assuming that the overflow is not a problem (JVM doing an automatic cast) is it better to do a bitshift instead of a cast?

Most likely not. The JVM should translate the multiplication into an efficient implementation on the host processor anyway. Integer multiplication should be implemented in hardware; if it is not, the JVM is responsible for translating the operation into something reasonable for the CPU. It's very likely that integer multiplication is already highly optimized. If integer multiplication is done more quickly on a given CPU as shift-and-add, the JVM should implement it that way. But it's less likely that the folks writing the JVM would care to watch for cases where multiple shift-and-add operations could have been combined into a single integer multiply.
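
To make the shift-and-add point concrete, here is a small sketch for the common multiplier 31, which equals 2^5 - 1. It only shows that the two spellings produce the same int, including on overflow. It is not a benchmark, and the class and method names are hypothetical.

```java
class ShiftVersusMultiply {
    // The straightforward form.
    static int byMultiply(int h) {
        return 31 * h;
    }

    // The hand-written shift form: 31 * h == 32 * h - h == (h << 5) - h.
    static int byShift(int h) {
        return (h << 5) - h;
    }

    static boolean sameResult(int h) {
        return byMultiply(h) == byShift(h);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 255, 123456789, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int h : samples) {
            System.out.println(h + " -> same result: " + sameResult(h));
        }
    }
}
```

Since both forms are equivalent in wrapping int arithmetic, the readable multiplication can be kept and the choice of machine instructions left to the JIT compiler.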


I imagine the performance of the hashcode function varies significantly based on the complexity of the hashcode. Does the size of the prime multiplier not affect the performance?

No. The operations are the same when done in hardware regardless of the size of the multiplier, the number of bits set, etc. It's probably a couple of clock cycles. It would vary depending on the specific CPU, but should be a constant-time operation regardless of the input values.


Is it better/smarter/faster to use multiple primes in a custom hashcode function instead of a single multiplier? If not, is there some other advantage?

Only if it reduces the possibility of collisions, and this depends on the numbers you're using. If your hash code depends on A and B and they're in the same range, you might consider using different primes or shifting one of the input values to reduce overlap between the bits. Since you're depending on their individual hash codes, and not their values directly, it's reasonable to assume that their hash codes provide good distribution, etc.

One factor that comes to mind is whether you want the hash code for (x, y) to be different from (y, x). If your hash function treats A and B in the same way, then hash(x, y) = hash(y, x). If that's what you want, then by all means use the same multiplier. If not, using a different multiplier would make sense.
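
The difference can be sketched as follows. The symmetric variant mirrors the XOR-based function from the question; the ordered variant and its second multiplier 17 are assumptions picked purely for illustration.

```java
class PairHashes {
    // Treats both inputs identically, so swapping them gives the same hash.
    static int symmetric(int hash1, int hash2) {
        return 31 * (hash1 ^ hash2);
    }

    // Weights each position differently, so swapping the inputs
    // usually changes the result.
    static int ordered(int hash1, int hash2) {
        return 31 * hash1 + 17 * hash2;
    }

    public static void main(String[] args) {
        System.out.println("symmetric(3, 5) = " + symmetric(3, 5));
        System.out.println("symmetric(5, 3) = " + symmetric(5, 3));
        System.out.println("ordered(3, 5)   = " + ordered(3, 5));
        System.out.println("ordered(5, 3)   = " + ordered(5, 3));
    }
}
```

One side effect of the XOR form worth keeping in mind: any pair whose two hash codes are equal hashes to 0, because x ^ x is always 0.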


How about something like long lhash = prime * (hash1 ^ hash2); then using (int)((lhash >> 32) ^ lhash)? That's something I saw on another question here SO, but it wasn't really explained why it was a good idea to do it like that.

Interesting question. In Java, longs are 64-bit and ints are 32-bit. So this generates a hash using twice as many bits as desired, and then derives the result from the high and low bits combined.

If you multiply a number n by a prime p, and the lowermost k bits of n are all zeros, then the lowermost k bits of the product n * p will also be all zeros. This is fairly easy to see: if you're multiplying, say, n = 0011 0000 and p = 0011 1011, then the product can be expressed as the sum of two shift operations. Or,

00110000 * p = 00100000 * p + 00010000 * p
             = (p << 5) + (p << 4)

Taking p = 59 and using unsigned 8-bit ints and 16-bit longs, here are some examples.

 64: 0011 1011 * 0100 0000 = [ 0000 1110 ] 1100 0000 (192)
128: 0011 1011 * 1000 0000 = [ 0001 1101 ] 1000 0000 (128)
192: 0011 1011 * 1100 0000 = [ 0010 1100 ] 0100 0000 (64)

By just dropping the high bits of the result, the range of the resulting hash value is limited when the low bits of the non-prime multiplicand are all zeros. Whether that's an issue in a specific context is, well, context-specific. But for a general hash function it's a good idea to avoid limiting the range of output values even when there are patterns in the input numbers. And in security applications, it's even more critical to avoid anything that would let someone make inferences about the original value based on patterns in the output. Just taking the low bits reveals the exact values of some of the original bits. If we assume that the operation involved multiplying an input number by a large prime, then we know that the original number had as many zeros at the right as the hash output (because the prime's rightmost bit was 1).

By XORing the high bits with the low bits, there's less consistency in the output. And more importantly, it's much harder to make guesses about the input values based on this information. Based on how XOR works, a 1 in the output could mean the original low bit was 0 and the high bit was 1, or the original low bit was 1 and the high bit was 0.

 64: 0011 1011 * 0100 0000 = 0000 1110 1100 0000 => 1100 1110 (206)
128: 0011 1011 * 1000 0000 = 0001 1101 1000 0000 => 1001 1101 (157)
192: 0011 1011 * 1100 0000 = 0010 1100 0100 0000 => 0110 1100 (108)
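
The folding step can be sketched in Java as shown here. Note one assumption about intent: with an int prime, the expression prime * (hash1 ^ hash2) from the question is evaluated in 32-bit arithmetic before it is widened, so the upper half would already be lost. The sketch therefore casts to long first. fold8 is a hypothetical helper that recomputes the 8-bit worked examples.

```java
class FoldHighBits {
    // Widens to long BEFORE multiplying so the high 32 bits of the product
    // survive, then XORs the high half into the low half.
    static int foldedHash(int prime, int hash1, int hash2) {
        long lhash = (long) prime * (hash1 ^ hash2);
        // After the cast to int, >>> and >> give the same result here.
        return (int) ((lhash >>> 32) ^ lhash);
    }

    // 8-bit analogue: 16-bit product, high byte XORed with low byte.
    static int fold8(int p, int n) {
        int product = (p * n) & 0xFFFF;
        return ((product >>> 8) ^ product) & 0xFF;
    }

    public static void main(String[] args) {
        for (int n : new int[] {64, 128, 192}) {
            System.out.println(n + ": folded = " + fold8(59, n));
        }
        System.out.println(foldedHash(524287, 0x12345678, 0x0F0F0F0F));
    }
}
```

The same fold is what Long.hashCode(long) performs in the standard library, so foldedHash could equally be written as Long.hashCode((long) prime * (hash1 ^ hash2)).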
