在python中哈希不同的元组给出相同的结果 [英] hashing different tuples in python give identical result

查看:326
本文介绍了在python中哈希不同的元组给出相同的结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在处理整数矩阵集,我认为将它们表示为元组是有意义的,因为它们是可哈希的.但是hash()函数给我元组带来了奇怪的结果:

I'm working with sets of integer matrices, and I thought representing them as tuples made sense, as they are hashable. However the hash() function gave me strange results for tuples:

hash(((1, -1, 0), (1, 0, 0), (1, 0, -1)))

Out[147]: -697649482279922733

hash(((1, 0, -1), (1, 0, 0), (1, -1, 0)))

Out[148]: -697649482279922733

如您所见,这两个不同的元组具有相同的哈希值.请注意,它们实际上非常相似(第一个和最后一个子元组的交换),但是我找不到更小的示例:例如((0,1),(0,0))((0,0),(0,1))具有不同的哈希值.

As you can see, these two different tuples have the same hash value. Note that they are actually pretty similar (exchange of the first and last subtuples), however I couldn't find a more minimal example: ((0,1),(0,0)) and ((0,0),(0,1)) have different hash values for example.

有什么线索吗?我简直不敢相信这真是太不可思议了……现在,我已经找到了问题所在,因此可以轻松地绕开它,但是我仍然认为在这里值得一提.

Any clue of what's going on? I can't believe that it's just incredibly bad luck... Now that I have found where the problem is I could bypass it easily, but I thought it was worth mentioning here anyway.

推荐答案

直到Python 3.8,元组的哈希都是基于内容的哈希,并使用以下公式(来自

Until Python 3.8, the hash of a tuple is based on the hashes of the content using the following formula (from the tuplehash() function):

Py_uhash_t mult = _PyHASH_MULTIPLIER; /* defined as 1000003UL == 0xf4243 */
x = 0x345678UL;
p = v->ob_item;
while (--len >= 0) {
    y = PyObject_Hash(*p++);
    if (y == -1)
        return -1;
    x = (x ^ y) * mult;
    /* the cast might truncate len; that doesn't change hash stability */
    mult += (Py_hash_t)(82520UL + len + len);
}
x += 97531UL;
if (x == (Py_uhash_t)-1)
    x = -2;
return x;

这是一种称为 FNV- 1(Fowler/Noll/Vo)哈希方法.

碰巧的是,该公式为(1, 0, -1)(1, -1, 0)产生了完全相同的输出:

As it happens, that formula produces the exact same output for (1, 0, -1) and (1, -1, 0):

>>> hash((1, -1, 0))
-2528505496374624146
>>> hash((1, 0, -1))
-2528505496374624146

因为包含的3个整数的哈希分别为10-2:

because the hashes for the 3 contained integers are 1, 0 and -2:

>>> hash(1)
1
>>> hash(0)
0
>>> hash(-1)
-2

并交换0-2对结果没有实际影响.

and swapping the 0 and the -2 has no actual influence on the outcome.

因此,两个示例中包含的3个元组的哈希值不会发生变化,因此最终的哈希值也不会发生变化.

So the hashes for the 3 contained tuples don't change between the two examples, so the final hash doesn't change either.

这只是一个巧合,我希望在实践中不会经常发生所有 ,并且字典和集合已经可以很好地处理冲突.

This is just a coincidence, and I'd expect that in practice this doesn't happen all that often and dictionaries and sets already can handle collisions just fine.

但是,在写完我的原始答案几年后,事实证明我的期望被放错了!上面的tuplehash()实现已经实施了14年,直到有人坚持认为存在 该计划有问题.事实证明,某些值 combinations (例如4-40.250.5)大大降低了该方法可能输出的可能的哈希值:

However, a few years after writing my original answer, it turns out that my expectation was misplaced! The above tuplehash() implementation was in place for 14 years, until someone insisted that there was a problem with the scheme. It turns out that certain value combinations (such as 4 and -4, or 0.25 and 0.5) drastically reduced the possible hash values the method could output:

>>> import sys; from itertools import product
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=7, releaselevel='final', serial=0)
>>> values = (0.25, 0.5)
>>> sum(1 for _ in product(values, repeat=20))  # 20 elements in each tuple
1048576
>>> len(set(map(hash, product(values, repeat=20))))
32

上面创建了所有1048576(2 ** 20 == 1024 ** 2)可能的20值元组,它们组合了0.250.5.理想情况下,它们都应具有不同的哈希值,或者至少具有大量不同的哈希值.但是上面的tuplehash()函数仅产生32个唯一值.这32个唯一的散列中的每个散列都适用于32768(2 ** 15)个这样的组合:

The above creates all 1048576 (2 ** 20 == 1024 ** 2) possible 20-value tuples that combine 0.25 and 0.5. Ideally, they all should have a different hash value, or at least have a very large number of different hash values. But the above tuplehash() function only produced 32 unique values. Each of those 32 unique hashes applies to 32768 (2 ** 15) such combinations:

>>> from collections import Counter
>>> Counter(Counter(map(hash, product(values, repeat=20))).values())
Counter({32768: 32})

这实际上是一个问题!上面的问题对于1, -1, 0也起作用,只是它没有那么明显.在这里使用3 ** 12 == 531441组合进行测试:

This is actually quite a big problem! The above issue also comes into play for 1, -1, 0, it just isn't as pronounced; testing here with 3 ** 12 == 531441 combinations:

>>> values = (1, -1, 0)
>>> sum(1 for _ in product(values, repeat=12))
531441
>>> len(set(map(hash, product(values, repeat=12))))
238605
>>> Counter(Counter(map(hash, product(values, repeat=12))).values())
Counter({1: 153005, 2: 51006, 4: 21730, 8: 8424, 16: 3012, 32: 994, 64: 314, 128: 92, 256: 20, 512: 6, 1024: 2})

因此为这12个元素的元组生成的哈希中的153005全部使用单个哈希.

so 153005 of the hashes produced for those 12-element tuples are all using a single hash.

因此,在Python 3.8中,实现已从FNV-1a切换到了改编 xxHash快速摘要方案.参见新的tuplehash()函数实现有关详细信息.

So in Python 3.8, the implementation was switched from FNV-1a to an adaptation of the xxHash fast digest scheme. See the new tuplehash() function implementation for details.

此新方法在您所提出问题的示例中表现出色:

This new method performs great on the examples from your question:

>>> sys.version_info
sys.version_info(major=3, minor=8, micro=1, releaselevel='final', serial=0)
>>> hash((1, -1, 0))
426056430309831993
>>> hash((1, 0, -1))
-7823806182320511195
>>> hash(((1, -1, 0), (1, 0, 0), (1, 0, -1)))
-6252168277346219339
>>> hash(((1, 0, -1), (1, 0, 0), (1, -1, 0)))
-5221381175350594014

以及我上面讨论的病理病例:

as well as the pathological cases I discussed above:

>>> values = (0.25, 0.5)
>>> len(set(map(hash, product(values, repeat=20))))
1048576
>>> values = (1, -1, 0)
>>> len(set(map(hash, product(values, repeat=12))))
531441

这篇关于在python中哈希不同的元组给出相同的结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆