Python frozenset hashing algorithm / implementation

Problem description

I'm currently trying to understand the mechanism behind the hash function defined for Python's built-in frozenset data type. The implementation is shown at the bottom for reference. What I'm interested in particular is the rationale for the choice of this scattering operation:

lambda h: (h ^ (h << 16) ^ 89869747) * 3644798167

where h is the hash of each element. Does anyone know where these came from? (That is, was there any particular reason to pick these numbers?) Or were they simply chosen arbitrarily?
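(The operation is easy to poke at in isolation. A tiny sketch of my own, with the result masked to 32 bits just to keep the printout short, shows how closely spaced inputs come out with very different bit patterns; scatter is just a throwaway name:)

scatter = lambda h: (h ^ (h << 16) ^ 89869747) * 3644798167
for h in (1, 2, 3):
    print bin(scatter(h) & 0xFFFFFFFF)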


Here is the snippet from the official CPython implementation,

static Py_hash_t
frozenset_hash(PyObject *self)
{
    PySetObject *so = (PySetObject *)self;
    Py_uhash_t h, hash = 1927868237UL;
    setentry *entry;
    Py_ssize_t pos = 0;

    if (so->hash != -1)
        return so->hash;

    hash *= (Py_uhash_t)PySet_GET_SIZE(self) + 1;
    while (set_next(so, &pos, &entry)) {
        /* Work to increase the bit dispersion for closely spaced hash
           values.  This is important because some use cases have many
           combinations of a small number of elements with nearby
           hashes so that many distinct combinations collapse to only
           a handful of distinct hash values. */
        h = entry->hash;
        hash ^= (h ^ (h << 16) ^ 89869747UL)  * 3644798167UL;
    }
    hash = hash * 69069U + 907133923UL;
    if (hash == -1)
        hash = 590923713UL;
    so->hash = hash;
    return hash;
}

and an equivalent implementation in Python:

def _hash(self):
    MAX = sys.maxint
    MASK = 2 * MAX + 1
    n = len(self)
    h = 1927868237 * (n + 1)
    h &= MASK
    for x in self:
        hx = hash(x)
        h ^= (hx ^ (hx << 16) ^ 89869747)  * 3644798167
        h &= MASK
    h = h * 69069 + 907133923
    h &= MASK
    if h > MAX:
        h -= MASK + 1
    if h == -1:
        h = 590923713
    return h
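(A side note of mine: the & MASK steps emulate the C code's unsigned wraparound, since Py_uhash_t arithmetic in frozenset_hash simply overflows. Also, sys.maxint exists only in Python 2; on Python 3 the same mask can be built from sys.maxsize, assuming it matches the platform's Py_ssize_t width:)

import sys
MASK = 2 * sys.maxsize + 1   # e.g. 2**64 - 1 on a 64-bit build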

Solution

Unless Raymond Hettinger (the code's author) chimes in, we'll never know for sure ;-) But there's usually less "science" in these things than you might expect: you take some general principles, and a test suite, and fiddle the constants almost arbitrarily until the results look "good enough".

Some general principles "obviously" at work here:

  1. To get the desired quick "bit dispersion", you want to multiply by a large integer. Since CPython's hash result has to fit in 32 bits on many platforms, an integer that requires 32 bits is best for this. And, indeed, (3644798167).bit_length() == 32.

  2. To avoid systematically losing the low-order bit(s), you want to multiply by an odd integer. 3644798167 is odd.

  3. More generally, to avoid compounding patterns in the input hashes, you want to multiply by a prime. And 3644798167 is prime.

  4. And you also want a multiplier whose binary representation doesn't have obvious repeating patterns. bin(3644798167) == '0b11011001001111110011010011010111'. That's pretty messed up, which is a good thing ;-)
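(All four properties above are cheap to verify for yourself; a quick check of my own, where m is just a throwaway name:)

m = 3644798167
print m.bit_length()                                    # 32: needs a full 32 bits
print m % 2                                             # 1: odd
print all(m % p for p in xrange(2, int(m ** 0.5) + 1))  # True: prime, by trial division
print bin(m)                                            # no obvious repeating pattern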

The other constants look utterly arbitrary to me. The

if h == -1:
    h = 590923713

part is needed for another reason: internally, CPython takes a -1 return value from an integer-valued C function as meaning "an exception needs to be raised"; i.e., it's an error return. So you'll never see a hash code of -1 for any object in CPython. The value returned instead of -1 is wholly arbitrary - it just needs to be the same value (instead of -1) each time.
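(You can see that reservation at work directly; in CPython even the int -1 gets remapped:)

>>> hash(-1)    # -1 would look like the C-level error return, so it's changed
-2
>>> hash(-2)
-2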

EDIT: playing around

I don't know what Raymond used to test this. Here's what I would have used: look at hash statistics for all subsets of a set of consecutive integers. Those are problematic because hash(i) == i for a great many integers i.

>>> all(hash(i) == i for i in range(1000000))
True

Simply xor'ing hashes together will yield massive cancellation on inputs like that.
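(For instance, because hash(i) == i here, whole subsets can cancel down to the same value:)

>>> hash(1) ^ hash(2) ^ hash(3)   # {1, 2, 3} collapses to 0 ...
0
>>> hash(4) ^ hash(5) ^ hash(1)   # ... and so does {1, 4, 5}
0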

So here's a little function to generate all subsets, and another to do a dirt-simple xor across all hash codes:

def hashxor(xs):
    h = 0
    for x in xs:
        h ^= hash(x)
    return h

def genpowerset(xs):
    from itertools import combinations
    for length in range(len(xs) + 1):
        for t in combinations(xs, length):
            yield t
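(A sanity check of mine: a set of n elements has 2**n subsets, which is where the 1048576 totals below come from:)

>>> sum(1 for _ in genpowerset(range(20)))
1048576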

Then a driver, and a little function to display collision statistics:

def show_stats(d):
    total = sum(d.values())
    print "total", total, "unique hashes", len(d), \
          "collisions", total - len(d)

def drive(n, hasher=hashxor):
    from collections import defaultdict
    d = defaultdict(int)

    for t in genpowerset(range(n)):
        d[hasher(t)] += 1
    show_stats(d)
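(One small point worth noting: although _hash above is written as a method, it only uses len() and iteration, so it can be passed the tuples genpowerset yields directly, as in drive(20, _hash) below.)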

Using the dirt-simple hasher is disastrous:

>>> drive(20)
total 1048576 unique hashes 32 collisions 1048544

Yikes! OTOH, using the _hash() designed for frozensets does a perfect job in this case:

>>> drive(20, _hash)
total 1048576 unique hashes 1048576 collisions 0

Then you can play with that to see what does - and doesn't - make a real difference in _hash(). For example, it still does a perfect job on these inputs if

    h = h * 69069 + 907133923

is removed. And I have no idea why that line is there. Similarly, it continues to do a perfect job on these inputs if the ^ 89869747 in the inner loop is removed - don't know why that's there either. And initialization can be changed from:

    h = 1927868237 * (n + 1)

to:

    h = n

without harm here too. That all jibes with what I expected: it's the multiplicative constant in the inner loop that's crucial, for reasons already explained. For example, add 1 to it (use 3644798168) and then it's no longer prime or odd, and the stats degrade to:

total 1048576 unique hashes 851968 collisions 196608

Still quite usable, but definitely worse. Change it to a small prime, like 13, and it's worse:

total 1048576 unique hashes 483968 collisions 564608

Use a multiplier with an obvious binary pattern, like 0b01010101010101010101010101010101, and worse again:

total 1048576 unique hashes 163104 collisions 885472

Play around! These things are fun :-)
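(To make that kind of experimentation easy, here is a parameterized variant of my own - a sketch, not part of the original answer - where each constant can be replaced or knocked out independently:)

import sys

def make_hasher(mult=3644798167, xor_const=89869747,
                post_mult=69069, post_add=907133923,
                init=1927868237):
    # Build a _hash() lookalike with pluggable constants (a sketch of
    # mine; Python 2 - substitute sys.maxsize on Python 3).
    MAX = sys.maxint
    MASK = 2 * MAX + 1

    def hasher(xs):
        h = (init * (len(xs) + 1)) & MASK
        for x in xs:
            hx = hash(x)
            h ^= (hx ^ (hx << 16) ^ xor_const) * mult
            h &= MASK
        h = (h * post_mult + post_add) & MASK
        if h > MAX:
            h -= MASK + 1
        if h == -1:
            h = 590923713
        return h

    return hasher

Then, for example, drive(20, make_hasher(mult=3644798168)) reproduces the "add 1 to the multiplier" experiment above, and make_hasher(post_mult=1, post_add=0) knocks out the final mixing step.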
