在字典中覆盖Python的散列函数 [英] Overriding Python's Hashing Function in Dictionary

查看:112
本文介绍了在字典中覆盖Python的散列函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试为某些对象创建一个自定义哈希函数,这个对象将被放入字典中。散列函数是唯一的(不是Python的标准)。这对我来说非常重要:使用独特的功能。每个键的值是一个列表。



假设我覆盖 __ hash __ ,最后提出正确的哈希值一个东西。会:

  dict = {} 
dict [number_here] = value

将值哈希到位置号码 number_here ,或者仍然是在Python的哈希表会计算出这个数字?



打印 dict 只显示项目,而不是他们的位置那。但是,当我做 hash(4)时,结果是4.所以我假设这个意思是整数被哈希到他们各自的位置?



有人可以验证我的调查结果,还是向我解释,如果我错了?

解决方案

python dict 实现使用哈希值来基于密钥稀疏地存储值,并避免存储中的冲突。它使用 hash()作为起始点的结果,它不是确定的位置。



因此,虽然 hash(4)返回 4 ,但底层C中的确切位置结构也基于什么其他键已经存在,以及当前表的大小。根据需要调整python哈希表的大小(例如添加项目)。



由于dict没有订购,这不是您需要担心的事情,希望影响。如果您需要在dict中订购,请使用 collections.OrderedDict()实现,这将分别跟踪订单。



python哈希表实现的细节



您可能想要了解哈希表如何在维基百科; Python使用开放寻址来实现它。



在表中选择一个插槽时,将采用散列值(整数)的模数和当前的表大小,因此在大小为32的表上,所以密钥 45 ,哈希值 45 最初将存储在插槽14中。 / p>

如果有冲突(已经有其他存储在插槽14中的东西,它不是整数 45 ),那么插槽值被扰动直到找到一个空插槽,或者找到相同的键。扰动是用公式完成的:

  perturb = slot = hash 
while slot_is_full和item_in_slot_is_not_equal_to_key:
slot =(5 * slot)+ 1 + perturb
perturb>> = 5

所以,当有碰撞时,另外一个插槽是逐渐采取较小的步骤,直到扫描整个表格。请注意,如果需要,表格已经被调整大小以腾出空间。



为了使其正常工作,自定义类型需要一个 __ hash __() 方法需要实施 __ eq __() 以确定两个实例是否表示相同的键。匹配哈希值是不足够。对于 dict 实现,要考虑两个实例来表示完全相同的键,它们的哈希值都必须匹配,并且它们必须为 ==返回True 等号运算符。这样的对象被认为是可哈希的



(对于Python 2.x,实施 __ cmp __() hook 将代替实现 __ eq __();在Python 3中已经删除了对此的支持。


I am trying to create a custom hash function for some object that I'll be hashing into a dictionary. The hashing function is unique (not the standard Python one). This is very important to me: to use the unique function. Each key's value is a list.

Assuming I override __hash__ and end up coming up with the right hash number for an object. Would:

dict = {}
dict[number_here] = value

hash the value into the position number number_here, or would it still be at the position that Python's hash table would compute for that number?

Printing dict only shows the items and not which position they're that. However, when I do hash(4), the result is 4. So I'm assuming this means integers are hashed to their respective locations?

Could someone please verify my findings or explain to me if I'm wrong?

解决方案

The python dict implementation uses the hash value to both sparsely store values based on the key and to avoid collisions in that storage. It uses the result of hash() as a starting point, it is not the definitive position.

Thus, although hash(4) returns 4, the exact 'position' in the underlying C structure is also based on what other keys are already there, and how large the current table is. The python hash table is resized as needed (items added), for example.

Since a dict has no ordering, this is not something you need to worry about, or can hope to influence. If you need ordering in a dict, use the collections.OrderedDict() implementation instead, which keeps track of ordering separately.

The details of the python hash table implementation

You may want to read up on how hash tables work on Wikipedia; Python uses open addressing for it's implementation.

When selecting a slot in the table, the modulus of the hash value (an integer) and the current table size is taken, thus on a table of size 32, so the key 45, hash value 45 would initially be stored in slot 14.

If there is a collision (there already is something else stored in slot 14 and it's not the integer 45), then the slot value is perturbed until an empty slot is found or the same key is found instead. Perturbing is done with the formula:

perturb = slot = hash
while slot_is_full and item_in_slot_is_not_equal_to_key:
    slot = (5*slot) + 1 + perturb
    perturb >>= 5

So, when there is a collision, another slot is picked at progressively smaller steps until it scans the whole table. Note that the table will already have been resized to make space if necessary.

In order for this to work correctly, custom types need both a __hash__() method and need to implement __eq__() to determine if two instances represent the same key. Matching hash values is not enough. For the dict implementation to consider two instances to represent the exact same key, both their hash values must match, and they must return True for the == equality operator. Such objects are considered hashable.

(For Python 2.x, implementing the __cmp__() hook would do instead of implementing __eq__(); support for this has been removed in Python 3).

这篇关于在字典中覆盖Python的散列函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆