任意对象的python哈希函数的替代方案 [英] Alternative to python hash function for arbitrary objects
问题描述
在 python2.7 中,我成功地使用 hash()
将对象放入永久存储在磁盘上的存储桶中.样机代码如下所示:
In python2.7, I'm successfully using hash()
to place objects into buckets stored persistently on disk. A mockup code looks like this:
class PersistentDict(object):
def __setitem__(self, key, value):
bucket_index = (hash(key)&0xffffffff) % self.bucket_count
self._store_to_bucket(bucket_index, key, value)
def __getitem__(self, key):
bucket_index = (hash(key)&0xffffffff) % self.bucket_count
return self._fetch_from_bucket(bucket_index)[key]
在 python3 中,hash()
使用随机或固定盐,这使得它不可用/不适合这个 [1].显然,不可能对特定调用使用固定盐.所以,我需要一个替代方案:
In python3, hash()
uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations. So, I need an alternative:
- 必须在解释器调用中保持稳定
- 可能需要在执行时提供参数,例如在通话中设置盐
- 必须支持任意对象(
dict
/set
支持的任何对象)
- Must be stable across interpreter invocations
- May require parameters supplied at execution time, e.g. setting a salt in the call
- Must support arbitrary objects (anything supported by
dict
/set
)
我已经尝试使用 hashlib 中的哈希函数(慢!) 和来自 zlib 的校验和(显然不适合散列,但是meh)可以很好地处理字符串/字节.然而,它们仅适用于类似字节的对象,而 hash()
几乎适用于所有对象.
I've already tried using hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh) which work fine with strings/bytes. However, they work only on bytes-like objects, whereas hash()
works with almost everything.
[1] 使用 hash()
来识别桶是:
[1] Using hash()
to identify buckets is either:
- 跨解释器调用不可靠,如果盐是随机的
- 防止应用程序使用随机加盐功能(如果盐是固定的)
- 如果使用不同的盐创建两个
PersistentDict
则不可用
推荐答案
我已经成功地使用了 hash
和 zlib.adler32
的组合.最直接的实现是这样的:
I've had success using a combination of hash
and zlib.adler32
. The most straightforward implementation is this:
def hashkey(obj, salt=0):
"""
Create a key suitable for use in hashmaps
:param obj: object for which to create a key
:type: str, bytes, :py:class:`datetime.datetime`, object
:param salt: an optional salt to add to the key value
:type salt: int
:return: numeric key to `obj`
:rtype: int
"""
if obj is None:
return 0
if isinstance(obj, str):
return zlib.adler32(obj.encode(), salt) & 0xffffffff
elif isinstance(obj, bytes):
return zlib.adler32(obj, salt) & 0xffffffff
elif isinstance(obj, datetime_type):
return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
return hash(obj) & 0xffffffff
在 Python 3.4.3 中,这比调用普通的 hash
慢得多,后者大约需要 0.07 微秒.对于常规对象,hashkey
使用 ~1.0 usec 代替.bytes
为 0.8 usec,str
为 0.7.
With Python 3.4.3, this is a lot slower than calling plain hash
, which takes roughly 0.07 usec. For a regular object, hashkey
takes ~1.0 usec instead. 0.8 usec for bytes
and 0.7 for str
.
开销大致如下:
- 0.1 usec 用于函数调用 (
hash(obj)
vsdef pyhash(obj): return hash(obj)
) - 0.2 usec 到 0.5 usec 用于通过
isinstance
选择哈希函数 - 0.75 usec for
zlib.adler32
或zlib.crc32
vshash
:~0.160 usec vs ~ 0.75 usec(adler 和 crc 是 +/- 4 使用 c) - 0.15 usec 用于
str
对象的obj.encode()
("foobar"
) - 1.5 usec for
str(obj).encode()
ofdatetime.datetime
对象
- 0.1 usec for the function call (
hash(obj)
vsdef pyhash(obj): return hash(obj)
) - 0.2 usec to 0.5 usec for selecting the hash function via
isinstance
- 0.75 usec for
zlib.adler32
orzlib.crc32
vshash
: ~0.160 usec vs ~ 0.75 usec (adler and crc are +/- 4 usec) - 0.15 usec for
obj.encode()
ofstr
objects ("foobar"
) - 1.5 usec for
str(obj).encode()
ofdatetime.datetime
objects
最优化来自 if
语句的排序.如果大多数人期望使用普通对象,那么以下是我能想到的最快的方法:
The most optimization comes from ordering of the if
statements. If one mostly expects plain objects, the following is the fastest I could come up with:
def hashkey_c(obj, salt=0):
if obj.__class__ in hashkey_c.types:
if obj is None:
return 0
if obj.__class__ is str:
return zlib.adler32(obj.encode(), salt) & 0xffffffff
elif obj.__class__ is bytes:
return zlib.adler32(obj, salt) & 0xffffffff
elif obj.__class__ is datetime_type:
return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
return hash(obj) & 0xffffffff
hashkey_c.types = {str, bytes, datetime_type, type(None)}
总时间:str
和 bytes
约为 0.7 usec,datetime
非常糟糕,对象、int 等约为 0.35 usec.使用 adict
将类型映射到可比较的哈希,如果单独使用对 dict
键(又名类型)的显式检查(即不是 hashkey.dict_types 中的 obj.__class__
但 obj.__class__ in hashkey.explicit_dict_types
).
Total time: ~0.7 usec for str
and bytes
, abysmal for datetime
, 0.35 usec for objects, ints, etc. Using a dict
to map type to hash comparable, if one uses an explicit check on the dict
keys (aka types) separately (i.e. not obj.__class__ in hashkey.dict_types
but obj.__class__ in hashkey.explicit_dict_types
).
一些附加说明:
- 对于使用默认
hash
在解释器启动时都不稳定- 对于包含加盐类型的不可变容器(定义
__hash__
),它不能正常工作,例如(1, 2, '三')
__hash__
实现的任何对象,包括 None
,hash
is not stable across interpreter starts for any object using the default__hash__
implementation, includingNone
- It does not work properly for immutable containers (which define
__hash__
) containing a salted type, e.g.(1, 2, 'three')
这篇关于任意对象的python哈希函数的替代方案的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!