任意对象的python哈希函数的替代方案 [英] Alternative to python hash function for arbitrary objects

查看:79
本文介绍了任意对象的python哈希函数的替代方案的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 python2.7 中,我成功地使用 hash() 将对象放入永久存储在磁盘上的存储桶中.样机代码如下所示:

In python2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. A mockup code looks like this:

class PersistentDict(object):
  def __setitem__(self, key, value):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    self._store_to_bucket(bucket_index, key, value)

  def __getitem__(self, key):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    return self._fetch_from_bucket(bucket_index)[key]

在 python3 中,hash() 使用随机或固定盐,这使得它不可用/不适合这个 [1].显然,不可能对特定调用使用固定盐.所以,我需要一个替代方案:

In python3, hash() uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations. So, I need an alternative:

  • 必须在解释器调用中保持稳定
  • 可能需要在执行时提供参数,例如在通话中设置盐
  • 必须支持任意对象(dict/set 支持的任何对象)
  • Must be stable across interpreter invocations
  • May require parameters supplied at execution time, e.g. setting a salt in the call
  • Must support arbitrary objects (anything supported by dict/set)

我已经尝试使用 hashlib 中的哈希函数(慢!) 和来自 zlib 的校验和(显然不适合散列,但是meh)可以很好地处理字符串/字节.然而,它们适用于类似字节的对象,而 hash() 几乎适用于所有对象.

I've already tried using hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh) which work fine with strings/bytes. However, they work only on bytes-like objects, whereas hash() works with almost everything.

[1] 使用 hash() 来识别桶是:

[1] Using hash() to identify buckets is either:

  • 跨解释器调用不可靠,如果盐是随机的
  • 防止应用程序使用随机加盐功能(如果盐是固定的)
  • 如果使用不同的盐创建两个 PersistentDict 则不可用

推荐答案

我已经成功地使用了 hashzlib.adler32 的组合.最直接的实现是这样的:

I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:

def hashkey(obj, salt=0):
  """
  Create a key suitable for use in hashmaps

  :param obj: object for which to create a key
  :type: str, bytes, :py:class:`datetime.datetime`, object
  :param salt: an optional salt to add to the key value
  :type salt: int
  :return: numeric key to `obj`
  :rtype: int
  """
  if obj is None:
    return 0
  if isinstance(obj, str):
    return zlib.adler32(obj.encode(), salt) & 0xffffffff
  elif isinstance(obj, bytes):
    return zlib.adler32(obj, salt) & 0xffffffff
  elif isinstance(obj, datetime_type):
    return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff

在 Python 3.4.3 中,这比调用普通的 hash 慢得多,后者大约需要 0.07 微秒.对于常规对象,hashkey 使用 ~1.0 usec 代替.bytes 为 0.8 usec,str 为 0.7.

With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead. 0.8 usec for bytes and 0.7 for str.

开销大致如下:

  • 0.1 usec 用于函数调用 (hash(obj) vs def pyhash(obj): return hash(obj))
  • 0.2 usec 到 0.5 usec 用于通过 isinstance
  • 选择哈希函数
  • 0.75 usec for zlib.adler32zlib.crc32 vs hash:~0.160 usec vs ~ 0.75 usec(adler 和 crc 是 +/- 4 使用 c)
  • 0.15 usec 用于 str 对象的 obj.encode() ("foobar")
  • 1.5 usec for str(obj).encode() of datetime.datetime 对象
  • 0.1 usec for the function call (hash(obj) vs def pyhash(obj): return hash(obj))
  • 0.2 usec to 0.5 usec for selecting the hash function via isinstance
  • 0.75 usec for zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~ 0.75 usec (adler and crc are +/- 4 usec)
  • 0.15 usec for obj.encode() of str objects ("foobar")
  • 1.5 usec for str(obj).encode() of datetime.datetime objects

最优化来自 if 语句的排序.如果大多数人期望使用普通对象,那么以下是我能想到的最快的方法:

The most optimization comes from ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:

def hashkey_c(obj, salt=0):
  if obj.__class__ in hashkey_c.types:
    if obj is None:
      return 0
    if obj.__class__ is str:
      return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif obj.__class__ is bytes:
      return zlib.adler32(obj, salt) & 0xffffffff
    elif obj.__class__ is datetime_type:
      return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff
hashkey_c.types = {str, bytes, datetime_type, type(None)}

总时间:strbytes 约为 0.7 usec,datetime 非常糟糕,对象、int 等约为 0.35 usec.使用 adict 将类型映射到可比较的哈希,如果单独使用对 dict 键(又名类型)的显式检查(即不是 hashkey.dict_types 中的 obj.__class__obj.__class__ in hashkey.explicit_dict_types).

Total time: ~0.7 usec for str and bytes, abysmal for datetime, 0.35 usec for objects, ints, etc. Using a dict to map type to hash comparable, if one uses an explicit check on the dict keys (aka types) separately (i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types).

一些附加说明:

    对于使用默认 __hash__ 实现的任何对象,包括 None
  • hash 在解释器启动时都不稳定
  • 对于包含加盐类型的不可变容器(定义 __hash__),它不能正常工作,例如(1, 2, '三')
  • hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
  • It does not work properly for immutable containers (which define __hash__) containing a salted type, e.g. (1, 2, 'three')

这篇关于任意对象的python哈希函数的替代方案的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆