任意对象的python哈希函数的替代方案 [英] Alternative to python hash function for arbitrary objects

查看：79 发布时间：2021/6/26 20:37:31 python python-3.x python-2.x

本文介绍了任意对象的python哈希函数的替代方案的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

在 python2.7 中，我成功地使用 hash() 将对象放入永久存储在磁盘上的存储桶中.样机代码如下所示:

In python2.7, I'm successfully using hash() to place objects into buckets stored persistently on disk. A mockup code looks like this:

class PersistentDict(object):
  def __setitem__(self, key, value):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    self._store_to_bucket(bucket_index, key, value)

  def __getitem__(self, key):
    bucket_index = (hash(key)&0xffffffff) % self.bucket_count
    return self._fetch_from_bucket(bucket_index)[key]

在 python3 中，hash() 使用随机或固定盐，这使得它不可用/不适合这个 [1].显然，不可能对特定调用使用固定盐.所以，我需要一个替代方案:

In python3, hash() uses a random or fixed salt, which makes it unusable/suboptimal for this [1]. Apparently, it's not possible to use a fixed salt for specific invocations. So, I need an alternative:

必须在解释器调用中保持稳定
可能需要在执行时提供参数，例如在通话中设置盐
必须支持任意对象(dict/set 支持的任何对象)

Must be stable across interpreter invocations
May require parameters supplied at execution time, e.g. setting a salt in the call
Must support arbitrary objects (anything supported by dict/set)

我已经尝试使用 hashlib 中的哈希函数(慢！) 和来自 zlib 的校验和(显然不适合散列，但是meh)可以很好地处理字符串/字节.然而，它们仅适用于类似字节的对象，而 hash() 几乎适用于所有对象.

I've already tried using hash functions from hashlib (slow!) and checksums from zlib (apparently not ideal for hashing, but meh) which work fine with strings/bytes. However, they work only on bytes-like objects, whereas hash() works with almost everything.

[1] 使用 hash() 来识别桶是:

[1] Using hash() to identify buckets is either:

跨解释器调用不可靠，如果盐是随机的
防止应用程序使用随机加盐功能(如果盐是固定的)
如果使用不同的盐创建两个 PersistentDict 则不可用

推荐答案

我已经成功地使用了 hash 和 zlib.adler32 的组合.最直接的实现是这样的:

I've had success using a combination of hash and zlib.adler32. The most straightforward implementation is this:

def hashkey(obj, salt=0):
  """
  Create a key suitable for use in hashmaps

  :param obj: object for which to create a key
  :type: str, bytes, :py:class:`datetime.datetime`, object
  :param salt: an optional salt to add to the key value
  :type salt: int
  :return: numeric key to `obj`
  :rtype: int
  """
  if obj is None:
    return 0
  if isinstance(obj, str):
    return zlib.adler32(obj.encode(), salt) & 0xffffffff
  elif isinstance(obj, bytes):
    return zlib.adler32(obj, salt) & 0xffffffff
  elif isinstance(obj, datetime_type):
    return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff

在 Python 3.4.3 中，这比调用普通的 hash 慢得多，后者大约需要 0.07 微秒.对于常规对象，hashkey 使用 ~1.0 usec 代替.bytes 为 0.8 usec，str 为 0.7.

With Python 3.4.3, this is a lot slower than calling plain hash, which takes roughly 0.07 usec. For a regular object, hashkey takes ~1.0 usec instead. 0.8 usec for bytes and 0.7 for str.

开销大致如下:

0.1 usec 用于函数调用 (hash(obj) vs def pyhash(obj): return hash(obj))
0.2 usec 到 0.5 usec 用于通过 isinstance
0.75 usec for zlib.adler32 或 zlib.crc32 vs hash:~0.160 usec vs ~ 0.75 usec(adler 和 crc 是 +/- 4 使用 c)
0.15 usec 用于 str 对象的 obj.encode() ("foobar")
1.5 usec for str(obj).encode() of datetime.datetime 对象

0.1 usec for the function call (hash(obj) vs def pyhash(obj): return hash(obj))
0.2 usec to 0.5 usec for selecting the hash function via isinstance
0.75 usec for zlib.adler32 or zlib.crc32 vs hash: ~0.160 usec vs ~ 0.75 usec (adler and crc are +/- 4 usec)
0.15 usec for obj.encode() of str objects ("foobar")
1.5 usec for str(obj).encode() of datetime.datetime objects

最优化来自 if 语句的排序.如果大多数人期望使用普通对象，那么以下是我能想到的最快的方法:

The most optimization comes from ordering of the if statements. If one mostly expects plain objects, the following is the fastest I could come up with:

def hashkey_c(obj, salt=0):
  if obj.__class__ in hashkey_c.types:
    if obj is None:
      return 0
    if obj.__class__ is str:
      return zlib.adler32(obj.encode(), salt) & 0xffffffff
    elif obj.__class__ is bytes:
      return zlib.adler32(obj, salt) & 0xffffffff
    elif obj.__class__ is datetime_type:
      return zlib.adler32(str(obj).encode(), salt) & 0xffffffff
  return hash(obj) & 0xffffffff
hashkey_c.types = {str, bytes, datetime_type, type(None)}

总时间:str 和 bytes 约为 0.7 usec，datetime 非常糟糕，对象、int 等约为 0.35 usec.使用 adict 将类型映射到可比较的哈希，如果单独使用对 dict 键(又名类型)的显式检查(即不是 hashkey.dict_types 中的 obj.__class__ 但 obj.__class__ in hashkey.explicit_dict_types).

Total time: ~0.7 usec for str and bytes, abysmal for datetime, 0.35 usec for objects, ints, etc. Using a dict to map type to hash comparable, if one uses an explicit check on the dict keys (aka types) separately (i.e. not obj.__class__ in hashkey.dict_types but obj.__class__ in hashkey.explicit_dict_types).

一些附加说明:

__hash__

None

hash 在解释器启动时都不稳定
对于包含加盐类型的不可变容器(定义 __hash__)，它不能正常工作，例如(1, 2, '三')

hash is not stable across interpreter starts for any object using the default __hash__ implementation, including None
It does not work properly for immutable containers (which define __hash__) containing a salted type, e.g. (1, 2, 'three')

这篇关于任意对象的python哈希函数的替代方案的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

任意对象的python哈希函数的替代方案 [英] Alternative to python hash function for arbitrary objects

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

任意对象的python哈希函数的替代方案 [英] Alternative to python hash function for arbitrary objects

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭