python中的快速、大宽度、非加密字符串散列 [英] fast, large-width, non-cryptographic string hashing in python

查看:20
本文介绍了python中的快速、大宽度、非加密字符串散列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要在 python 中使用高性能字符串散列函数来生成至少具有 34 位输出的整数(64 位是有意义的,但 32 位太少了).Stack Overflow 上还有其他几个类似的问题,但在我能找到的每个被接受/支持的答案中,都属于少数类别之一,这些类别不适用(出于给定的原因).

I have a need for a high-performance string hashing function in python that produces integers with at least 34 bits of output (64 bits would make sense, but 32 is too few). There are several other questions like this one on Stack Overflow, but of those every accepted/upvoted answer I could find fell in to one of a few categories, which don't apply (for the given reason.)

  • 使用内置的 hash() 函数.这个函数,至少在我正在开发的机器上(使用 python 2.7,和一个 64 位cpu) 生成一个适合 32 位的整数 - 对于我的目的来说不够大.
  • 使用 hashlib. hashlib 提供加密散列例程,这比用于非加密目的所需的慢.我认为这是不言而喻的,但如果您需要基准和引文来让您相信这一事实,那么我可以提供.
  • 使用 string.__hash__() 函数作为原型来编写您自己的函数.我怀疑这将是正确的方法,除了这个特定函数的效率在于它使用了 c_mul 函数,它包含 32 位 - 同样,对于我来说太小了!非常令人沮丧,它是如此接近完美!
  • Use the built-in hash() function. This function, at least on the machine I'm developing for (with python 2.7, and a 64-bit cpu) produces an integer that fits within 32 bits - not large enough for my purposes.
  • Use hashlib. hashlib provides cryptographic hash routines, which are far slower than they need to be for non-cryptographic purposes. I find this self-evident, but if you require benchmarks and citations to convince you of this fact then I can provide that.
  • Use the string.__hash__() function as a prototype to write your own function. I suspect this will be the correct way to go, except that this particular function's efficiency lies in its use of the c_mul function, which wraps around 32 bits - again, too small for my use! Very frustrating, it's so close to perfect!

理想的解决方案应具有以下属性,按重要性相对松散的顺序排列.

An ideal solution would have the following properties, in a relative, loose order of importance.

  1. 输出范围至少扩展 34 位,可能是 64 位,同时在 所有 位上保持一致 雪崩特性.(连接 32 位哈希往往会违反雪崩特性,至少在我的愚蠢示例中是这样.)
  2. 便携.在两台不同的机器上给定相同的输入字符串,我应该两次得到相同的结果.这些值将存储在一个文件中,以备日后重复使用.
  3. 高性能.越快越好,因为在我正在运行的程序的执行过程中,这个函数将被调用大约 200 亿次(这是目前对性能至关重要的代码.)它不需要用 C 编写,它真的只需要优于 md5(在字符串的内置 hash() 领域中的某个地方).
  4. 接受一个扰动"(这里用什么词更好?)整数作为输入来修改输出.我在下面放了一个例子(列表格式规则不允许我把它放在更近的地方.)我想这不是 100% 必要的,因为它可以通过手动扰动函数的输出来模拟,但是将它作为输入给了我好温暖的感觉.
  5. 完全用 Python 编写.如果绝对,肯定需要用 C 编写,那么我想这是可以做到的,但我会用 Python 编写一个比 C 中更快的函数慢 20% 的函数,这只是由于项目使用两种不同语言的协调头痛.是的,这是一个逃避,但这是这里的愿望清单.
  1. Have an output range extending at least 34 bits long, likely 64 bits, while preserving consistent avalanche properties over all bits. (Concatenating 32-bit hashes tends to violate the avalanche properties, at least with my dumb examples.)
  2. Portable. Given the same input string on two different machines, I should get the same result both times. These values will be stored in a file for later re-use.
  3. High-performance. The faster the better as this function will get called roughly 20 billion times during the execution of the program I'm running (it is the performance-critical code at the moment.) It doesn't need to be written in C, it really just needs to outperform md5 (somewhere in the realm of the built-in hash() for strings).
  4. Accept a 'perturbation' (what's the better word to use here?) integer as input to modify the output. I put an example below (the list formatting rules wouldn't let me place it nearer.) I suppose this isn't 100% necessary since it can be simulated by perturbing the output of the function manually, but having it as input gives me a nice warm feeling.
  5. Written entirely in Python. If it absolutely, positively needs to be written in C then I guess that can be done, but I'd take a 20% slower function written in python over the faster one in C, just due to project coordination headache of using two different languages. Yes, this is a cop-out, but this is a wish list here.

'Perturbed' 散列示例,其中散列值被一个小的整数值 n 剧烈改变

'Perturbed' hash example, where the hash value is changed drastically by a small integer value n

def perturb_hash(key,n):
    return hash((key,n))

最后,如果你对我到底在做什么感到好奇,我需要这样一个特定的哈希函数,我正在完全重写 pybloom 模块以显着提高其性能.我成功了(它现在运行速度提高了大约 4 倍并使用了大约 50% 的空间),但我注意到有时如果过滤器变得足够大,它的假阳性率会突然飙升.我意识到这是因为哈希函数没有处理足够的位.32 位只能处理 40 亿位(请注意,过滤器处理的是位而不是字节),而我用于基因组数据的一些过滤器会加倍或更多(因此最少需要 34 位.)

Finally, if you're curious as to what the heck I'm doing that I need such a specific hash function, I'm doing a complete re-write of the pybloom module to enhance its performance considerably. I succeeded at that (it now runs about 4x faster and uses about 50% of the space) but I noticed that sometimes if the filter got large enough it was suddenly spiking in false-positive rates. I realized it was because the hash function wasn't addressing enough bits. 32 bits can only address 4 billion bits (mind you, the filter addresses bits and not bytes) and some of the filters I'm using for genomic data double that or more (hence 34 bit minimum.)

谢谢!

推荐答案

看看 MurmurHash3 的 128 位变体.算法页面包含一些性能数据.应该可以将其移植到 Python,纯的或作为 C 扩展.(更新作者建议使用 128 位变体并丢弃您不需要的位).

Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. Should be possible to port this to Python, pure or as a C extension. (Updated the author recommends using the 128-bit variant and throwing away the bits you don't need).

如果 MurmurHash2 64 位适合您,pyfasthash 包中有一个 Python 实现(C 扩展),其中包括一些其他非加密哈希变体,尽管其中一些仅提供 32 位输出.

If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.

更新 我为 Murmur3 哈希函数做了一个快速的 Python 包装.Github 项目在这里,您可以在 Python 包索引以及;它只需要一个 C++ 编译器来构建;无需提升.

Update I did a quick Python wrapper for the Murmur3 hash function. Github project is here and you can find it on Python Package Index as well; it just needs a C++ compiler to build; no Boost required.

使用示例和时序对比:

import murmur3
import timeit

# without seed
print murmur3.murmur3_x86_64('samplebias')
# with seed value
print murmur3.murmur3_x86_64('samplebias', 123)

# timing comparison with str __hash__
t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3")
print 'murmur3:', t.timeit()

t = timeit.Timer("str.__hash__('hello')")
print 'str.__hash__:', t.timeit()

输出:

15662901497824584782
7997834649920664675
murmur3: 0.264422178268
str.__hash__: 0.219163894653

这篇关于python中的快速、大宽度、非加密字符串散列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆