Fast, large-width, non-cryptographic string hashing in Python


Problem description

I need a high-performance string hashing function in Python that produces integers with at least 34 bits of output (64 bits would make sense, but 32 is too few). There are several other questions like this one on Stack Overflow, but every accepted or upvoted answer I could find fell into one of a few categories, none of which apply (for the reasons given).

  • Use the built-in hash() function. This function, at least on the machine I'm developing for (with Python 2.7 and a 64-bit CPU), produces an integer that fits within 32 bits, which is not large enough for my purposes.
  • Use hashlib. hashlib provides cryptographic hash routines, which are far slower than they need to be for non-cryptographic purposes. I find this self-evident, but if you require benchmarks and citations to convince you of this fact then I can provide that.
  • Use the string.__hash__() function as a prototype to write your own function. I suspect this will be the correct way to go, except that this particular function's efficiency lies in its use of the c_mul function, which wraps around at 32 bits. Again, too small for my use! Very frustrating, because it's so close to perfect!
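
As a rough illustration of the third option, the Python 2 string hash (shift the first byte left by 7, multiply by 1000003 and XOR in each byte, then XOR in the length) can be widened by wrapping at 64 bits instead of 32. This is only a sketch: the name str_hash64 is hypothetical, it assumes Python 3 and UTF-8 input, it returns an unsigned value, and it makes no claim about avalanche quality or speed.

```python
MASK64 = (1 << 64) - 1  # wrap arithmetic at 64 bits instead of 32


def str_hash64(s):
    """Python 2 style multiplicative string hash, wrapped at 64 bits.

    Hypothetical helper for illustration; returns an unsigned integer.
    """
    data = s.encode("utf-8") if isinstance(s, str) else bytes(s)
    if not data:
        return 0
    x = (data[0] << 7) & MASK64
    for byte in data:
        # The masked multiply plays the role of c_mul, at 64-bit width.
        x = ((1000003 * x) & MASK64) ^ byte
    x ^= len(data)
    return x


print(str_hash64("samplebias"))
```

Because every step is plain integer arithmetic with an explicit mask, the result does not depend on the platform's word size, which is what the portability requirement needs.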

An ideal solution would have the following properties, in a relative, loose order of importance.

  1. Have an output range extending at least 34 bits long, likely 64 bits, while preserving consistent avalanche properties over all bits. (Concatenating 32-bit hashes tends to violate the avalanche properties, at least with my dumb examples.)
  2. Portable. Given the same input string on two different machines, I should get the same result both times. These values will be stored in a file for later re-use.
  3. High-performance. The faster the better as this function will get called roughly 20 billion times during the execution of the program I'm running (it is the performance-critical code at the moment.) It doesn't need to be written in C, it really just needs to outperform md5 (somewhere in the realm of the built-in hash() for strings).
  4. Accept a 'perturbation' (what's the better word to use here?) integer as input to modify the output. I put an example below (the list formatting rules wouldn't let me place it nearer.) I suppose this isn't 100% necessary since it can be simulated by perturbing the output of the function manually, but having it as input gives me a nice warm feeling.
  5. Written entirely in Python. If it absolutely, positively needs to be written in C then I guess that can be done, but I'd take a 20% slower function written in Python over the faster one in C, just because of the project coordination headache of using two different languages. Yes, this is a cop-out, but this is a wish list.

'Perturbed' hash example, where the hash value is changed drastically by a small integer value n:

def perturb_hash(key,n):
    return hash((key,n))
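
One way to get the same interface without relying on the built-in hash() is a seeded 64-bit FNV-1a, sketched below. FNV-1a is swapped in here purely as a simple, well-known, portable function; it is not what the question or the answer uses, and its avalanche behaviour is weaker than MurmurHash's. Folding n into the initial state is an assumption made for illustration, and the name perturb_hash64 is hypothetical.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK64 = (1 << 64) - 1


def perturb_hash64(key, n=0):
    """64-bit FNV-1a over the bytes of `key`, perturbed by the integer `n`."""
    data = key.encode("utf-8") if isinstance(key, str) else bytes(key)
    # Assumption: XOR the perturbation into the starting state.
    h = (FNV64_OFFSET_BASIS ^ n) & MASK64
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


print(perturb_hash64("samplebias"))
print(perturb_hash64("samplebias", 123))
```

With n left at 0 this is plain FNV-1a, so the values can be checked against published FNV test vectors; a pure-Python loop like this will be far slower than a C implementation.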

Finally, if you're curious as to what the heck I'm doing that I need such a specific hash function, I'm doing a complete re-write of the pybloom module to enhance its performance considerably. I succeeded at that (it now runs about 4x faster and uses about 50% of the space) but I noticed that sometimes if the filter got large enough it was suddenly spiking in false-positive rates. I realized it was because the hash function wasn't addressing enough bits. 32 bits can only address 4 billion bits (mind you, the filter addresses bits and not bytes) and some of the filters I'm using for genomic data double that or more (hence 34 bit minimum.)
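
The bit-width arithmetic behind that 34-bit minimum can be checked with a short sketch. The helper names address_bits and bit_index are hypothetical, and the filter sizes are example values rather than the sizes of the actual filters.

```python
def address_bits(num_filter_bits):
    """Smallest hash width (in bits) able to index every bit of the filter."""
    return (num_filter_bits - 1).bit_length()


def bit_index(hash_value, num_filter_bits):
    """Map a hash value onto a bit position inside the filter."""
    return hash_value % num_filter_bits


four_billion_bits = 2 ** 32  # what a 32-bit hash can address

print(address_bits(four_billion_bits))          # 32
print(address_bits(2 * four_billion_bits))      # 33
print(address_bits(2 * four_billion_bits + 1))  # 34

# A 32-bit hash can never reach the upper part of a larger filter:
largest_32_bit_hash = 2 ** 32 - 1
print(bit_index(largest_32_bit_hash, 2 * four_billion_bits + 1))  # 4294967295
```

When the hash is narrower than the filter, part of the bit array is never set or tested, so the reachable part fills up faster than the filter's sizing assumed. That is consistent with the spike in false positives described above.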

Thanks!

Solution

Take a look at the 128-bit variant of MurmurHash3. The algorithm's page includes some performance numbers. It should be possible to port this to Python, either in pure Python or as a C extension. (Update: the author recommends using the 128-bit variant and throwing away the bits you don't need.)
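
Throwing away unneeded bits amounts to masking the integer result. A minimal sketch follows, using a made-up 128-bit constant in place of a real MurmurHash3 output; the helper name truncate_hash is hypothetical.

```python
def truncate_hash(value, bits):
    """Keep only the lowest `bits` bits of an integer hash value."""
    return value & ((1 << bits) - 1)


# Made-up 128-bit value standing in for a real 128-bit hash result.
hash128 = 0x0123456789ABCDEF0123456789ABCDEF

hash64 = truncate_hash(hash128, 64)
hash34 = truncate_hash(hash128, 34)

print(hex(hash64))  # 0x123456789abcdef
print(hex(hash34))  # 0x389abcdef
```

Keeping the low bits is an arbitrary choice here; for a hash with good avalanche behaviour, any fixed subset of the output bits should serve equally well.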

If MurmurHash2 64-bit works for you, there is a Python implementation (C extension) in the pyfasthash package, which includes a few other non-cryptographic hash variants, though some of these only offer 32-bit output.

Update: I wrote a quick Python wrapper for the Murmur3 hash function. The GitHub project is here, and you can find it on the Python Package Index as well; it just needs a C++ compiler to build, and no Boost is required.

Usage example and timing comparison:

import murmur3
import timeit

# without seed
print murmur3.murmur3_x86_64('samplebias')
# with seed value
print murmur3.murmur3_x86_64('samplebias', 123)

# timing comparison with str __hash__
t = timeit.Timer("murmur3.murmur3_x86_64('hello')", "import murmur3")
print 'murmur3:', t.timeit()

t = timeit.Timer("str.__hash__('hello')")
print 'str.__hash__:', t.timeit()

Output:

15662901497824584782
7997834649920664675
murmur3: 0.264422178268
str.__hash__: 0.219163894653
