这是对python内置哈希函数的适当使用吗? [英] Is this an appropriate use of python's built-in hash function?

查看:109
本文介绍了这是对python内置哈希函数的适当使用吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要比较大块数据是否相等,并且需要每秒比较许多对,快速.保证每个对象的长度相同,在可能的情况下,未知位置上可能存在细微的差异.

下面的时间表明,如果在数据的开头附近存在差异,则使用==运算符的速度非常快,而如果在结尾处存在差异,则使用速度显着降低.

>>> import os
>>> s = os.urandom(1600*1200 - 1)
>>> Aimg = b"A" + s
>>> Bimg = b"B" + s
>>> img1 = s + b"1"
>>> img2 = s + b"2"
>>> %timeit Aimg == Bimg
61.8 ns ± 0.484 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit img1 == img2
159 µs ± 2.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

在我的用例中,差异可能位于字节的中间或结尾(上下文:它是未压缩的图像数据).我寻找一种使用哈希或校验和加快处理速度的方法.使用md5的速度较慢,但​​是Python内置的hash实际上确实可以加快速度.

>>> %timeit img1 == img2
160 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit hash(img1) == hash(img2) and img1 == img2
236 ns ± 5.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

我对这种哈希的技术细节感兴趣,是否足够像哈希,以致当hash(a) == hash(b)很有可能是 ?如果哈希冲突相当少见,则误报是可以接受的,这是在平均情况下加快比较速度的快速途径.

解决方案

Python的哈希函数旨在提高速度,并映射到64位空间中.由于生日悖论,这意味着您可能会在大约50亿个条目(可能更早,因为散列函数不是加密的).另外,hash的精确定义取决于Python的实现,并且可能是特定于体系结构的,甚至是特定于机器的.如果您希望在多台计算机上获得相同的结果,请不要使用它.

md5被设计为加密散列函数;输入中即使有轻微的扰动也会完全改变输出.它还映射到一个128位空间,这使得除非您专门寻找一个碰撞,否则根本不会遇到碰撞.

如果您可以处理冲突(例如,可以通过使用MD5或SHA2等加密算法来测试存储桶中所有成员之间的相等性),那么Python的哈希函数就可以了.

另一件事:为了节省空间,如果将数据写入磁盘,则应以二进制形式存储数据. (即struct.pack('!q', hash('abc'))/hashlib.md5('abc').digest()).

作为旁注: is 不等同于Python中的==.您是说==.

I need to compare large chunks of data for equality, and I need to compare many pairs per second, fast. Each object is guaranteed to be the same length, it is possible and likely that there may only be slight differences located at unknown positions.

Timings below show that using == operator is very fast if there is a difference near the start of the data, and significantly slower if differences are located towards the end.

>>> import os
>>> s = os.urandom(1600*1200 - 1)
>>> Aimg = b"A" + s
>>> Bimg = b"B" + s
>>> img1 = s + b"1"
>>> img2 = s + b"2"
>>> %timeit Aimg == Bimg
61.8 ns ± 0.484 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit img1 == img2
159 µs ± 2.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In my use case, differences may be located towards the middle or the end of the bytes (context: it is uncompressed image data). I looked for a way to speed things up using a hash or checksum. Using md5 was slower but Python's built-in hash did actually speed things up.

>>> %timeit img1 == img2
160 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit hash(img1) == hash(img2) and img1 == img2
236 ns ± 5.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I'm interested in technical detail of this hash, is it sufficiently hash-like that when hash(a) == hash(b) then a == b is very likely? False positives are acceptable if a hash collision is reasonably rare, the intention is a fast-path to speed up comparisons in the average case.

解决方案

Python's hash function is designed for speed, and maps into a 64-bit space. Due to the birthday paradox, this means you'll likely get a collision at about 5 billion entries (probably way earlier, since the hash function is not cryptographical). Also, the precise definition of hash is up to the Python implementation, and may be architecture- or even machine-specific. Don't use it you want the same result on multiple machines.

md5 is designed as a cryptographic hash function; even slight perturbations in the input totally change the output. It also maps into a 128-bit space, which makes it unlikely you'll ever encounter a collision at all unless you're specifically looking for one.

If you can handle collisions (i.e. test for equality between all members in a bucket, possibly by using a cryptographic algorithm like MD5 or SHA2), Python's hash function is perfectly fine.

One more thing: To save space, you should store the data in binary form if you write it to disk. (i.e. struct.pack('!q', hash('abc')) / hashlib.md5('abc').digest()).

As a side note: is is not equivalent to == in Python. You mean ==.

这篇关于这是对python内置哈希函数的适当使用吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆