Can one construct a "good" hash function using CRC32C as a base?


Question


Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this, only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that?

Update: How about this? Only 16 bits are suitable for a hash value. Fine. If your table has 65535 entries or fewer, then great. If not, run the CRC value through the Nehalem POPCNT (population count) instruction to get the number of bits set. Then, use that as an index into an array of tables. This works if your table has fewer than about 1 million entries. I'd bet that's cheaper/faster than the best-performing hash functions. Now that GCC 4.5 has a CRC32 intrinsic it should be easy to test... if only I had the copious spare time to work on it.

David
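
As a rough illustration of the POPCNT idea in the update above, here is a minimal Python sketch. It assumes a software, bit-by-bit CRC32C as a stand-in for the SSE 4.2 instruction. The names `crc32c`, `popcount`, and `two_level_index`, and the choice of the low 16 bits as the slot within a sub-table, are hypothetical and not taken from the original post.

```python
# Sketch only: software CRC32C standing in for the hardware instruction.

NUM_TABLES = 33            # a 32-bit value has between 0 and 32 bits set
SLOTS_PER_TABLE = 1 << 16  # the 16 bits assumed to be well distributed


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def popcount(value: int) -> int:
    """Number of set bits, i.e. what POPCNT would return."""
    return bin(value).count("1")


def two_level_index(key: bytes):
    """Return (sub-table number, slot) for a key.

    Assumption: the slot is taken from the low 16 bits of the CRC and the
    sub-table is selected by the population count of the full 32-bit CRC.
    """
    crc = crc32c(key)
    return popcount(crc), crc & (SLOTS_PER_TABLE - 1)


if __name__ == "__main__":
    table, slot = two_level_index(b"example key")
    print(f"sub-table {table} of {NUM_TABLES}, slot {slot}")
```

Note that the population count of a uniformly distributed 32-bit value follows a binomial distribution centered on 16, so under this scheme most keys would likely land in the few sub-tables near the middle. Measuring the actual load per sub-table would be a sensible first test.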

Solution

Revisited, August 2014
Prompted by Arnaud Bouchez in a recent comment, and in view of other answers and comments, I acknowledge that the original answer needs to be altered or, at the least, qualified. I left the original as-is, at the end, for reference.

First, and maybe most important, a fair answer to the question depends on the intended use of the hash code: What does one mean by "good" [hash function...]? Where/how will the hash be used? (E.g. is it for hashing a relatively short input key? Is it for indexing/lookup purposes, to produce message digests, or yet other uses? How long is the desired hash code itself: all 32 bits [of CRC32 or derivatives thereof], more bits, fewer, etc.?)
The OP's question calls for "a faster general-purpose hash function", so the focus is on SPEED (something less CPU-intensive and/or something which can make use of parallel processing of various kinds). We may note here that the computation time for the hash code itself is often only part of the problem in an application of hashing (for example, if the size of the hash code or its intrinsic characteristics result in many collisions, extra cycles are required to deal with them). Also, the requirement for "general purpose" leaves many questions as to the possible uses.

With this in mind, a short and better answer is, maybe:

Yes, the hardware implementations of CRC32C on newer Intel processors can be used to build faster hash codes; beware, however, that depending on the specific implementation of the hash and on its application, the overall results may be sub-optimal because of the frequency of collisions or the need to use longer codes. Also, for sure, cryptographic uses of the hash should be carefully vetted because the CRC32 algorithm itself is very weak in this regard.

The original answer cited an article on Evaluating Hash functions by Bret Mulvey and, as pointed out in Mdlg's answer, the conclusions of this article are erroneous with regard to CRC32, as the implementation of CRC32 it was based on was buggy/flawed. Despite this major error with regard to CRC32, the article provides useful guidance as to the properties of hash algorithms in general. The URL to this article is now defunct; I found it on archive.today but I don't know whether the author has it at another location or whether he updated it.

Other answers here cite CityHash 1.0 as an example of a hash library that uses CRC32C. Apparently, this is used in the context of some longer (than 32 bits) hash codes but not for the CityHash32() function itself. Also, the use of CRC32 by the CityHash functions is relatively small compared with all the shifting and shuffling and other operations that are performed to produce the hash code. (This is not a critique of CityHash, with which I have no hands-on experience. I'll go out on a limb, from a cursory review of the source code, and say that the CityHash functions produce good, i.e. well-distributed, codes, but are not significantly faster than various other hash functions.)

Finally, you may also find insight on this issue in a quasi-duplicate question on SO.


Original answer and edit (April 2010)

A priori, this sounds like a bad idea!

CRC32 was not designed for hashing purposes, and its distribution is likely to not be uniform, hence making it a relatively poor hash-code. Furthermore, its "scrambling" power is relatively weak, making for a very poor one-way hash, as would be used in cryptographic applications.

[BRB: I'm looking for online references to that effect...]

Google's first [keywords = CRC32 distribution] hit seems to confirm this:
Evaluating CRC32 for hash tables

Edit: The page cited above, and indeed the complete article, provides a good basis of what to look for in hash functions.
A [quick] reading of this article confirmed the blanket statement that, in general, CRC32 should not be used as a hash. However, depending on the specific purpose of the hash, it may be possible to use a CRC32, at least in part, as a hash code.

For example, the lower (or higher, depending on implementation) 16 bits of the CRC32 code have a relatively even distribution. Provided that one isn't concerned about the cryptographic properties of the hash code (for example, the fact that similar keys produce very similar codes), it may be possible to build a hash code which uses, say, a concatenation of the lower [or higher] 16 bits of two CRC32 codes produced from the two halves (or whatever division) of the original key.
One would need to run tests to see whether the built-in CRC32 instruction is efficient enough, relative to alternative hash functions, that the overhead of calling the instruction twice, splicing the codes together, etc. doesn't result in an overall slower function.
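
The two-halves construction described above might look like the following minimal sketch. It uses Python's `zlib.crc32` (the standard CRC-32, not CRC32C) purely as a convenient stand-in for the hardware instruction. The function name `split_crc_hash`, the midpoint split, and the choice of the lower 16 bits are assumptions made for illustration.

```python
import zlib


def split_crc_hash(key: bytes) -> int:
    """Build a 32-bit hash from the low 16 bits of two CRC-32 values.

    Assumption: the key is split at its midpoint; the CRC of the first half
    supplies the low 16 bits of the result and the CRC of the second half
    supplies the high 16 bits.
    """
    mid = len(key) // 2
    low = zlib.crc32(key[:mid]) & 0xFFFF
    high = zlib.crc32(key[mid:]) & 0xFFFF
    return (high << 16) | low


if __name__ == "__main__":
    for key in (b"alpha", b"alphb", b"a much longer example key"):
        print(key, hex(split_crc_hash(key)))
```

Whether this beats a conventional hash function would still have to be measured, both for raw speed and for the collision rate on realistic keys.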
