Best hashing algorithm in terms of hash collisions and performance for strings


Problem description


What would be the best hashing algorithm if we had the following priorities (in that order):

  1. Minimal hash collisions
  2. Performance

It doesn't have to be secure. Basically I'm trying to create an index based on a combination of properties of some objects. All the properties are strings.

Any references to C# implementations would be appreciated.

Solution

Forget about the term "best". No matter which hash algorithm anyone might come up with, unless you have a very limited set of data that needs to be hashed, every algorithm that performs very well on average can become completely useless if it is fed just the right (or, from your perspective, "wrong") data.

Instead of wasting too much time thinking about how to make the hash more collision-free without using too much CPU time, I'd rather start thinking about how to make collisions less problematic. E.g. if every hash bucket is in fact a table and all strings in this table (that had a collision) are sorted alphabetically, you can search within a bucket table using binary search (which is only O(log n)), and that means even when every second hash bucket has 4 collisions, your code will still have decent performance (it will be a bit slower compared to a collision-free table, but not that much). One big advantage here is that if your table is big enough and your hash is not too simple, two strings resulting in the same hash value will usually look completely different, hence the binary search can stop comparing strings after maybe one or two characters on average, making every compare very fast.
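The sorted-bucket idea above can be sketched as follows. The question asks about C#, but for brevity this is an illustrative Python sketch (the class name, bucket count, and use of Python's built-in `hash` are all assumptions, not part of the original answer); the same structure carries over to a C# `List<string>` per bucket with `List<T>.BinarySearch`.

```python
import bisect

class SortedBucketHash:
    """Hash table whose buckets are alphabetically sorted lists.

    A sketch of the answer's idea: colliding keys land in the same
    bucket, but lookups within a bucket cost O(log n) via binary
    search instead of O(n) via a linear scan.
    """

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Any fast, non-cryptographic string hash works here;
        # Python's built-in hash() stands in for it.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, key)
        if i == len(bucket) or bucket[i] != key:
            bucket.insert(i, key)  # keep the bucket sorted on insert

    def contains(self, key):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, key)  # binary search in bucket
        return i < len(bucket) and bucket[i] == key
```

Because colliding strings usually differ early, each comparison inside the binary search is cheap, which is exactly the advantage the answer describes.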

Actually I once had a situation myself where searching directly within a sorted table using binary search turned out to be faster than hashing! Even though my hash algorithm was simple, it took quite some time to hash the values. Performance testing showed that hashing only became faster than binary search once I had more than about 700-800 entries. However, as the table could never grow larger than 256 entries anyway, and as the average table held fewer than 10 entries, benchmarking clearly showed that on every system and every CPU the binary search was faster. Here, the fact that comparing the first byte of the data was usually already enough to move on to the next bsearch iteration (because the data tended to differ within the first one or two bytes) turned out to be a big advantage.
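The skip-hashing alternative described above, a plain sorted table searched with binary search, is even simpler to sketch. This is an illustrative Python version (function names and sample data are assumptions for the example), not the answer's original implementation:

```python
import bisect

def make_index(strings):
    # For small tables (the answer's never exceeded 256 entries),
    # a plain sorted list plus binary search can beat hashing
    # outright: no time is spent hashing the key at all.
    return sorted(strings)

def lookup(index, key):
    # O(log n) membership test; each comparison typically decides
    # within the first byte or two when the strings are dissimilar.
    i = bisect.bisect_left(index, key)
    return i < len(index) and index[i] == key
```

The crossover point (here, roughly 700-800 entries in the answer's benchmarks) depends on how expensive the hash function is relative to a string comparison, so it is worth measuring for your own data.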

So to summarize: I'd take a decent hash algorithm that doesn't cause too many collisions on average and is rather fast (I'd even accept some more collisions if it's just very fast!), and instead optimize my code to take the smallest possible performance penalty once collisions do occur (and they will! They will unless your hash space is at least as big as your data space and you can map a unique hash value to every possible set of data).
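As for the questioner's scenario, indexing objects by a combination of string properties, one common approach is to hash the tuple of property values. The object type, property names, and use of Python's tuple hashing below are all hypothetical, shown only to illustrate the idea:

```python
from collections import namedtuple

# Hypothetical object type and properties, purely for illustration.
Person = namedtuple("Person", ["first", "last", "city"])

def composite_key(obj, prop_names):
    # Hashing the tuple of property values yields one composite,
    # non-cryptographic key. Tuple hashing mixes the fields, so
    # ("ab", "c") and ("a", "bc") do not trivially collide the way
    # naive string concatenation would.
    return hash(tuple(getattr(obj, p) for p in prop_names))
```

In C#, the analogous move is combining each property's `GetHashCode()` result, e.g. via `HashCode.Combine`, rather than concatenating the strings first.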

