How to choose the size of a hash table?


Problem description

Suppose I have 200,000 words, and I am going to use hash*33 + word[i] as the hash function. What table size should I choose to minimize memory/paging issues?

Platform used: C (C99).

Words are English words, ASCII values.

The hash table is initialized once (buckets of linked-list style),

then used for lookups, like a dictionary search.

After a collision, the word is added as a new node in the bucket.

Recommended answer

A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup, assuming you have a good hash function.

Based on that, you would want a minimum of about 266,700 buckets for 75% (200,000 / 0.75 ≈ 266,700), or 285,700 buckets for 70% (200,000 / 0.70 ≈ 285,700). That's assuming no collisions.

That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.

You might also consider a better hash function than hash*33 + word[i]. The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.

You could also just throw memory at the problem. A table size of 500,000 gives you a minimum load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain that you won't actually realize that.

Long story short: use a better hash function and do some testing at different table sizes.

There is such a thing as a minimal perfect hash. If you know what your input data is (i.e., it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.

