How many hash buckets


Problem Description

If I notice that a hash table (or any other data structure built on top of a hash table) is filling up, at what point should I build a new table with more buckets? And given n items in the table so far, how do I figure out how many buckets to use in the new one?

So let's say I have 100 buckets. Should I reorganize it when there are 50 items in it? 500? 5000? Or should I look for the most-full bucket and key on that? And when I hit that point, how big should I make the new hash table?

Related to this, if you know beforehand roughly how many items will go in, is there a way to compute the number of buckets that gives good average performance?
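For the known-size case, one common approach is to divide the expected item count by a target load factor and round up to a convenient table size. The helper below is a sketch under assumed values (a 0.75 target load and power-of-two sizing), not a universal recipe:

```python
def suggest_bucket_count(expected_items, target_load=0.75):
    """Pick a power-of-two bucket count that keeps the load below target_load.

    The 0.75 target and power-of-two rounding are illustrative assumptions;
    other implementations use prime sizes or different load factors.
    """
    needed = int(expected_items / target_load) + 1
    buckets = 1
    while buckets < needed:
        buckets *= 2
    return buckets
```

With this policy, 100 expected items would get 256 buckets, so the table stays well under its load threshold even if the estimate is a little low.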

I know the real answer depends on a lot of other considerations, like how important speed is vs. size in a specific example, but I'm looking for general guidelines.

I also know that I shouldn't be optimizing this sort of thing unless good profiling has indicated that it is a bottleneck. I'm just thinking about a project that would use a lot of hash tables and wondering how to approach this.

Recommended Answer

A good rule of thumb (not always ideal, just a rule of thumb) is to re-hash when the hash table is filled up to 80%. That means if you have 100 buckets and 80 items inside, regardless of how many collisions you had before, it's time to increase the capacity.

How much should you increase it? Well, there is no perfect value either. The simplest solution is to double the capacity on each increase, so it goes to 200, 400, 800, and so on. If you think this is too much (after all, it will jump from 8 MB of memory to 16 MB when the hash table gets really large, and you may never fill up the 16 MB), choose a smaller growth factor. I'd recommend growing by at least one third (from 100 to 133, say); maybe let it grow by 50% each time as a compromise.
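The 80% trigger and the growth step can be combined into one small policy function. This is a minimal sketch; the names, the 0.8 threshold, and the factor-of-2 growth are the illustrative choices discussed above, not fixed constants:

```python
MAX_LOAD = 0.8       # re-hash once the table is 80% full (rule of thumb)
GROWTH_FACTOR = 2    # double on each resize: 100 -> 200 -> 400 -> 800

def next_capacity(item_count, bucket_count):
    """Return the bucket count to use after an insert: unchanged if the
    load is still acceptable, grown by GROWTH_FACTOR otherwise."""
    if item_count / bucket_count <= MAX_LOAD:
        return bucket_count
    return bucket_count * GROWTH_FACTOR
```

A table with 100 buckets keeps its capacity at 80 items but grows to 200 buckets at item 81.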

Note that all this also depends on how collisions are handled. A simple way to handle them (my personal favorite) is to store colliding items in a linked list. If 3 items are placed under the same key, there are still at most 3 comparisons to find one of them. Since linked lists are very inefficient for searching, you may want to increase capacity earlier, e.g. when 60% of the capacity is used, to keep the hash table fast. OTOH, you can do something more sophisticated and keep statistics about the number of collisions. As long as you hardly have any collisions (because you have a very good hash function), there is no need to re-hash at all, even if 99% of the capacity is in use. Also, if you handle collisions in a sophisticated way (e.g. each node is again a sorted table in which you can perform a binary search), your lookups might still be fast enough even if the table is loaded to 200% (that is, twice as many items as capacity). In that case you could keep statistics on how big the largest sorted table is, and when it grows beyond, say, 8 entries, conclude that lookups are getting too slow and re-hash.
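The chaining scheme with a longest-chain trigger can be sketched as follows. Everything here is an illustrative assumption: the class name, the chain-length threshold of 8 (taken from the discussion above), and the use of plain Python lists in place of linked lists:

```python
class ChainedTable:
    """Toy hash table with separate chaining; re-hashes when any one
    chain grows past MAX_CHAIN, rather than on raw load factor."""
    MAX_CHAIN = 8

    def __init__(self, buckets=16):
        self.buckets = [[] for _ in range(buckets)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)       # overwrite existing key
                return
        chain.append((key, value))
        if len(chain) > self.MAX_CHAIN:       # chain too long: grow and re-hash
            self._rehash()

    def get(self, key, default=None):
        for k, v in self._chain(key):         # at most chain-length compares
            if k == key:
                return v
        return default

    def longest_chain(self):
        return max(len(c) for c in self.buckets)

    def _rehash(self):
        old = [pair for chain in self.buckets for pair in chain]
        self.buckets = [[] for _ in range(len(self.buckets) * 2)]
        for k, v in old:
            self.put(k, v)
```

With a good hash function this table never re-hashes on load alone; only a pathologically long chain forces a resize, which is the statistics-driven policy described above.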

Re-hashing is very slow, so it should be avoided whenever possible. Thus, if you do need to re-hash, don't grow the capacity by too little, otherwise you'll have to re-hash again pretty soon when more items are added. So when you re-hash, make the capacity significantly larger than the current number of items in the table; anything else is too little capacity.
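The cost of growing too timidly is easy to see by counting item moves. The simulation below is a rough sketch (the starting capacity, load threshold, and increment are made-up numbers): it charges one "move" per item each time the table re-hashes, then compares doubling against a small fixed increment:

```python
def total_moves(n, grow, start=16, max_load=0.8):
    """Count how many item copies happen while inserting n items,
    re-hashing whenever the load exceeds max_load and growing the
    capacity with the supplied grow() policy."""
    capacity, moves, items = start, 0, 0
    for _ in range(n):
        items += 1
        if items / capacity > max_load:
            moves += items            # a re-hash copies every item once
            capacity = grow(capacity)
    return moves

doubling = total_moves(100_000, lambda c: c * 2)    # geometric growth
tiny_steps = total_moves(100_000, lambda c: c + 16) # fixed small increment
```

With geometric growth each item is copied only a constant number of times on average, while the fixed increment re-hashes over and over, which is exactly why the new capacity should be significantly larger than the current item count.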
