Why are hash table expansions usually done by doubling the size?


Question

I've done a little research on hash tables, and I keep running across the rule of thumb that when there are a certain number of entries (either max or via a load factor like 75%) the hash table should be expanded.

Almost always, the recommendation is to double (or double plus 1, i.e., 2n+1) the size of the hash table. However, I haven't been able to find a good reason for this.

Why double the size, rather than, say, increasing it by 25%, or increasing it to the size of the next prime number, or of the k-th next prime (e.g., three primes ahead)?

I already know that it's often a good idea to choose an initial hash table size which is a prime number, at least if your hash function uses a modulus, as in universal hashing. And I know that's why it's usually recommended to do 2n+1 instead of 2n (e.g., http://www.concentric.net/~Ttwang/tech/hashsize.htm).

However, as I said, I haven't seen any real explanation for why doubling or doubling-plus-one is actually a good choice rather than some other method of choosing a size for the new hash table.

(And yes I've read the Wikipedia article on hash tables :) http://en.wikipedia.org/wiki/Hash_table

Solution

Hash tables could not claim "amortized constant time insertion" if, for instance, resizing were done by a constant increment. In that case, the cost of a resize (which grows with the size of the hash table) would make the amortized cost of one insertion linear in the total number of elements inserted. Because resizing becomes more and more expensive as the table grows, resizes have to happen "less and less often" to keep the amortized cost of insertion constant.
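To see the arithmetic behind this, here is a small counting sketch (written purely for illustration; the initial capacity of 8 and the 1024-slot increment are arbitrary assumptions, not anyone's real implementation). It tallies how many element copies each growth policy performs over n insertions, since copying during rehashes dominates the cost of resizing:

```python
def total_copies(n, grow):
    """Insert n items one by one; grow(capacity) gives the next capacity.

    Returns the total number of elements copied during all resizes."""
    capacity, size, copies = 8, 0, 0
    for _ in range(n):
        if size == capacity:      # table full: allocate a bigger one and rehash
            copies += size        # every existing element must be moved
            capacity = grow(capacity)
        size += 1
    return copies

n = 1_000_000
print(total_copies(n, lambda c: 2 * c))     # doubling: ~n copies in total
print(total_copies(n, lambda c: c + 1024))  # constant step: ~n*n/2048 copies
```

The doubling policy performs about n copies in total (a geometric series: n/2 + n/4 + ... < n), so O(1) amortized per insertion; the constant-increment policy performs on the order of n²/2048 copies, so O(n) amortized, which is exactly the distinction described above.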

Most implementations allow the average bucket occupation to grow until a bound fixed in advance is reached (anywhere between 0.5 and 3, which are all acceptable values) before resizing. With this convention, just after resizing the average bucket occupation becomes half that bound. Resizing by doubling keeps the average bucket occupation within a band of factor 2.
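As a concrete sketch of that convention (illustrative only; the 0.75 bound and the initial 8 buckets are arbitrary choices, not a reference implementation), here is a minimal separate-chaining table that doubles once the bound is crossed:

```python
class HashTable:
    MAX_LOAD = 0.75  # average bucket occupation allowed before resizing

    def __init__(self):
        self.buckets = [[] for _ in range(8)]
        self.size = 0

    def insert(self, key, value):
        # Doubling keeps the load factor in roughly [MAX_LOAD / 2, MAX_LOAD].
        if (self.size + 1) / len(self.buckets) > self.MAX_LOAD:
            self._resize(2 * len(self.buckets))
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # existing key: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def _resize(self, new_capacity):
        old = self.buckets
        self.buckets = [[] for _ in range(new_capacity)]
        for bucket in old:            # rehash every element into the new table
            for k, v in bucket:
                self.buckets[hash(k) % new_capacity].append((k, v))
```

Right after each doubling, the load factor drops to roughly half the bound (0.75 to about 0.375 here), which is the factor-of-2 band described above.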

Sub-note: because of statistical clustering, you have to take an average bucket occupation as low as 0.5 if you want many buckets to have at most one element (maximum lookup speed, ignoring the complex effects of cache size), or as high as 3 if you want a minimum number of empty buckets (empty buckets correspond to wasted space).
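The sub-note's two extremes can be checked with the standard Poisson approximation for uniformly hashed keys (the Poisson model is my assumption here; the answer only says "statistical clustering"):

```python
from math import exp

# With keys hashed uniformly, the occupancy of a single bucket at load
# factor a is approximately Poisson-distributed with mean a.
for a in (0.5, 3.0):
    p_empty = exp(-a)                  # P(bucket is empty)
    p_at_most_one = exp(-a) * (1 + a)  # P(bucket holds 0 or 1 elements)
    print(f"load {a}: {p_empty:.0%} empty, {p_at_most_one:.0%} hold at most one")
```

At a load of 0.5, about 91% of buckets hold at most one element (fast lookups); at a load of 3, only about 5% of buckets sit empty (little wasted space), matching the trade-off described above.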
