Why is the hash table resized by doubling it?


Question


Checking in Java and googling online for hash table code examples, it seems that resizing of the table is done by doubling it.
But most textbooks say that the best size for the table is a prime number.
So my question is:
Is the approach of doubling used because:

  1. It is easy to implement, or
  2. Finding a prime number is too inefficient (but I think that finding the next prime by going over n += 2 and testing for primality using modulo is O(log log N), which is cheap; see the sketch after this list), or
  3. Or is this my misunderstanding, and only certain hash table variants require a prime table size?
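
A minimal Java sketch of the next-prime search described in point 2 (illustrative only, not taken from any particular hash table implementation):

```java
// Sketch of point 2: find the next prime >= n by stepping over odd candidates
// and testing primality by trial division up to sqrt(candidate). The claimed
// cheapness comes from the density of primes (the expected gap to the next
// prime is small), not from the primality test being asymptotically fast.
static int nextPrime(int n) {
    if (n <= 2) return 2;
    int candidate = (n % 2 == 0) ? n + 1 : n;    // skip even numbers
    while (!isPrime(candidate)) candidate += 2;  // n += 2, as in the question
    return candidate;
}

static boolean isPrime(int n) {
    if (n % 2 == 0) return n == 2;
    for (int d = 3; (long) d * d <= n; d += 2)
        if (n % d == 0) return false;            // found a divisor
    return true;
}
```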

Update:
The prime-number approach presented in textbooks is required for certain properties to work (e.g. quadratic probing needs a prime-sized table to prove, for instance, that if the table is not full, item X will be inserted).
The link posted as a duplicate asks generally about increasing by any amount, e.g. 25% or to the next prime, and the accepted answer states that we double in order to keep resizing operations rare, so that we can guarantee amortized time.
It does not answer the question of having a table size that is prime and resizing to a prime that is even greater than double, i.e. keeping the properties of a prime size while taking the resizing overhead into account.

Solution

Q: But most textbooks say that the best size for the table is a prime number.

Regarding size primality:

As for the primality of the size: it depends on the collision resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic hashing), others don't, and those could benefit from a power-of-2 table size, because it allows very cheap modulo operations. However, when the closest available table sizes differ by a factor of 2, the hash table's memory usage may be hard to control. So even when using linear hashing or separate chaining, you can choose a non-power-of-2 size. In that case, in turn, it's worth choosing a prime size in particular, because:

If you pick a prime table size (either because the algorithm requires it, or because you are not satisfied with the memory usage unreliability implied by a power-of-2 size), the table slot computation (modulo by the table size) can be combined with the hashing. See this answer for more.
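
A minimal sketch of the trade-off discussed above: for a power-of-2 capacity the slot computation is a single bit mask, while for a prime capacity it needs a real integer remainder. The class and method names are invented for illustration:

```java
// Slot index computation for the two sizing strategies discussed above.
final class SlotComputation {
    static int slotPow2(int hash, int capacity) {   // capacity == 1 << k
        return hash & (capacity - 1);               // cheap: a single AND
    }

    static int slotPrime(int hash, int capacity) {  // capacity is prime
        return (hash & 0x7fffffff) % capacity;      // mask first so the result is non-negative
    }
}
```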

The point that a power-of-2 table size is undesirable when the hash function's distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have a bad hash function, avalanching it and still using a power-of-2 size would be faster than switching to a prime table size: a single integer division is still slower on modern CPUs than the several multiplications and shift operations required by good avalanching functions, e.g. from MurmurHash3.
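
As an illustration, MurmurHash3's published 64-bit finalizer ("fmix64") avalanches the bits using only shifts, XORs, and multiplications:

```java
// MurmurHash3 64-bit finalizer (fmix64): avalanches the input bits so that
// even a poor-quality hash can safely be reduced with a power-of-2 mask.
static long fmix64(long k) {
    k ^= k >>> 33;
    k *= 0xff51afd7ed558ccdL;
    k ^= k >>> 33;
    k *= 0xc4ceb9fe1a85ec53L;
    k ^= k >>> 33;
    return k;
}
```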


Q: Also, to be honest, I got a bit lost on whether you actually recommend primes or not. It seems that it depends on the hash table variant and the quality of the hash function?

  1. The quality of the hash function doesn't matter: you can always "improve" a hash function by MurMur3 avalanching, which is cheaper than switching to a prime table size from a power-of-2 table size, see above.

  2. I recommend choosing a prime size, with the QHash or quadratic hashing algorithm (they aren't the same), only when you need precise control over the hash table's load factor and predictably high actual loads. With a power-of-2 table size the minimum resize factor is 2, so in general we cannot guarantee that the hash table will have an actual load factor any higher than 0.5. For example, if you resize at a load factor of 0.75, the load drops to about 0.375 immediately after doubling. See this answer.

    Otherwise, I recommend going with a power-of-2 sized hash table with linear probing, as in the sketch below.
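
A minimal sketch (not production code) of this recommended combination: power-of-2 capacity, linear probing, and an avalanching hash. For brevity it stores long keys and values and uses 0 as the "empty slot" sentinel, so key 0 is not supported:

```java
final class LinearProbingMap {
    private long[] keys = new long[16];   // capacity is always a power of 2
    private long[] values = new long[16];
    private int size;

    private static int mix(long k) {      // MurmurHash3-style avalanche step
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        return (int) k;
    }

    void put(long key, long value) {
        if ((size + 1) * 2 > keys.length) resize();  // keep load factor <= 0.5
        int mask = keys.length - 1;
        int i = mix(key) & mask;                     // cheap "modulo" via bit mask
        while (keys[i] != 0 && keys[i] != key)
            i = (i + 1) & mask;                      // linear probing
        if (keys[i] == 0) size++;
        keys[i] = key;
        values[i] = value;
    }

    Long get(long key) {
        int mask = keys.length - 1;
        int i = mix(key) & mask;
        while (keys[i] != 0) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return null;                                 // not found
    }

    private void resize() {                          // doubling, as discussed below
        long[] oldKeys = keys, oldValues = values;
        keys = new long[oldKeys.length * 2];
        values = new long[oldValues.length * 2];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++)
            if (oldKeys[i] != 0) put(oldKeys[i], oldValues[i]);
    }
}
```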

Q: Is the approach of doubling used because it is easy to implement?

Basically, in many cases, yes. See this large answer regarding load factors:

Load factor is not an essential part of the hash table data structure -- it is a way to define rules of behaviour for a dynamic system (a growing/shrinking hash table is a dynamic system).

Moreover, in my opinion, in 95% of modern hash table cases this approach is oversimplified: the dynamic system behaves suboptimally.

What is doubling? It's just the simplest resizing strategy. The strategy could be arbitrarily complex, performing optimally in your use cases. It could consider the present hash table size, growth intensity (how many operations have been performed since the previous resize), etc. Nobody forbids you from implementing such custom resizing logic, as in the sketch below.
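
A hypothetical custom resizing policy, sketched only to show that the growth factor need not be a constant 2. All names and thresholds here are invented for illustration, not taken from any real implementation:

```java
final class ResizePolicy {
    static int nextCapacity(int currentCapacity, long opsSinceLastResize) {
        if (currentCapacity < (1 << 16))
            return currentCapacity * 2;                  // small table: classic doubling
        if (opsSinceLastResize > 4L * currentCapacity)
            return currentCapacity * 2;                  // growing fast: double to keep resizes rare
        return currentCapacity + currentCapacity / 2;    // growing slowly: a gentler ~1.5x wastes less memory
    }
}
```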

Q: Is finding a prime number too inefficient? (But I think that finding the next prime by going over n += 2 and testing for primality using modulo is O(log log N), which is cheap.)

It is good practice to precompute some subset of the prime hash table sizes and choose between them using binary search at runtime. See the list of double hash capacities and the explanation (https://github.com/OpenHFT/Koloboke/blob/344089c9fc7c2b53ba7d1299eb29214206e1ab1d/lib/impl/src/main/java/net/openhft/koloboke/collect/impl/hash/DHashCapacities.java#L256) and the QHash capacities (https://github.com/OpenHFT/Koloboke/blob/344089c9fc7c2b53ba7d1299eb29214206e1ab1d/lib/impl/src/main/java/net/openhft/koloboke/collect/impl/hash/QHashCapacities.java#L288). Or even use a direct lookup (https://github.com/OpenHFT/Koloboke/blob/344089c9fc7c2b53ba7d1299eb29214206e1ab1d/lib/impl/src/main/java/net/openhft/koloboke/collect/impl/hash/QHashCapacities.java#L153), which is very fast.
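
A minimal sketch of that practice. The short list below of roughly doubling primes is illustrative; real implementations precompute far more entries (see the Koloboke capacity tables linked above):

```java
import java.util.Arrays;

// Precomputed prime capacities, selected by binary search at runtime.
final class PrimeCapacities {
    private static final int[] PRIMES = {
        53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593
    };

    static int capacityFor(int desiredCapacity) {
        int i = Arrays.binarySearch(PRIMES, desiredCapacity);
        if (i < 0) i = -i - 1;   // not found: insertion point = next larger prime
        if (i == PRIMES.length)
            throw new IllegalArgumentException("beyond this sketch's table");
        return PRIMES[i];
    }
}
```

For example, `capacityFor(100)` returns 193, the smallest precomputed prime not below the requested capacity.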

Q: Or is this my misunderstanding, and only certain hash table variants require a prime table size?

Yes, only certain variants require it, see above.
