ImmutableCollections SetN实现细节 [英] ImmutableCollections SetN implementation detail

查看:133
本文介绍了ImmutableCollections SetN实现细节的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我很难理解java-9中的实现细节 ImmutableCollections.SetN ;具体为什么需要增加内部数组两次。



假设你这样做:

  Set.of(1,2,3,4)// 4个元素,但内部数组为8 

更确切地说,我完全理解为什么这样做(双重扩展),如果是 HashMap - 你从来没有(几乎)想要 load_factor 是一个。值!= 1 可以改善搜索时间,例如条目更好地分散到存储桶中。



但是以防万一一个不可变集 - 我无法分辨。特别是因为选择内部数组的索引的方式。



让我提供一些细节。首先如何搜索索引:

  int idx = Math.floorMod(pe.hashCode()^ SALT,elements.length) ; 

pe 是我们投入的实际价值集合。 SALT 在启动时只生成32位,每 JVM 一次(如果需要,这是实际的随机化) 。 elements.length 我们的例子是 8 (4个元素,但这里是8个 - 大小加倍)。



此表达式类似于负安全模运算。请注意,在 HashMap 中完成相同的逻辑操作(例如,(n - 1)& hash )桶被选中。



因此,对于我们的情况,如果 elements.length为8 ,那么此表达式将返回任何较小的正值比8 (0,1,2,3,4,5,6,7)



现在其余方法:

  while(true){
E ee = elements [idx];
if(ee == null){
return -idx - 1;
}否则如果(pe.equals(ee)){
返回idx;
} else if(++ idx == elements.length){
idx = 0;
}
}

让我们将其分解:

  if(ee == null){
return -idx - 1;

这很好,这意味着数组中的当前插槽是空的 - 我们可以把我们的值那里。

 }否则如果(pe.equals(ee)){
返回idx;

这很糟糕 - 插槽已被占用且已存在的条目等于我们想要的放。 设置 s不能有重复的元素 - 因此稍后会抛出异常。

  else if(++ idx == elements.length){
idx = 0;
}

这意味着此插槽已被占用(哈希冲突),但元素不是等于。在 HashMap 中,此条目将与 LinkedNode TreeNode <放在同一个存储桶中/ code> - 但不是这里的情况。



所以索引递增并尝试下一个位置(小警告它以圆形方式移动当它到达最后一个位置时。)



这是一个问题:如果没有什么太花哨(除非我遗漏了什么)在搜索索引时正在进行,为什么是否需要两倍大的阵列?或者为什么函数不是这样写的:

  int idx = Math.floorMod(pe.hashCode()^ SALT,输入。长度); 

//注意diff elements.length(8)而不是input.length(4)


解决方案

SetN 的当前实现是一个相当简单的封闭散列方案,而不是使用单独的链接方法通过 HashMap 。 (封闭散列也被混淆地称为开放式寻址。)在封闭的散列方案中,元素存储在表本身中,而不是存储在从每个表槽链接的元素的列表或树中,这是单独的链接。



这意味着如果两个不同的元素散列到同一个表槽,则需要通过为其中一个元素找到另一个槽来解决此冲突。当前的 SetN 实现使用线性探测解决了这个问题,其中顺序检查表槽(最后回绕)直到找到一个打开的槽。



如果你想存储 N 元素,它们肯定适合大小 N 的表格。你总是可以找到集合中的任何元素,尽管你可能需要探测几个(或许多)连续的表槽来找到它,因为会有很多冲突。但是,如果探测到的是不是成员的对象,则线性探测必须检查每个表槽,然后才能确定该对象不是成员。使用完整表,大多数探测操作将降级到O(N)时间,而大多数基于散列的方法的目标是操作为O(1)时间。



<因此,我们有一个类时空权衡。如果我们把桌子做得更大,整个桌子上都会有空的插槽。存储项目时,应该有更少的冲突,线性探测将更快地找到空槽。彼此相邻的完整时隙簇将更小。非成员的探测器将更快地进行,因为他们更容易在线性探测时更快地遇到空槽 - 可能在完全不需要重新探测之后。



在实施过程中,我们使用不同的扩展因子运行了一系列基准测试。 (我在代码中使用了术语 EXPAND_FACTOR ,而大多数文献都使用载荷因子。原因是扩展因子是载荷因子的倒数,如<$中所使用的那样c $ c> HashMap ,并且对两种含义使用加载因子会令人困惑。)当扩展因子接近1.0时,探测器性能非常慢,正如预期的那样。随着扩张系数的增加,它得到了显着改善。到达3.0或4.0时,这种改善确实很平坦。我们之所以选择2.0,是因为它获得了大部分的性能提升(接近O(1)时间),与 HashSet 相比,提供了更好的空间节省。 (对不起,我们没有在任何地方发布这些基准测试数字。)



当然,所有这些都是实现细节,可能会从一个版本更改为下一个版本,因为我们找到更好的方法来优化系统。我确信有办法改进当前的实施。 (幸运的是,当我们这样做时,我们不必担心保留迭代顺序。)



有关负载因子的开放寻址和性能权衡的一个很好的讨论可以在3.4节中找到


塞奇威克,罗伯特和凯文韦恩。 算法,第四版。 Addison-Wesley,2011。


在线图书网站这里但请注意印刷版有更多细节。


I have sort of a hard time understanding an implementation detail from java-9 ImmutableCollections.SetN; specifically why is there a need to increase the inner array twice.

Suppose you do this:

Set.of(1,2,3,4) // 4 elements, but internal array is 8

More exactly I perfectly understand why this is done (a double expansion) in case of a HashMap - where you never (almost) want the load_factor to be one. A value of !=1 improves search time as entries are better dispersed to buckets for example.

But in case of an immutable Set - I can't really tell. Especially since the way an index of the internal array is chosen.

Let me provide some details. First how the index is searched:

 int idx = Math.floorMod(pe.hashCode() ^ SALT, elements.length);

pe is the actual value we put in the set. SALT is just 32 bits generated at start-up, once per JVM (this is the actual randomization if you want). elements.length for our example is 8 (4 elements, but 8 here - double the size).

This expression is like a negative-safe modulo operation. Notice that the same logical thing is done in HashMap for example ((n - 1) & hash) when the bucket is chosen.

So if elements.length is 8 for our case, then this expression will return any positive value that is less than 8 (0, 1, 2, 3, 4, 5, 6, 7).

Now the rest of the method:

 while (true) {
        E ee = elements[idx];
        if (ee == null) {
            return -idx - 1;
        } else if (pe.equals(ee)) {
            return idx;
        } else if (++idx == elements.length) {
            idx = 0;
        }
    }

Let's break it down:

if (ee == null) {
    return -idx - 1;

This is good, it means that the current slot in the array is empty - we can put our value there.

} else if (pe.equals(ee)) {
    return idx;

This is bad - slot is occupied and the already in place entry is equal to the one we want to put. Sets can't have duplicate elements - so an Exception is later thrown.

 else if (++idx == elements.length) {
      idx = 0;
 }

This means that this slot is occupied (hash collision), but elements are not equal. In a HashMap this entry would be put to the same bucket as a LinkedNode or TreeNode - but not the case here.

So index is incremented and the next position is tried (with the small caveat that it moves in a circular way when it reaches the last position).

And here is the question: if nothing too fancy (unless I'm missing something) is being done when searching the index, why is there a need to have an array twice as big? Or why the function was not written like this:

int idx = Math.floorMod(pe.hashCode() ^ SALT, input.length);

// notice the diff elements.length (8) and not input.length (4)

解决方案

The current implementation of SetN is a fairly simple closed hashing scheme, as opposed to the separate chaining approach used by HashMap. ("Closed hashing" is also confusingly known as "open addressing".) In a closed hashing scheme, elements are stored in the table itself, instead of being stored in a list or tree of elements that are linked from each table slot, which is separate chaining.

This implies that if two different elements hash to the same table slot, this collision needs to be resolved by finding another slot for one of the elements. The current SetN implementation resolves this using linear probing, where the table slots are checked sequentially (wrapping around at the end) until an open slot is found.

If you want to store N elements, they'll certainly fit into a table of size N. You can always find any element that's in the set, though you might have to probe several (or many) successive table slots to find it, because there will be lots of collisions. But if the set is probed for an object that's not a member, linear probing will have to check every table slot before it can determine that object isn't a member. With a full table, most probe operations will degrade to O(N) time, whereas the goal of most hash-based approaches is for operations to be O(1) time.

Thus we have a class space-time tradeoff. If we make the table larger, there will be empty slots sprinkled throughout the table. When storing items, there should be fewer collisions, and linear probing will find empty slots more quickly. The clusters of full slots next to each other will be smaller. Probes for non-members will proceed more quickly, since they're more likely to encounter an empty slot sooner while probing linearly -- possibly after not having to reprobe at all.

In bringing up the implementation, we ran a bunch of benchmarks using different expansion factors. (I used the term EXPAND_FACTOR in the code whereas most literature uses load factor. The reason is that the expansion factor is the reciprocal of the load factor, as used in HashMap, and using "load factor" for both meanings would be confusing.) When the expansion factor was near 1.0, the probe performance was quite slow, as expected. It improved considerably as the expansion factor was increased. The improvement was really flattening out by the time it got up to 3.0 or 4.0. We chose 2.0 since it got most of the performance improvement (close to O(1) time) while providing good space savings compared to HashSet. (Sorry, we haven't published these benchmark numbers anywhere.)

Of course, all of these are implementation specifics and may change from one release to the next, as we find better ways to optimize the system. I'm certain there are ways to improve the current implementation. (And fortunately we don't have to worry about preserving iteration order when we do this.)

A good discussion of open addressing and performance tradeoffs with load factors can be found in section 3.4 of

Sedgewick, Robert and Kevin Wayne. Algorithms, Fourth Edition. Addison-Wesley, 2011.

The online book site is here but note that the print edition has much more detail.

这篇关于ImmutableCollections SetN实现细节的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆