ImmutableCollections SetN implementation detail


Question

I have sort of a hard time understanding an implementation detail from java-9 ImmutableCollections.SetN; specifically, why is there a need to make the inner array twice as large?

Suppose you do this:

    Set.of(1, 2, 3, 4) // 4 elements, but internal array is 8

More exactly, I perfectly understand why this is done (a double expansion) in the case of a HashMap, where you (almost) never want the load_factor to be one. A value != 1 improves search time, as entries are better dispersed across buckets.

But in the case of an immutable Set, I can't really tell why, especially given the way an index into the internal array is chosen.

Let me provide some details. First, how the index is searched:

 int idx = Math.floorMod(pe.hashCode() ^ SALT, elements.length);

pe is the actual value we put in the set. SALT is just 32 bits generated at start-up, once per JVM (this is the actual randomization, if you want). elements.length for our example is 8 (4 elements, but 8 slots here - double the size).

This expression is like a negative-safe modulo operation. Notice that the same logical thing is done in HashMap, for example ((n - 1) & hash), when the bucket is chosen.

So if elements.length is 8 for our case, then this expression will return any positive value that is less than 8 (0, 1, 2, 3, 4, 5, 6, 7).
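A quick illustration (not from the JDK source) of why Math.floorMod is used here rather than the % operator: hashCode() ^ SALT can be negative, and % would then produce a negative, invalid array index.

```java
public class FloorModDemo {
    public static void main(String[] args) {
        int negativeHash = -3; // a hashCode() ^ SALT result can be negative
        int len = 8;           // table length (double the element count)

        // The % operator keeps the sign of the dividend, so it can
        // yield a negative value that is unusable as an array index:
        System.out.println(negativeHash % len);               // -3

        // Math.floorMod always yields a value in [0, len):
        System.out.println(Math.floorMod(negativeHash, len)); // 5
    }
}
```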

Now the rest of the method:

    while (true) {
        E ee = elements[idx];
        if (ee == null) {
            return -idx - 1;
        } else if (pe.equals(ee)) {
            return idx;
        } else if (++idx == elements.length) {
            idx = 0;
        }
    }

Let's break it down:

    if (ee == null) {
        return -idx - 1;

This is good; it means that the current slot in the array is empty - we can put our value there.

    } else if (pe.equals(ee)) {
        return idx;

This is bad - the slot is occupied, and the entry already in place is equal to the one we want to put there. Sets can't have duplicate elements, so an Exception is later thrown.

 else if (++idx == elements.length) {
      idx = 0;
 }

This means that this slot is occupied (a hash collision), but the elements are not equal. In a HashMap this entry would be put into the same bucket, as a linked Node or TreeNode - but that's not the case here.

So the index is incremented and the next position is tried (with the small caveat that it wraps around in a circular way when it reaches the last position).
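The probe logic above can be sketched as a standalone method. This is a simplified model of ImmutableCollections.SetN.probe, not the JDK source itself; the SALT value is hard-coded here purely for illustration, whereas the JDK derives it randomly at startup.

```java
public class ProbeSketch {
    static final int SALT = 0x5DEECE66; // placeholder; the real SALT is random per JVM

    // Returns the index of pe if present, or (-insertionPoint - 1) if absent,
    // mirroring the return convention of SetN.probe.
    static int probe(Object[] elements, Object pe) {
        int idx = Math.floorMod(pe.hashCode() ^ SALT, elements.length);
        while (true) {
            Object ee = elements[idx];
            if (ee == null) {
                return -idx - 1;          // empty slot: pe is absent, could be inserted here
            } else if (pe.equals(ee)) {
                return idx;               // found it
            } else if (++idx == elements.length) {
                idx = 0;                  // wrap around to the start of the table
            }
        }
    }

    public static void main(String[] args) {
        Object[] table = new Object[8];   // 4 elements, table doubled to 8 slots
        for (Integer v : new Integer[] {1, 2, 3, 4}) {
            int i = probe(table, v);
            if (i < 0) table[-(i + 1)] = v; // insert at the slot the probe found
        }
        System.out.println(probe(table, 3) >= 0); // member: non-negative index
        System.out.println(probe(table, 9) < 0);  // non-member: negative "insertion point"
    }
}
```

Note that the loop only terminates because the table always contains at least one null slot; that invariant is exactly what the doubled array guarantees.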

And here is the question: if nothing too fancy (unless I'm missing something) is being done when searching for the index, why is there a need to have an array twice as big? Or why wasn't the function written like this:

    int idx = Math.floorMod(pe.hashCode() ^ SALT, input.length);
    // notice the diff: elements.length (8) and not input.length (4)

Answer

The current implementation of SetN is a fairly simple closed hashing scheme, as opposed to the separate chaining approach used by HashMap. ("Closed hashing" is also confusingly known as "open addressing".) In a closed hashing scheme, elements are stored in the table itself, instead of being stored in a list or tree of elements that are linked from each table slot, which is separate chaining.

This implies that if two different elements hash to the same table slot, this collision needs to be resolved by finding another slot for one of the elements. The current SetN implementation resolves this using linear probing, where the table slots are checked sequentially (wrapping around at the end) until an open slot is found.

If you want to store N elements, they'll certainly fit into a table of size N. You can always find any element that's in the set, though you might have to probe several (or many) successive table slots to find it, because there will be lots of collisions. But if the set is probed for an object that's not a member, linear probing will have to check every table slot before it can determine that the object isn't a member. With a full table, most probe operations degrade to O(N) time, whereas the goal of most hash-based approaches is for operations to be O(1) time.
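A toy illustration (not from the JDK) of the miss cost described above: with a completely full table, a lookup for an absent key must inspect every slot before it can fail, while a table with empty slots can stop at the first null it encounters.

```java
public class MissCostDemo {
    // Count how many slots a linear-probe lookup for an absent key inspects,
    // starting from index 0 (a worst-case starting point for a full table).
    static int probesForMiss(Object[] table) {
        int idx = 0, probes = 0;
        while (probes < table.length) {
            probes++;
            if (table[idx] == null) return probes; // empty slot: key definitely absent
            if (++idx == table.length) idx = 0;    // wrap around
        }
        return probes; // full table: every slot had to be checked
    }

    public static void main(String[] args) {
        Object[] full = {1, 2, 3, 4};                         // N elements, table of size N
        Object[] doubled = {1, 2, 3, 4, null, null, null, null}; // same N, table doubled

        System.out.println(probesForMiss(full));    // 4: all N slots inspected
        System.out.println(probesForMiss(doubled)); // 5: stops at the first null
    }
}
```

In this contrived layout the occupied slots form one cluster, so the doubled table still pays for the whole cluster; with the randomized SALT spreading entries out, clusters are much smaller and a miss typically ends after one or two probes.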

Thus we have a classic space-time tradeoff. If we make the table larger, there will be empty slots sprinkled throughout the table. When storing items, there should be fewer collisions, and linear probing will find empty slots more quickly. The clusters of full slots next to each other will be smaller. Probes for non-members will proceed more quickly, since they're more likely to encounter an empty slot sooner while probing linearly -- possibly without having to reprobe at all.

In bringing up the implementation, we ran a bunch of benchmarks using different expansion factors. (I used the term EXPAND_FACTOR in the code whereas most literature uses load factor. The reason is that the expansion factor is the reciprocal of the load factor, as used in HashMap, and using "load factor" for both meanings would be confusing.) When the expansion factor was near 1.0, the probe performance was quite slow, as expected. It improved considerably as the expansion factor was increased. The improvement was really flattening out by the time it got up to 3.0 or 4.0. We chose 2.0 since it got most of the performance improvement (close to O(1) time) while providing good space savings compared to HashSet. (Sorry, we haven't published these benchmark numbers anywhere.)
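In terms of sizing, the relationship is just a multiplication. A minimal sketch, reusing the EXPAND_FACTOR name mentioned in the answer (the surrounding class and variable names here are made up for illustration):

```java
public class SizingSketch {
    static final int EXPAND_FACTOR = 2; // reciprocal of a 0.5 load factor

    public static void main(String[] args) {
        Object[] input = {1, 2, 3, 4};
        // The backing table is allocated at EXPAND_FACTOR times the element
        // count, so Set.of(1, 2, 3, 4) backs 4 elements with an 8-slot array.
        Object[] elements = new Object[EXPAND_FACTOR * input.length];
        System.out.println(elements.length); // 8
    }
}
```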

Of course, all of these are implementation specifics and may change from one release to the next, as we find better ways to optimize the system. I'm certain there are ways to improve the current implementation. (And fortunately we don't have to worry about preserving iteration order when we do this.)

Sedgewick, Robert, and Kevin Wayne. Algorithms, Fourth Edition. Addison-Wesley, 2011.

The online book site is here, but note that the print edition has much more detail.
