番石榴的ImmutableSet.contains的性能 [英] Performance of Guava's ImmutableSet.contains

查看:133
本文介绍了番石榴的ImmutableSet.contains的性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

番石榴的ImmutableSet在我关于contains的基准测试中似乎表现不佳.对于某些尺寸,它甚至比List慢得多:

Guava's ImmutableSet seems to perform quite poorly in my benchmark concerning contains. For some sizes it gets even much slower than List:

  size            benchmark         ns linear runtime
100000         ListContains  110279.54 ==
100000          SetContains       7.15 =
100000 ImmutableSetContains   76716.47 =
200000         ListContains  275367.66 =====
200000          SetContains       7.34 =
200000 ImmutableSetContains  322185.50 ======
500000         ListContains  935210.10 ====================
500000          SetContains       7.79 =
500000 ImmutableSetContains 1382765.76 ==============================

基本上,我用数千个负整数填充一个集合,并用非负整数测试包含的整数.该代码是微不足道的,但是对于粘贴在较小的文本区域中来说太长了,因此请查看此处.

Basically, I fill a set with a few thousands negative integers and test contains with non-negative ones. The code is trivial but a bit too long for pasting in a small text area, so please look here.

我想知道这是怎么回事.可能是我遇到了一些简陋的案件,尽管我显然没有尝试过.也许我只是吹破了基准.否则,我想知道是否可以而且应该解决该问题.

I wonder what's going on here. Probably, I hit some degenerate case, although I obviously didn't try to. Or maybe I've just blew the benchmark. Otherwise, I wonder if it can and should be fixed.

解决方案是通过替换

hashCode ^= (hashCode >>> 20) ^ (hashCode >>> 12);
return hashCode ^ (hashCode >>> 7) ^ (hashCode >>> 4);

作者

return C2 * Integer.rotateLeft(hashCode * C1, 15);

这大约需要相同的时间,并且可能有一些缺点,但是可以很好地散布散列来解决当前的问题.

This takes about the same time and might have some disadvantage, but solves the current problem by nicely spreading the hashes.

推荐答案

说明实际上非常简单.想象一下,整数位于某些N的集合M = {0, 1, ..., N-1}中,这是2的幂.由于Integer.hashCode的定义方式,哈希也是如此.通过与smear处理哈希.java#HashMap.hash%28int%29"rel =" nofollow>此,以便在某些常见情况下将最低位的冲突降到最低.

The explanation is actually very simple. Imagine the integers lie in the set M = {0, 1, ..., N-1} for some N, which is a power of two. And so do the hashes, because of how Integer.hashCode is defined. The hashes get processed via a function smear identical to this one in order to minimize collision in the lowest bits in some common cases.

分配大小为2*N的表,并将M的成员放入表中.由于smear仅涉及向右移位,因此它将M映射到自身,这意味着将填充表的连续范围.因此,可以说表的左半部分中的所有插槽均已使用,而另一半未使用.

A table of size 2*N gets allocated and the members of M get put into it. As smear involves only right shifts, it maps M onto itself, which means that a continuous range of the table gets filled. So lets say that all slots in the left half of the table are used and the other half is unused.

当调用contains(o)时,搜索从由o.hashCode()确定位置的插槽开始.如果找到o,则结果为true,如果命中了一个空插槽,则结果为false.否则,搜索进行到另一个时隙.为了最大程度地减少缓存未命中,使用了线性探测.

When contains(o) gets invoked, the search starts in a slot which position is determined by o.hashCode(). If o gets found, the result is true, if an empty slot gets hit, the result is false. Otherwise, the search proceeds to another slot. In order to minimize cache misses, linear probing gets used.

当我们很不幸无法在第一个使用的插槽开始搜索时,必须遍历所有这些位置,这意味着N步骤.从随机位置开始意味着平均N/4步进.

When we are unlucky enough to start the search at the first used slot, all of them have to be traversed, which means N steps. Starting at a random position means N/4 steps on the average.

在我的基准测试中发生的情况与上述情况不完全相同,但是其性能不佳的原因是相同的.将大小乘以2的幂会使问题变得更糟.

What happened in my benchmark is not exactly as above, but the reason for its bad performance is the same. Making the sizes to powers of two makes the problem worse.

这篇关于番石榴的ImmutableSet.contains的性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆