Strange JIT pessimization of a loop idiom
Question
While analyzing the results of a recent question here, I encountered a quite peculiar phenomenon: apparently an extra layer of HotSpot's JIT-optimization actually slows down execution on my machine.
Here is the code I have used for the measurement:
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.ARRAY_SIZE)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class Measure
{
    public static final int ARRAY_SIZE = 1024;

    private final int[] array = new int[ARRAY_SIZE];

    @Setup public void setup() {
        final Random random = new Random();
        for (int i = 0; i < ARRAY_SIZE; ++i) {
            final int x = random.nextInt();
            array[i] = x == 0 ? 1 : x;
        }
    }

    @GenerateMicroBenchmark public int normalIndex() {
        final int[] array = this.array;
        int result = 0;
        for (int i = 0; i < array.length; i++) {
            final int j = i & array.length - 1;
            final int entry = array[i];
            result ^= entry + j;
        }
        return result;
    }

    @GenerateMicroBenchmark public int maskedIndex() {
        final int[] array = this.array;
        int result = 0;
        for (int i = 0; i < array.length; i++) {
            final int j = i & array.length - 1;
            final int entry = array[j];
            result ^= entry + i;
        }
        return result;
    }

    @GenerateMicroBenchmark public int normalWithExitPoint() {
        final int[] array = this.array;
        int result = 0;
        for (int i = 0; i < array.length; i++) {
            final int j = i & array.length - 1;
            final int entry = array[i];
            result ^= entry + j;
            if (entry == 0) break;
        }
        return result;
    }

    @GenerateMicroBenchmark public int maskedWithExitPoint() {
        final int[] array = this.array;
        int result = 0;
        for (int i = 0; i < array.length; i++) {
            final int j = i & array.length - 1;
            final int entry = array[j];
            result ^= entry + i;
            if (entry == 0) break;
        }
        return result;
    }
}
The code is quite subtle, so let me point out the important bits:

- the "normal index" variants use the straight variable i for the array index. HotSpot can easily determine the range of i throughout the loop and eliminate the array bounds check;
- the "masked index" variants index by j, which is actually equal to i, but that fact is "hidden" from HotSpot through the AND-masking operation;
- the "with exit point" variants introduce an explicit loop exit point. The importance of this will be explained below.
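The masking trick only works because ARRAY_SIZE is a power of two: masking with length-1 then leaves every in-range index unchanged. A minimal sketch verifying this identity (the class and method names are purely illustrative):

```java
public class MaskIdentity {
    // For a power-of-two length, (i & (len - 1)) == i for all 0 <= i < len,
    // so j in the benchmark is always equal to i -- HotSpot just cannot prove it.
    static boolean maskIsIdentity(int len) {
        for (int i = 0; i < len; i++) {
            if ((i & (len - 1)) != i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(maskIsIdentity(1024)); // the benchmark's ARRAY_SIZE
    }
}
```

For a non-power-of-two length the identity breaks, which is why the benchmark fixes ARRAY_SIZE at 1024.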
Observe that the bounds check figures in two important ways:

- it has runtime overhead associated with it (a comparison followed by a conditional branch);
- it constitutes a loop exit point which can interrupt the loop on any step. This turns out to have important consequences for the applicable JIT optimizations.
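For illustration, here is roughly what the bounds check amounts to when written out in Java (the helper names are made up; the JIT emits this inline as a compare and branch, not as a method call):

```java
public class BoundsCheck {
    // Conceptually what HotSpot's bounds check does before array[j]:
    // a comparison followed by a conditional branch whose taken path
    // leaves the loop -- which is exactly why it counts as an exit point.
    static int checkedLoad(int[] a, int j) {
        if (j < 0 || j >= a.length) {
            throw new ArrayIndexOutOfBoundsException(j);
        }
        return a[j];
    }

    static boolean throwsOutOfRange(int[] a, int j) {
        try {
            checkedLoad(a, j);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30};
        System.out.println(checkedLoad(a, 2));      // in range: plain load
        System.out.println(throwsOutOfRange(a, 3)); // out of range: exit path taken
    }
}
```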
By inspecting the machine code emitted for the above four methods, I have noticed the following:

- in all cases the loop is unrolled;
- in the case of normalIndex, which is distinguished as the only one having no premature loop exit points, the operations of all the unrolled steps are reordered such that first all the array fetches are performed, followed by XOR-ing all the values into the accumulator.
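That reordering is legal because XOR is associative and commutative. A hand-written sketch of the same transformation, at 4-way unrolling for brevity (illustrative only; this is not the JIT's actual output):

```java
public class LoadsFirst {
    // The loop body as written: each step loads, adds, and xors in turn.
    static int interleaved(int[] a) {
        int result = 0;
        for (int i = 0; i < a.length; i++) {
            result ^= a[i] + (i & (a.length - 1));
        }
        return result;
    }

    // The reordering applied to normalIndex: within the unrolled body,
    // perform all the array fetches first, then fold every value into
    // the accumulator. The result is unchanged because XOR commutes.
    static int loadsFirst(int[] a) {
        int mask = a.length - 1;
        int result = 0;
        int i = 0;
        for (; i + 4 <= a.length; i += 4) {
            int t0 = a[i]     + ( i      & mask);
            int t1 = a[i + 1] + ((i + 1) & mask);
            int t2 = a[i + 2] + ((i + 2) & mask);
            int t3 = a[i + 3] + ((i + 3) & mask);
            result ^= t0;
            result ^= t1;
            result ^= t2;
            result ^= t3;
        }
        for (; i < a.length; i++) { // tail for lengths not divisible by 4
            result ^= a[i] + (i & mask);
        }
        return result;
    }
}
```

An exit point in the middle of the body would forbid this: the loads after the exit must not be performed on an iteration that breaks out.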
Now we can classify the four methods according to the discussed features:

- normalIndex has no bounds checks and no loop exit points;
- normalWithExitPoint has no bounds checks and 1 exit point;
- maskedIndex has 1 bounds check and 1 exit point;
- maskedWithExitPoint has 1 bounds check and 2 exit points.
The obvious expectation is that the above list should present the methods in descending order of performance; however, these are my actual results:
Benchmark Mode Samples Mean Mean error Units
normalIndex avgt 20 0.946 0.010 ns/op
normalWithExitPoint avgt 20 0.807 0.010 ns/op
maskedIndex avgt 20 0.803 0.007 ns/op
maskedWithExitPoint avgt 20 1.007 0.009 ns/op
- normalWithExitPoint and maskedIndex are identical modulo measurement error, even though only the latter has the bounds check;
- the greatest anomaly is observed on normalIndex, which should have been the fastest, but is significantly slower than normalWithExitPoint, identical to it in every respect save for having one more line of code, the one which introduces the exit point.
I was testing on:

- Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode) (Java 7 Update 40)
- OS X Version 10.9.1
- 2.66 GHz Intel Core i7
Since normalIndex is the only method which has the extra reordering "optimization" applied to it, the conclusion is that this is the cause of the slowdown.
I have successfully reproduced the results on Java 8 EA b118 as well.
Is the above phenomenon reproducible on other, similar machines? From the question mentioned at the beginning I already have a hint that at least some machines do not reproduce it, so another result from the same CPU would be very interesting.
I have gathered the following table which correlates execution times with the -XX:LoopUnrollLimit command-line argument. Here I focus only on two variants, with and without the if (entry == 0) break; line:
LoopUnrollLimit: 14 15 18 19 22 23 60
withExitPoint: 96 95 95 79 80 80 69 1/100 ns
withoutExitPoint: 94 64 64 63 64 77 75 1/100 ns
The following sudden changes can be observed:

- on the transition from 14 to 15, the withoutExitPoint variant receives a beneficial LCM¹ transformation and is significantly sped up. Due to the loop-unrolling limit, all loaded values fit into registers;
- on the transition from 18 to 19, the withExitPoint variant receives a speedup, lesser than the above;
- on the transition from 22 to 23, the withoutExitPoint variant is slowed down. At this point I see the spill-over into stack locations, as described in maaartinus's answer, starting to happen.

The default LoopUnrollLimit for my setup is 60, so I present its results in the last column.
¹ LCM = Local Code Motion. It is the transformation which causes all the array accesses to happen at the top, followed by the processing of the loaded values.
https://bugs.openjdk.java.net/browse/JDK-7101232
0x00000001044a37c0: mov ecx,eax
0x00000001044a37c2: and ecx,esi ;*iand
; - org.sample.Measure::normalIndex@20 (line 44)
0x00000001044a37c4: mov rbp,QWORD PTR [rsp+0x28] ;*iload_3
; - org.sample.Measure::normalIndex@15 (line 44)
0x00000001044a37c9: add ecx,DWORD PTR [rbp+rsi*4+0x10]
0x00000001044a37cd: xor ecx,r8d
0x00000001044a37d0: mov DWORD PTR [rsp],ecx
0x00000001044a37d3: mov r10d,esi
0x00000001044a37d6: add r10d,0xf
0x00000001044a37da: and r10d,eax
0x00000001044a37dd: mov r8d,esi
0x00000001044a37e0: add r8d,0x7
0x00000001044a37e4: and r8d,eax
0x00000001044a37e7: mov DWORD PTR [rsp+0x4],r8d
0x00000001044a37ec: mov r11d,esi
0x00000001044a37ef: add r11d,0x6
0x00000001044a37f3: and r11d,eax
0x00000001044a37f6: mov DWORD PTR [rsp+0x8],r11d
0x00000001044a37fb: mov r8d,esi
0x00000001044a37fe: add r8d,0x5
0x00000001044a3802: and r8d,eax
0x00000001044a3805: mov DWORD PTR [rsp+0xc],r8d
0x00000001044a380a: mov r11d,esi
0x00000001044a380d: inc r11d
0x00000001044a3810: and r11d,eax
0x00000001044a3813: mov DWORD PTR [rsp+0x10],r11d
0x00000001044a3818: mov r8d,esi
0x00000001044a381b: add r8d,0x2
0x00000001044a381f: and r8d,eax
0x00000001044a3822: mov DWORD PTR [rsp+0x14],r8d
0x00000001044a3827: mov r11d,esi
0x00000001044a382a: add r11d,0x3
0x00000001044a382e: and r11d,eax
0x00000001044a3831: mov r9d,esi
0x00000001044a3834: add r9d,0x4
0x00000001044a3838: and r9d,eax
0x00000001044a383b: mov r8d,esi
0x00000001044a383e: add r8d,0x8
0x00000001044a3842: and r8d,eax
0x00000001044a3845: mov DWORD PTR [rsp+0x18],r8d
0x00000001044a384a: mov r8d,esi
0x00000001044a384d: add r8d,0x9
0x00000001044a3851: and r8d,eax
0x00000001044a3854: mov ebx,esi
0x00000001044a3856: add ebx,0xa
0x00000001044a3859: and ebx,eax
0x00000001044a385b: mov ecx,esi
0x00000001044a385d: add ecx,0xb
0x00000001044a3860: and ecx,eax
0x00000001044a3862: mov edx,esi
0x00000001044a3864: add edx,0xc
0x00000001044a3867: and edx,eax
0x00000001044a3869: mov edi,esi
0x00000001044a386b: add edi,0xd
0x00000001044a386e: and edi,eax
0x00000001044a3870: mov r13d,esi
0x00000001044a3873: add r13d,0xe
0x00000001044a3877: and r13d,eax
0x00000001044a387a: movsxd r14,esi
0x00000001044a387d: add r10d,DWORD PTR [rbp+r14*4+0x4c]
0x00000001044a3882: mov DWORD PTR [rsp+0x24],r10d
0x00000001044a3887: mov QWORD PTR [rsp+0x28],rbp
0x00000001044a388c: mov ebp,DWORD PTR [rsp+0x4]
0x00000001044a3890: mov r10,QWORD PTR [rsp+0x28]
0x00000001044a3895: add ebp,DWORD PTR [r10+r14*4+0x2c]
0x00000001044a389a: mov DWORD PTR [rsp+0x4],ebp
0x00000001044a389e: mov r10d,DWORD PTR [rsp+0x8]
0x00000001044a38a3: mov rbp,QWORD PTR [rsp+0x28]
0x00000001044a38a8: add r10d,DWORD PTR [rbp+r14*4+0x28]
0x00000001044a38ad: mov DWORD PTR [rsp+0x8],r10d
0x00000001044a38b2: mov r10d,DWORD PTR [rsp+0xc]
0x00000001044a38b7: add r10d,DWORD PTR [rbp+r14*4+0x24]
0x00000001044a38bc: mov DWORD PTR [rsp+0xc],r10d
0x00000001044a38c1: mov r10d,DWORD PTR [rsp+0x10]
0x00000001044a38c6: add r10d,DWORD PTR [rbp+r14*4+0x14]
0x00000001044a38cb: mov DWORD PTR [rsp+0x10],r10d
0x00000001044a38d0: mov r10d,DWORD PTR [rsp+0x14]
0x00000001044a38d5: add r10d,DWORD PTR [rbp+r14*4+0x18]
0x00000001044a38da: mov DWORD PTR [rsp+0x14],r10d
0x00000001044a38df: add r13d,DWORD PTR [rbp+r14*4+0x48]
0x00000001044a38e4: add r11d,DWORD PTR [rbp+r14*4+0x1c]
0x00000001044a38e9: add r9d,DWORD PTR [rbp+r14*4+0x20]
0x00000001044a38ee: mov r10d,DWORD PTR [rsp+0x18]
0x00000001044a38f3: add r10d,DWORD PTR [rbp+r14*4+0x30]
0x00000001044a38f8: mov DWORD PTR [rsp+0x18],r10d
0x00000001044a38fd: add r8d,DWORD PTR [rbp+r14*4+0x34]
0x00000001044a3902: add ebx,DWORD PTR [rbp+r14*4+0x38]
0x00000001044a3907: add ecx,DWORD PTR [rbp+r14*4+0x3c]
0x00000001044a390c: add edx,DWORD PTR [rbp+r14*4+0x40]
0x00000001044a3911: add edi,DWORD PTR [rbp+r14*4+0x44]
0x00000001044a3916: mov r10d,DWORD PTR [rsp+0x10]
0x00000001044a391b: xor r10d,DWORD PTR [rsp]
0x00000001044a391f: mov ebp,DWORD PTR [rsp+0x14]
0x00000001044a3923: xor ebp,r10d
0x00000001044a3926: xor r11d,ebp
0x00000001044a3929: xor r9d,r11d
0x00000001044a392c: xor r9d,DWORD PTR [rsp+0xc]
0x00000001044a3931: xor r9d,DWORD PTR [rsp+0x8]
0x00000001044a3936: xor r9d,DWORD PTR [rsp+0x4]
0x00000001044a393b: mov r10d,DWORD PTR [rsp+0x18]
0x00000001044a3940: xor r10d,r9d
0x00000001044a3943: xor r8d,r10d
0x00000001044a3946: xor ebx,r8d
0x00000001044a3949: xor ecx,ebx
0x00000001044a394b: xor edx,ecx
0x00000001044a394d: xor edi,edx
0x00000001044a394f: xor r13d,edi
0x00000001044a3952: mov r8d,DWORD PTR [rsp+0x24]
0x00000001044a3957: xor r8d,r13d ;*ixor
; - org.sample.Measure::normalIndex@34 (line 46)
0x00000001044a395a: add esi,0x10 ;*iinc
; - org.sample.Measure::normalIndex@36 (line 43)
0x00000001044a395d: cmp esi,DWORD PTR [rsp+0x20]
0x00000001044a3961: jl 0x00000001044a37c0 ;*if_icmpge
; - org.sample.Measure::normalIndex@12 (line 43)
Answer
The reason why the JITC tries to group everything together is rather unclear to me. AFAIK there are (were?) architectures for which grouping two loads leads to better performance (some early Pentium, I think).
As the JITC knows the hot spots, it can unroll much more aggressively than an ahead-of-time compiler, so it does it 16 times in this case. I can't see any clear advantage of that here, except for making the looping itself relatively cheaper. I also doubt that there's any architecture profiting from grouping 16 loads together.
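To see why 16 grouped loads are problematic, consider a hand-written sketch of the 16-way unrolling with the loads-first reordering (purely illustrative; the JIT's actual code differs). All sixteen temporaries are live at the same time before any of them is consumed, while x86-64 offers only 16 general-purpose registers, several of which are reserved, so some temporaries must go to stack slots:

```java
public class SixteenTemps {
    static int simple(int[] a) {
        int result = 0;
        for (int i = 0; i < a.length; i++) {
            result ^= a[i] + (i & (a.length - 1));
        }
        return result;
    }

    // 16-way unrolled, loads grouped first: t0..t15 are all alive at once
    // before the xor chain consumes them, which is what forces the spills.
    static int unrolled16(int[] a) {
        int m = a.length - 1;
        int result = 0;
        int i = 0;
        for (; i + 16 <= a.length; i += 16) {
            int t0  = a[i]      + ( i       & m);
            int t1  = a[i + 1]  + ((i + 1)  & m);
            int t2  = a[i + 2]  + ((i + 2)  & m);
            int t3  = a[i + 3]  + ((i + 3)  & m);
            int t4  = a[i + 4]  + ((i + 4)  & m);
            int t5  = a[i + 5]  + ((i + 5)  & m);
            int t6  = a[i + 6]  + ((i + 6)  & m);
            int t7  = a[i + 7]  + ((i + 7)  & m);
            int t8  = a[i + 8]  + ((i + 8)  & m);
            int t9  = a[i + 9]  + ((i + 9)  & m);
            int t10 = a[i + 10] + ((i + 10) & m);
            int t11 = a[i + 11] + ((i + 11) & m);
            int t12 = a[i + 12] + ((i + 12) & m);
            int t13 = a[i + 13] + ((i + 13) & m);
            int t14 = a[i + 14] + ((i + 14) & m);
            int t15 = a[i + 15] + ((i + 15) & m);
            result ^= t0 ^ t1 ^ t2 ^ t3 ^ t4 ^ t5 ^ t6 ^ t7
                    ^ t8 ^ t9 ^ t10 ^ t11 ^ t12 ^ t13 ^ t14 ^ t15;
        }
        for (; i < a.length; i++) { // tail
            result ^= a[i] + (i & m);
        }
        return result;
    }
}
```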
The code computes 16 temporary values, one per iteration of
int j = i & array.length-1;
int entry = array[i];
int tmp = entry + j;
result ^= tmp;
Each computation is pretty trivial: one AND, one LOAD, and one ADD. The values are to be mapped to registers, but there aren't enough of them. So the values have to be stored and loaded back later.
This happens for 7 out of the 16 values and increases the costs significantly.
I'm not really sure about verifying this using -XX:LoopUnrollLimit:
LoopUnrollLimit Benchmark Mean Mean error Units
8 ..normalIndex 0.902 0.004 ns/op
8 ..normalWithExitPoint 0.913 0.005 ns/op
8 ..maskedIndex 0.918 0.006 ns/op
8 ..maskedWithExitPoint 0.996 0.008 ns/op
16 ..normalIndex 0.769 0.003 ns/op
16 ..normalWithExitPoint 0.930 0.004 ns/op
16 ..maskedIndex 0.937 0.004 ns/op
16 ..maskedWithExitPoint 1.012 0.003 ns/op
32 ..normalIndex 0.814 0.003 ns/op
32 ..normalWithExitPoint 0.816 0.005 ns/op
32 ..maskedIndex 0.838 0.003 ns/op
32 ..maskedWithExitPoint 0.978 0.002 ns/op
- ..normalIndex 0.830 0.002 ns/op
- ..normalWithExitPoint 0.683 0.002 ns/op
- ..maskedIndex 0.791 0.005 ns/op
- ..maskedWithExitPoint 0.908 0.003 ns/op
The limit of 16 makes normalIndex the fastest variant, which indicates that I was right about the "overallocation penalty". But according to Marko, the generated assembly also changes with the unroll limit in other aspects, so things are more complicated.