Strange JIT pessimization of a loop idiom


Question

While analyzing the results of a recent question here, I encountered a quite peculiar phenomenon: apparently an extra layer of HotSpot's JIT-optimization actually slows down execution on my machine.

Here is the code I have used for the measurement:

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(Measure.ARRAY_SIZE)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Thread)
@Threads(1)
@Fork(2)
public class Measure
{
  public static final int ARRAY_SIZE = 1024;
  private final int[] array = new int[ARRAY_SIZE];

  @Setup public void setup() {
    final Random random = new Random();
    for (int i = 0; i < ARRAY_SIZE; ++i) {
      final int x = random.nextInt();
      array[i] = x == 0? 1 : x;
    }
  }

  @GenerateMicroBenchmark public int normalIndex() {
    final int[] array = this.array;
    int result = 0;
    for (int i = 0; i < array.length; i++) {
      final int j = i & array.length-1;
      final int entry = array[i];
      result ^= entry + j;
    }
    return result;
  }

  @GenerateMicroBenchmark public int maskedIndex() {
    final int[] array = this.array;
    int result = 0;
    for (int i = 0; i < array.length; i++) {
      final int j = i & array.length-1;
      final int entry = array[j];
      result ^= entry + i;
    }
    return result;
  }

  @GenerateMicroBenchmark public int normalWithExitPoint() {
    final int[] array = this.array;
    int result = 0;
    for (int i = 0; i < array.length; i++) {
      final int j = i & array.length-1;
      final int entry = array[i];
      result ^= entry + j;
      if (entry == 0) break;
    }
    return result;
  }

  @GenerateMicroBenchmark public int maskedWithExitPoint() {
    final int[] array = this.array;
    int result = 0;
    for (int i = 0; i < array.length; i++) {
      final int j = i & array.length-1;
      final int entry = array[j];
      result ^= entry + i;
      if (entry == 0) break;
    }
    return result;
  }


}

The code is quite subtle so let me point out the important bits:


  • the "normal index" variants use the straight variable i for array index. HotSpot can easily determine the range of i throughout the loop and eliminate the array bounds check;
  • the "masked index" variants index by j, which is actually equal to i, but that fact is "hidden" from HotSpot through the AND-masking operation;
  • the "with exit point" variants introduce an explicit loop exit point. The importance of this will be explained below.
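The masking identity that the "masked" variants rely on can be checked in isolation. This is a standalone sketch (not part of the benchmark): for a power-of-two length, `i & (length-1)` equals `i` whenever `0 <= i < length`, yet HotSpot does not track this equivalence through the AND.

```java
public class MaskDemo {
    public static void main(String[] args) {
        final int length = 1024; // a power of two, like ARRAY_SIZE
        for (int i = 0; i < length; i++) {
            final int j = i & (length - 1);
            // For 0 <= i < length the mask is an identity operation...
            if (j != i) throw new AssertionError(i + " != " + j);
        }
        // ...but the JIT cannot prove that, so array[j] keeps its bounds check.
        System.out.println("mask is identity for all i in [0, " + length + ")");
    }
}
```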

Observe that the bounds check figures in two important ways:


  • it has runtime overhead associated with it (a comparison followed by a conditional branch);
  • it constitutes a loop exit point which can interrupt the loop on any step. This turns out to have important consequences for the applicable JIT optimizations.
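Conceptually, each `a[j]` access whose index range HotSpot cannot prove safe behaves like the following Java-level sketch of the implicit check (an illustration of the semantics, not actual HotSpot output):

```java
public class BoundsCheckSketch {
    // What `a[j]` means semantically when the range of j is unknown:
    static int checkedLoad(int[] a, int j) {
        if (j < 0 || j >= a.length) {                    // the comparison + conditional branch
            throw new ArrayIndexOutOfBoundsException(j); // the extra loop exit point
        }
        return a[j];
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30};
        System.out.println(checkedLoad(a, 2)); // prints 30
    }
}
```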

By inspecting the machine code emitted for the above four methods, I have noticed the following:


  • in all cases the loop is unrolled;
  • in the case of normalIndex, which is distinguished as the only one having no premature loop exit points, the operations of all the unrolled steps are reordered such that first all the array fetches are performed, followed by XOR-ing all the values into the accumulator.
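In Java terms, the reordering applied to normalIndex looks roughly like this hand-unrolled sketch (4-way here for brevity; HotSpot actually unrolls 16-way): all loads are hoisted to the top of the unrolled body, and only then are the values combined into the accumulator. With a 16-way unroll this schedule needs 16 loaded values live at once, more than the freely usable x86-64 registers, which matters later.

```java
public class UnrollSketch {
    // The straightforward loop, as written in the benchmark.
    static int reference(int[] a) {
        int result = 0;
        for (int i = 0; i < a.length; i++) {
            result ^= a[i] + (i & (a.length - 1));
        }
        return result;
    }

    // 4-way unrolled "all loads first, then all combines" schedule,
    // assuming a.length is a multiple of 4 (true for ARRAY_SIZE = 1024).
    static int loadAllThenCombine(int[] a) {
        int result = 0;
        for (int i = 0; i < a.length; i += 4) {
            final int e0 = a[i], e1 = a[i + 1], e2 = a[i + 2], e3 = a[i + 3]; // all loads
            result ^= e0 + i;        // j == i here, since a.length is a power of two
            result ^= e1 + (i + 1);
            result ^= e2 + (i + 2);
            result ^= e3 + (i + 3);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = new int[1024];
        java.util.Random r = new java.util.Random(42);
        for (int i = 0; i < a.length; i++) a[i] = r.nextInt();
        System.out.println(reference(a) == loadAllThenCombine(a)); // prints true
    }
}
```

Both schedules compute the same XOR of the same terms; only the order of loads relative to combines differs, which is exactly the degree of freedom the JIT exercises.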

Now we can classify the four methods according to the discussed features:


  • normalIndex has no bounds checks and no loop exit points;
  • normalWithExitPoint has no bounds checks and 1 exit point;
  • maskedIndex has 1 bounds check and 1 exit point;
  • maskedWithExitPoint has 1 bounds check and 2 exit points.

The obvious expectation is that the above list should present the methods in descending order of performance; however, these are my actual results:

Benchmark               Mode   Samples         Mean   Mean error    Units
normalIndex             avgt        20        0.946        0.010    ns/op
normalWithExitPoint     avgt        20        0.807        0.010    ns/op
maskedIndex             avgt        20        0.803        0.007    ns/op
maskedWithExitPoint     avgt        20        1.007        0.009    ns/op




    • normalWithExitPoint and maskedIndex are identical modulo measurement error, even though only the latter has the bounds check;
    • the greatest anomaly is observed on normalIndex which should have been the fastest, but is significantly slower than normalWithExitPoint, identical to it in every respect save for having one more line of code, the one which introduces the exit point.
      Since normalIndex is the only method which has the extra reordering "optimization" applied to it, the conclusion is that this is the cause of the slowdown.

      I was testing on:

      • Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode) (Java 7 Update 40)
      • OS X Version 10.9.1
      • 2.66 GHz Intel Core i7

      I have successfully reproduced the results on Java 8 EA b118 as well.

      Is the above phenomenon reproducible on other, similar machines? From the question mentioned at the beginning I already have a hint that at least some machines do not reproduce it, so another result from the same CPU would be very interesting.

      I have gathered the following table which correlates execution times with the -XX:LoopUnrollLimit command-line argument. Here I focus only on two variants, with and without the if (entry == 0) break; line:

      LoopUnrollLimit:   14 15 18 19 22 23 60
      withExitPoint:     96 95 95 79 80 80 69   1/100 ns
      withoutExitPoint:  94 64 64 63 64 77 75   1/100 ns
      

      The following sudden changes can be observed:


      • on the transition from 14 to 15, the withoutExitPoint variant receives a beneficial LCM1 transformation and is significantly sped up. Due to the loop-unrolling limit, all loaded values fit into registers;

      • on 18->19, the withExitPoint variant receives a speedup, smaller than the above;

      • on 22->23, the withoutExitPoint variant is slowed down. At this point I see the spill-over into stack locations, as described in maaartinus's answer, starting to happen.

      The default loopUnrollLimit for my setup is 60, so I present its results in the last column.
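As a side note (not part of the original answer): the flag's current value can also be read programmatically through the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean. This sketch assumes a HotSpot JVM with the C2 compiler and falls back gracefully elsewhere:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class ShowUnrollLimit {
    public static void main(String[] args) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // getVMOption throws IllegalArgumentException if the flag
            // does not exist on this VM (e.g. no C2 compiler).
            System.out.println("LoopUnrollLimit = "
                + bean.getVMOption("LoopUnrollLimit").getValue());
        } catch (IllegalArgumentException | NullPointerException e) {
            System.out.println("LoopUnrollLimit not readable on this JVM");
        }
    }
}
```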


      1 LCM = Local Code Motion. It is the transformation which causes all array accesses to happen at the top, followed by the processing of the loaded values.

      https://bugs.openjdk.java.net/browse/JDK-7101232

      0x00000001044a37c0: mov    ecx,eax
      0x00000001044a37c2: and    ecx,esi            ;*iand
                                                    ; - org.sample.Measure::normalIndex@20 (line 44)
      0x00000001044a37c4: mov    rbp,QWORD PTR [rsp+0x28]  ;*iload_3
                                                    ; - org.sample.Measure::normalIndex@15 (line 44)
      0x00000001044a37c9: add    ecx,DWORD PTR [rbp+rsi*4+0x10]
      0x00000001044a37cd: xor    ecx,r8d
      0x00000001044a37d0: mov    DWORD PTR [rsp],ecx
      0x00000001044a37d3: mov    r10d,esi
      0x00000001044a37d6: add    r10d,0xf
      0x00000001044a37da: and    r10d,eax
      0x00000001044a37dd: mov    r8d,esi
      0x00000001044a37e0: add    r8d,0x7
      0x00000001044a37e4: and    r8d,eax
      0x00000001044a37e7: mov    DWORD PTR [rsp+0x4],r8d
      0x00000001044a37ec: mov    r11d,esi
      0x00000001044a37ef: add    r11d,0x6
      0x00000001044a37f3: and    r11d,eax
      0x00000001044a37f6: mov    DWORD PTR [rsp+0x8],r11d
      0x00000001044a37fb: mov    r8d,esi
      0x00000001044a37fe: add    r8d,0x5
      0x00000001044a3802: and    r8d,eax
      0x00000001044a3805: mov    DWORD PTR [rsp+0xc],r8d
      0x00000001044a380a: mov    r11d,esi
      0x00000001044a380d: inc    r11d
      0x00000001044a3810: and    r11d,eax
      0x00000001044a3813: mov    DWORD PTR [rsp+0x10],r11d
      0x00000001044a3818: mov    r8d,esi
      0x00000001044a381b: add    r8d,0x2
      0x00000001044a381f: and    r8d,eax
      0x00000001044a3822: mov    DWORD PTR [rsp+0x14],r8d
      0x00000001044a3827: mov    r11d,esi
      0x00000001044a382a: add    r11d,0x3
      0x00000001044a382e: and    r11d,eax
      0x00000001044a3831: mov    r9d,esi
      0x00000001044a3834: add    r9d,0x4
      0x00000001044a3838: and    r9d,eax
      0x00000001044a383b: mov    r8d,esi
      0x00000001044a383e: add    r8d,0x8
      0x00000001044a3842: and    r8d,eax
      0x00000001044a3845: mov    DWORD PTR [rsp+0x18],r8d
      0x00000001044a384a: mov    r8d,esi
      0x00000001044a384d: add    r8d,0x9
      0x00000001044a3851: and    r8d,eax
      0x00000001044a3854: mov    ebx,esi
      0x00000001044a3856: add    ebx,0xa
      0x00000001044a3859: and    ebx,eax
      0x00000001044a385b: mov    ecx,esi
      0x00000001044a385d: add    ecx,0xb
      0x00000001044a3860: and    ecx,eax
      0x00000001044a3862: mov    edx,esi
      0x00000001044a3864: add    edx,0xc
      0x00000001044a3867: and    edx,eax
      0x00000001044a3869: mov    edi,esi
      0x00000001044a386b: add    edi,0xd
      0x00000001044a386e: and    edi,eax
      0x00000001044a3870: mov    r13d,esi
      0x00000001044a3873: add    r13d,0xe
      0x00000001044a3877: and    r13d,eax
      0x00000001044a387a: movsxd r14,esi
      0x00000001044a387d: add    r10d,DWORD PTR [rbp+r14*4+0x4c]
      0x00000001044a3882: mov    DWORD PTR [rsp+0x24],r10d
      0x00000001044a3887: mov    QWORD PTR [rsp+0x28],rbp
      0x00000001044a388c: mov    ebp,DWORD PTR [rsp+0x4]
      0x00000001044a3890: mov    r10,QWORD PTR [rsp+0x28]
      0x00000001044a3895: add    ebp,DWORD PTR [r10+r14*4+0x2c]
      0x00000001044a389a: mov    DWORD PTR [rsp+0x4],ebp
      0x00000001044a389e: mov    r10d,DWORD PTR [rsp+0x8]
      0x00000001044a38a3: mov    rbp,QWORD PTR [rsp+0x28]
      0x00000001044a38a8: add    r10d,DWORD PTR [rbp+r14*4+0x28]
      0x00000001044a38ad: mov    DWORD PTR [rsp+0x8],r10d
      0x00000001044a38b2: mov    r10d,DWORD PTR [rsp+0xc]
      0x00000001044a38b7: add    r10d,DWORD PTR [rbp+r14*4+0x24]
      0x00000001044a38bc: mov    DWORD PTR [rsp+0xc],r10d
      0x00000001044a38c1: mov    r10d,DWORD PTR [rsp+0x10]
      0x00000001044a38c6: add    r10d,DWORD PTR [rbp+r14*4+0x14]
      0x00000001044a38cb: mov    DWORD PTR [rsp+0x10],r10d
      0x00000001044a38d0: mov    r10d,DWORD PTR [rsp+0x14]
      0x00000001044a38d5: add    r10d,DWORD PTR [rbp+r14*4+0x18]
      0x00000001044a38da: mov    DWORD PTR [rsp+0x14],r10d
      0x00000001044a38df: add    r13d,DWORD PTR [rbp+r14*4+0x48]
      0x00000001044a38e4: add    r11d,DWORD PTR [rbp+r14*4+0x1c]
      0x00000001044a38e9: add    r9d,DWORD PTR [rbp+r14*4+0x20]
      0x00000001044a38ee: mov    r10d,DWORD PTR [rsp+0x18]
      0x00000001044a38f3: add    r10d,DWORD PTR [rbp+r14*4+0x30]
      0x00000001044a38f8: mov    DWORD PTR [rsp+0x18],r10d
      0x00000001044a38fd: add    r8d,DWORD PTR [rbp+r14*4+0x34]
      0x00000001044a3902: add    ebx,DWORD PTR [rbp+r14*4+0x38]
      0x00000001044a3907: add    ecx,DWORD PTR [rbp+r14*4+0x3c]
      0x00000001044a390c: add    edx,DWORD PTR [rbp+r14*4+0x40]
      0x00000001044a3911: add    edi,DWORD PTR [rbp+r14*4+0x44]
      0x00000001044a3916: mov    r10d,DWORD PTR [rsp+0x10]
      0x00000001044a391b: xor    r10d,DWORD PTR [rsp]
      0x00000001044a391f: mov    ebp,DWORD PTR [rsp+0x14]
      0x00000001044a3923: xor    ebp,r10d
      0x00000001044a3926: xor    r11d,ebp
      0x00000001044a3929: xor    r9d,r11d
      0x00000001044a392c: xor    r9d,DWORD PTR [rsp+0xc]
      0x00000001044a3931: xor    r9d,DWORD PTR [rsp+0x8]
      0x00000001044a3936: xor    r9d,DWORD PTR [rsp+0x4]
      0x00000001044a393b: mov    r10d,DWORD PTR [rsp+0x18]
      0x00000001044a3940: xor    r10d,r9d
      0x00000001044a3943: xor    r8d,r10d
      0x00000001044a3946: xor    ebx,r8d
      0x00000001044a3949: xor    ecx,ebx
      0x00000001044a394b: xor    edx,ecx
      0x00000001044a394d: xor    edi,edx
      0x00000001044a394f: xor    r13d,edi
      0x00000001044a3952: mov    r8d,DWORD PTR [rsp+0x24]
      0x00000001044a3957: xor    r8d,r13d           ;*ixor
                                                    ; - org.sample.Measure::normalIndex@34 (line 46)
      0x00000001044a395a: add    esi,0x10           ;*iinc
                                                    ; - org.sample.Measure::normalIndex@36 (line 43)
      0x00000001044a395d: cmp    esi,DWORD PTR [rsp+0x20]
      0x00000001044a3961: jl     0x00000001044a37c0  ;*if_icmpge
                                                    ; - org.sample.Measure::normalIndex@12 (line 43)
      


      Answer

      The reason why the JITC tries to group everything together is rather unclear to me. AFAIK there are (were?) architectures for which grouping of two loads leads to a better performance (some early Pentium, I think).

      As JITC knows the hot spots, it can inline much more aggressively than an ahead-of-time compiler, so it does it 16 times in this case. I can't see any clear advantage thereof here, except for making the looping relatively cheaper. I also doubt that there's any architecture profiting from grouping 16 loads together.

      The code computes 16 temporary values, one per iteration of

      int j = i & array.length-1;
      int entry = array[i];
      int tmp = entry + j;
      result ^= tmp;
      

      Each computation is pretty trivial, one AND, one LOAD, and one ADD. The values are to be mapped to registers, but there aren't enough of them. So the values have to be stored and loaded later.

      This happens for 7 out of the 16 registers and increases the costs significantly.

      I was not quite sure about this, so I verified it using -XX:LoopUnrollLimit:

      LoopUnrollLimit Benchmark   Mean   Mean error    Units
      
       8 ..normalIndex           0.902        0.004    ns/op
       8 ..normalWithExitPoint   0.913        0.005    ns/op
       8 ..maskedIndex           0.918        0.006    ns/op
       8 ..maskedWithExitPoint   0.996        0.008    ns/op
      
      16 ..normalIndex           0.769        0.003    ns/op
      16 ..normalWithExitPoint   0.930        0.004    ns/op
      16 ..maskedIndex           0.937        0.004    ns/op
      16 ..maskedWithExitPoint   1.012        0.003    ns/op
      
      32 ..normalIndex           0.814        0.003    ns/op
      32 ..normalWithExitPoint   0.816        0.005    ns/op
      32 ..maskedIndex           0.838        0.003    ns/op
      32 ..maskedWithExitPoint   0.978        0.002    ns/op
      
       - ..normalIndex           0.830        0.002    ns/op
       - ..normalWithExitPoint   0.683        0.002    ns/op
       - ..maskedIndex           0.791        0.005    ns/op
       - ..maskedWithExitPoint   0.908        0.003    ns/op
      

      The limit of 16 makes normalIndex the fastest variant, which indicates that I was right with the "overallocation penalty". But according to Marko, the generated assembly changes with the unroll limit also in other aspects, so things are more complicated.

