Java for-loop optimization


Problem description


I made some runtime tests with Java for loops and noticed some strange behaviour. For my code I need wrapper objects for primitive types like int, double and so on, to simulate input and output parameters, but that's not the point. Just look at my code. How can objects with field access be faster than primitive types?

for loop with primitive type:

public static void main(String[] args) {
    double max = 1000;
    for (int j = 1; j < 8; j++) {
        double i;
        max = max * 10;
        long start = System.nanoTime();
        for (i = 0; i < max; i++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime primitive(max: ="+max + "): " + microseconds);
    }
}

Result:

MicroTime primitive(max: =10000.0): 110
MicroTime primitive(max: =100000.0): 1081
MicroTime primitive(max: =1000000.0): 2450
MicroTime primitive(max: =1.0E7): 28248
MicroTime primitive(max: =1.0E8): 276205
MicroTime primitive(max: =1.0E9): 2729824
MicroTime primitive(max: =1.0E10): 27547009

for loop with wrapper object:

public static void main(String[] args) {
    HDouble max = new HDouble();
    max.value = 1000;
    for (int j = 1; j < 8; j++) {
        HDouble i = new HDouble();
        max.value = max.value*10;
        long start = System.nanoTime();
        for (i.value = 0; i.value <max.value; i.value++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime wrapper(max: ="+max.value + "): " + microseconds);
    }
}

Result:

MicroTime wrapper(max: =10000.0): 157
MicroTime wrapper(max: =100000.0): 1561
MicroTime wrapper(max: =1000000.0): 3174
MicroTime wrapper(max: =1.0E7): 15630
MicroTime wrapper(max: =1.0E8): 155471
MicroTime wrapper(max: =1.0E9): 1520967
MicroTime wrapper(max: =1.0E10): 15373311

The more iterations, the faster the second code gets. But why? I know that the Java compiler and the JVM optimize my code, but I never thought that primitive types could be slower than objects with field access.
Does anyone have a plausible explanation for this?

Edit: the HDouble class:

public class HDouble {
    public double value;

    public HDouble() {
    }

    public HDouble(double value) {
        this.value = value;
    }

    @Override
    public String toString() {
        return String.valueOf(value);
    }
}
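To make the motivation concrete, here is a hypothetical use of such a holder as an "out" parameter (the tryParse name and logic are my own invention, not from the question): Java has no out parameters for primitives, so a mutable wrapper carries a result back to the caller while the return value reports success.

```java
// HDouble as defined in the question (abbreviated).
class HDouble {
    public double value;

    @Override
    public String toString() {
        return String.valueOf(value);
    }
}

public class OutParamDemo {
    // Simulates an "out" parameter: the parsed number is written through the
    // holder, while the boolean return value reports success or failure.
    static boolean tryParse(String s, HDouble out) {
        try {
            out.value = Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        HDouble result = new HDouble();
        if (tryParse("3.14", result)) {
            System.out.println("parsed: " + result);
        }
    }
}
```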

I also tested my loops with code inside them. For example, I calculated the sum -> same behaviour (the difference is not that big, but I thought the primitive version would have to be much faster?). At first I thought that the calculation takes so long that the field access makes nearly no difference.

Wrapper for-loop:

for (i.value = 0; i.value <max.value; i.value++) {
    sum.value = sum.value + i.value;
}

Result:

MicroTime wrapper(max: =10000.0): 243
MicroTime wrapper(max: =100000.0): 2805
MicroTime wrapper(max: =1000000.0): 3409
MicroTime wrapper(max: =1.0E7): 28104
MicroTime wrapper(max: =1.0E8): 278432
MicroTime wrapper(max: =1.0E9): 2678322
MicroTime wrapper(max: =1.0E10): 26665540

Primitive for-loop:

for (i = 0; i < max; i++) {
    sum = sum + i;
}

Result:

MicroTime primitive(max: =10000.0): 149
MicroTime primitive(max: =100000.0): 1996
MicroTime primitive(max: =1000000.0): 2289
MicroTime primitive(max: =1.0E7): 27085
MicroTime primitive(max: =1.0E8): 279939
MicroTime primitive(max: =1.0E9): 2759133
MicroTime primitive(max: =1.0E10): 27369724

Solution

It's so easy to get fooled by hand-made microbenchmarks - you never know what they actually measure. That's why there are special tools like JMH. But let's analyze what happens with this hand-made benchmark:

static class HDouble {
    double value;
}

public static void main(String[] args) {
    primitive();
    wrapper();
}

public static void primitive() {
    long start = System.nanoTime();
    for (double d = 0; d < 1000000000; d++) {
    }
    long end = System.nanoTime();
    System.out.printf("Primitive: %.3f s\n", (end - start) / 1e9);
}

public static void wrapper() {
    HDouble d = new HDouble();
    long start = System.nanoTime();
    for (d.value = 0; d.value < 1000000000; d.value++) {
    }
    long end = System.nanoTime();
    System.out.printf("Wrapper:   %.3f s\n", (end - start) / 1e9);
}

The results are somewhat similar to yours:

Primitive: 3.618 s
Wrapper:   1.380 s

Now repeat the test several times:

public static void main(String[] args) {
    for (int i = 0; i < 5; i++) {
        primitive();
        wrapper();
    }
}

It gets more interesting:

Primitive: 3.661 s
Wrapper:   1.382 s
Primitive: 3.461 s
Wrapper:   1.380 s
Primitive: 1.376 s <-- starting from 3rd iteration
Wrapper:   1.381 s <-- the timings become equal
Primitive: 1.371 s
Wrapper:   1.372 s
Primitive: 1.379 s
Wrapper:   1.378 s

It looks like both methods finally got optimized. Run it once again, now logging JIT compiler activity: -XX:-TieredCompilation -XX:CompileOnly=Test -XX:+PrintCompilation

    136    1 %           Test::primitive @ 6 (53 bytes)
   3725    1 %           Test::primitive @ -2 (53 bytes)   made not entrant
Primitive: 3.589 s
   3748    2 %           Test::wrapper @ 17 (73 bytes)
   5122    2 %           Test::wrapper @ -2 (73 bytes)   made not entrant
Wrapper:   1.374 s
   5122    3             Test::primitive (53 bytes)
   5124    4 %           Test::primitive @ 6 (53 bytes)
Primitive: 3.421 s
   8544    5             Test::wrapper (73 bytes)
   8547    6 %           Test::wrapper @ 17 (73 bytes)
Wrapper:   1.378 s
Primitive: 1.372 s
Wrapper:   1.375 s
Primitive: 1.378 s
Wrapper:   1.373 s
Primitive: 1.375 s
Wrapper:   1.378 s

Note the % sign in the compilation log on the first iteration. It means that the methods were compiled in OSR (on-stack replacement) mode. During the second iteration the methods were recompiled in normal mode, and from the third iteration onward there was no difference between primitive and wrapper in execution speed.

What you've actually measured is the performance of the OSR stub. It is usually not related to the real performance of an application, and you shouldn't care much about it.
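A sketch of mine (not part of the original answer): one common way to keep the OSR stub out of the measurement is to move the measured loop into its own method and call that method repeatedly before the timed run, so that HotSpot compiles it through its normal entry point first.

```java
// Warm-up calls let HotSpot compile sumLoop() in normal (non-OSR) mode, so the
// timed run below executes the regularly compiled method, not an OSR stub.
public class WarmupDemo {
    // The work under test: the summing loop from the question, over primitives.
    static double sumLoop(double max) {
        double sum = 0;
        for (double i = 0; i < max; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Warm-up: enough invocations to trigger normal JIT compilation.
        double sink = 0;
        for (int i = 0; i < 20_000; i++) {
            sink += sumLoop(1_000);
        }
        // Timed run: measures the compiled method entry, not an OSR transition.
        long start = System.nanoTime();
        double result = sumLoop(100_000_000);
        long end = System.nanoTime();
        System.out.printf("sum=%.1f, time=%.3f s (sink=%.0f)%n",
                result, (end - start) / 1e9, sink);
    }
}
```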

But the question still remains: why is the OSR stub for a wrapper compiled better than for a primitive variable? To find out, we need to get down to the generated assembly code:
-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
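A practical note of mine, not part of the original answer: -XX:+PrintAssembly only produces output when the hsdis disassembler plugin is installed on the JVM's library path; without it, HotSpot prints a warning and disables the flag. Assuming the benchmark class is named Test, as in the compilation log above, the invocation would look like:

```shell
# Requires the hsdis disassembler plugin; flags as given in the answer.
java -XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Test
```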

I'll omit all irrelevant code, leaving only the compiled loop.

Primitive:

0x00000000023e90d0: vmovsd 0x28(%rsp),%xmm1      <-- load double from the stack
0x00000000023e90d6: vaddsd -0x7e(%rip),%xmm1,%xmm1
0x00000000023e90de: test   %eax,-0x21f90e4(%rip)
0x00000000023e90e4: vmovsd %xmm1,0x28(%rsp)      <-- store to the stack
0x00000000023e90ea: vucomisd 0x28(%rsp),%xmm0    <-- compare with the stack value
0x00000000023e90f0: ja     0x00000000023e90d0

Wrapper:

0x00000000023ebe90: vaddsd -0x78(%rip),%xmm0,%xmm0
0x00000000023ebe98: vmovsd %xmm0,0x10(%rbx)      <-- store to the object field
0x00000000023ebe9d: test   %eax,-0x21fbea3(%rip)
0x00000000023ebea3: vucomisd %xmm0,%xmm1         <-- compare registers
0x00000000023ebea7: ja     0x00000000023ebe90

As you can see, the 'primitive' case makes a number of loads and stores to a stack location, while the 'wrapper' case does mostly in-register operations. It is quite understandable why the OSR stub refers to the stack: in interpreted mode, local variables are stored on the stack, and the OSR stub is made compatible with that interpreted frame. In the 'wrapper' case the value is stored on the heap, and the reference to the object is already cached in a register.
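As a closing aside of mine: the JMH tool mentioned at the top of the answer handles warm-up, OSR, and dead-code elimination automatically. A minimal sketch of the same comparison might look like the following (names are my own; it requires the org.openjdk.jmh dependency and a JMH runner, so it is not runnable standalone):

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class LoopBench {
    static class HDouble {
        double value;
    }

    @Benchmark
    public double primitiveLoop() {
        double sum = 0;
        for (double i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        return sum; // returning the result keeps the JIT from removing the loop
    }

    @Benchmark
    public double wrapperLoop() {
        HDouble i = new HDouble();
        double sum = 0;
        for (i.value = 0; i.value < 1_000_000; i.value++) {
            sum += i.value;
        }
        return sum;
    }
}
```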
