Java for-loop optimization


Question

I made some runtime tests with Java for loops and noticed some strange behaviour. For my code I need wrapper objects for primitive types like int, double and so on, to simulate input and output parameters, but that's not the point. Just watch my code. How can objects with field access be faster than primitive types?

For loop with primitive type:

public static void main(String[] args) {
    double max = 1000;
    for (int j = 1; j < 8; j++) {
        double i;
        max = max * 10;
        long start = System.nanoTime();
        for (i = 0; i < max; i++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime primitive(max: ="+max + "): " + microseconds);
    }
}

Results:

MicroTime primitive(max: =10000.0): 110
MicroTime primitive(max: =100000.0): 1081
MicroTime primitive(max: =1000000.0): 2450
MicroTime primitive(max: =1.0E7): 28248
MicroTime primitive(max: =1.0E8): 276205
MicroTime primitive(max: =1.0E9): 2729824
MicroTime primitive(max: =1.0E10): 27547009

for loop with simple type (wrapper object):

public static void main(String[] args) {
    HDouble max = new HDouble();
    max.value = 1000;
    for (int j = 1; j < 8; j++) {
        HDouble i = new HDouble();
        max.value = max.value * 10;
        long start = System.nanoTime();
        for (i.value = 0; i.value < max.value; i.value++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime wrapper(max: ="+max.value + "): " + microseconds);
    }
}

Results:

MicroTime wrapper(max: =10000.0): 157
MicroTime wrapper(max: =100000.0): 1561
MicroTime wrapper(max: =1000000.0): 3174
MicroTime wrapper(max: =1.0E7): 15630
MicroTime wrapper(max: =1.0E8): 155471
MicroTime wrapper(max: =1.0E9): 1520967
MicroTime wrapper(max: =1.0E10): 15373311

The more iterations, the faster the second code is. But why? I know that the Java compiler and the JVM optimize my code, but I never thought that primitive types could be slower than objects with field access.
Does anyone have a plausible explanation for this?

The HDouble class:

public class HDouble {
    public double value;

    public HDouble() {
    }

    public HDouble(double value) {
        this.value = value;
    }

    @Override
    public String toString() {
        return String.valueOf(value);
    }
}

I also tested my loops with code in them. For example, I calculate the sum -> same behaviour (the difference is not that big, but I thought the primitive algorithm would have to be much faster?). At first I thought that the calculation takes so long that the field access makes nearly no difference.

Wrapper loop:

for (i.value = 0; i.value < max.value; i.value++) {
    sum.value = sum.value + i.value;
}

Results:

MicroTime wrapper(max: =10000.0): 243
MicroTime wrapper(max: =100000.0): 2805
MicroTime wrapper(max: =1000000.0): 3409
MicroTime wrapper(max: =1.0E7): 28104
MicroTime wrapper(max: =1.0E8): 278432
MicroTime wrapper(max: =1.0E9): 2678322
MicroTime wrapper(max: =1.0E10): 26665540

Primitive for loop:

for (i = 0; i < max; i++) {
    sum = sum + i;
}

Results:

MicroTime primitive(max: =10000.0): 149
MicroTime primitive(max: =100000.0): 1996
MicroTime primitive(max: =1000000.0): 2289
MicroTime primitive(max: =1.0E7): 27085
MicroTime primitive(max: =1.0E8): 279939
MicroTime primitive(max: =1.0E9): 2759133
MicroTime primitive(max: =1.0E10): 27369724

Answer

It's so easy to get fooled by hand-made microbenchmarks - you never know what they actually measure. That's why there are special tools like JMH (a minimal JMH sketch is included at the end of this answer). But let's analyze what happens to the primitive hand-made benchmark:

static class HDouble {
    double value;
}

public static void main(String[] args) {
    primitive();
    wrapper();
}

public static void primitive() {
    long start = System.nanoTime();
    for (double d = 0; d < 1000000000; d++) {
    }
    long end = System.nanoTime();
    System.out.printf("Primitive: %.3f s
", (end - start) / 1e9);
}

public static void wrapper() {
    HDouble d = new HDouble();
    long start = System.nanoTime();
    for (d.value = 0; d.value < 1000000000; d.value++) {
    }
    long end = System.nanoTime();
    System.out.printf("Wrapper:   %.3f s
", (end - start) / 1e9);
}

The results are somewhat similar to yours:

Primitive: 3.618 s
Wrapper:   1.380 s

Now repeat the test several times:

public static void main(String[] args) {
    for (int i = 0; i < 5; i++) {
        primitive();
        wrapper();
    }
}

It gets more interesting:

Primitive: 3.661 s
Wrapper:   1.382 s
Primitive: 3.461 s
Wrapper:   1.380 s
Primitive: 1.376 s <-- starting from 3rd iteration
Wrapper:   1.381 s <-- the timings become equal
Primitive: 1.371 s
Wrapper:   1.372 s
Primitive: 1.379 s
Wrapper:   1.378 s

Looks like both methods finally got optimized. Run it once again, now logging JIT compiler activity: -XX:-TieredCompilation -XX:CompileOnly=Test -XX:+PrintCompilation

    136    1 %           Test::primitive @ 6 (53 bytes)
   3725    1 %           Test::primitive @ -2 (53 bytes)   made not entrant
Primitive: 3.589 s
   3748    2 %           Test::wrapper @ 17 (73 bytes)
   5122    2 %           Test::wrapper @ -2 (73 bytes)   made not entrant
Wrapper:   1.374 s
   5122    3             Test::primitive (53 bytes)
   5124    4 %           Test::primitive @ 6 (53 bytes)
Primitive: 3.421 s
   8544    5             Test::wrapper (73 bytes)
   8547    6 %           Test::wrapper @ 17 (73 bytes)
Wrapper:   1.378 s
Primitive: 1.372 s
Wrapper:   1.375 s
Primitive: 1.378 s
Wrapper:   1.373 s
Primitive: 1.375 s
Wrapper:   1.378 s

Note the % sign in the compilation log on the first iteration. It means that the methods were compiled in OSR (on-stack replacement) mode. During the second iteration the methods were recompiled in normal mode. From the third iteration on, there was no difference between primitive and wrapper in execution speed.

What you've actually measured is the performance of the OSR stub. It is usually not related to the real performance of an application, and you shouldn't care much about it.
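
If the goal is to time the regularly JIT-compiled loop rather than the OSR stub, one way to restructure the hand-made benchmark is to move the loop into its own method and warm it up before measuring. This is a rough sketch of that idea, not part of the original answer; the class and method names are made up for illustration:

public class LoopBenchmark {

    // The loop under test lives in its own method so the JIT can compile
    // the whole method normally instead of patching a running loop via
    // on-stack replacement.
    static double primitiveLoop(double max) {
        double sum = 0;
        for (double i = 0; i < max; i++) {
            sum += i;
        }
        return sum;   // returning the result keeps the loop from being eliminated
    }

    public static void main(String[] args) {
        // Warm-up: let the JIT compile primitiveLoop before timing it.
        for (int i = 0; i < 20; i++) {
            primitiveLoop(1_000_000);
        }

        // Measured runs now use the already-compiled method.
        for (int run = 0; run < 5; run++) {
            long start = System.nanoTime();
            double sum = primitiveLoop(100_000_000);
            long end = System.nanoTime();
            System.out.printf("Run %d: %.3f s (sum=%s)%n",
                    run, (end - start) / 1e9, sum);
        }
    }
}

Because primitiveLoop has already been compiled as a whole method during the warm-up calls, the timed runs no longer go through the interpreter or an OSR transition.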

But the question still remains: why is the OSR stub for a wrapper compiled better than for a primitive variable? To find that out, we need to get down to the generated assembly code:
-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

I'll omit all irrelevant code, leaving only the compiled loop.

Primitive:

0x00000000023e90d0: vmovsd 0x28(%rsp),%xmm1      <-- load double from the stack
0x00000000023e90d6: vaddsd -0x7e(%rip),%xmm1,%xmm1
0x00000000023e90de: test   %eax,-0x21f90e4(%rip)
0x00000000023e90e4: vmovsd %xmm1,0x28(%rsp)      <-- store to the stack
0x00000000023e90ea: vucomisd 0x28(%rsp),%xmm0    <-- compare with the stack value
0x00000000023e90f0: ja     0x00000000023e90d0

Wrapper:

0x00000000023ebe90: vaddsd -0x78(%rip),%xmm0,%xmm0
0x00000000023ebe98: vmovsd %xmm0,0x10(%rbx)      <-- store to the object field
0x00000000023ebe9d: test   %eax,-0x21fbea3(%rip)
0x00000000023ebea3: vucomisd %xmm0,%xmm1         <-- compare registers
0x00000000023ebea7: ja     0x00000000023ebe90

As you can see, the 'primitive' case makes a number of loads and stores to a stack location, while the 'wrapper' case does mostly in-register operations. It is quite understandable why the OSR stub refers to the stack: in interpreted mode, local variables are stored on the stack, and the OSR stub is made compatible with this interpreted frame. In the 'wrapper' case the value is stored on the heap, and the reference to the object is already cached in a register.
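
As mentioned at the start of the answer, a comparison like this is better written with JMH than by hand. A minimal sketch of what that could look like (assuming the JMH dependency and annotation processor are on the classpath; the benchmark class and its contents are illustrative, not from the original post):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class LoopJmhBenchmark {

    // Same shape as the HDouble wrapper from the question.
    static class HDouble {
        double value;
    }

    static final double MAX = 100_000_000;

    final HDouble wrapperCounter = new HDouble();

    @Benchmark
    public double primitiveLoop() {
        double sum = 0;
        for (double i = 0; i < MAX; i++) {
            sum += i;
        }
        return sum;   // returning the value prevents dead-code elimination
    }

    @Benchmark
    public double wrapperLoop() {
        double sum = 0;
        for (wrapperCounter.value = 0; wrapperCounter.value < MAX; wrapperCounter.value++) {
            sum += wrapperCounter.value;
        }
        return sum;
    }
}

JMH handles warm-up iterations and forked JVMs for you, so the OSR effect described above does not skew the comparison; how you build and run the benchmark jar depends on how the project is set up.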
