Java for-loop optimization


Problem description


I made some runtime tests with Java for loops and noticed some strange behaviour. For my code I need wrapper objects for primitive types like int, double and so on, to simulate input and output parameters, but that's not the point. Just look at my code. How can objects with field access be faster than primitive types?

for loop with primitive type:

public static void main(String[] args) {
    double max = 1000;
    for (int j = 1; j < 8; j++) {
        double i;
        max = max * 10;
        long start = System.nanoTime();
        for (i = 0; i < max; i++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime primitive(max: ="+max + "): " + microseconds);
    }
}

Result:

MicroTime primitive(max: =10000.0): 110
MicroTime primitive(max: =100000.0): 1081
MicroTime primitive(max: =1000000.0): 2450
MicroTime primitive(max: =1.0E7): 28248
MicroTime primitive(max: =1.0E8): 276205
MicroTime primitive(max: =1.0E9): 2729824
MicroTime primitive(max: =1.0E10): 27547009

for loop with wrapper object:

public static void main(String[] args) {
    HDouble max = new HDouble();
    max.value = 1000;
    for (int j = 1; j < 8; j++) {
        HDouble i = new HDouble();
        max.value = max.value*10;
        long start = System.nanoTime();
        for (i.value = 0; i.value <max.value; i.value++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime wrapper(max: ="+max.value + "): " + microseconds);
    }
}

Result:

MicroTime wrapper(max: =10000.0): 157
MicroTime wrapper(max: =100000.0): 1561
MicroTime wrapper(max: =1000000.0): 3174
MicroTime wrapper(max: =1.0E7): 15630
MicroTime wrapper(max: =1.0E8): 155471
MicroTime wrapper(max: =1.0E9): 1520967
MicroTime wrapper(max: =1.0E10): 15373311

The more iterations, the faster the second code gets. But why? I know that the Java compiler and the JVM optimize my code, but I never thought that primitive types could be slower than objects with field access.
Does anyone have a plausible explanation for this?

Edit: the HDouble class:

public class HDouble {
    public double value;

    public HDouble() {
    }

    public HDouble(double value) {
        this.value = value;
    }

    @Override
    public String toString() {
        return String.valueOf(value);
    }
}
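To make the motivation concrete, here is a hypothetical use of such a holder as an "out" parameter (the tryParse name and logic are my own invention, not from the question): Java has no out parameters for primitives, so a mutable wrapper carries a result back to the caller while the return value reports success.

```java
// HDouble as defined in the question (abbreviated).
class HDouble {
    public double value;

    @Override
    public String toString() {
        return String.valueOf(value);
    }
}

public class OutParamDemo {
    // Simulates an "out" parameter: the parsed number is written through the
    // holder, while the boolean return value reports success or failure.
    static boolean tryParse(String s, HDouble out) {
        try {
            out.value = Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        HDouble result = new HDouble();
        if (tryParse("3.14", result)) {
            System.out.println("parsed: " + result);
        }
    }
}
```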

I also tested my loops with code inside them. For example, I calculated the sum -> same behaviour (the difference is not that big, but I thought the primitive version would have to be much faster?). At first I thought that the calculation takes so long that the field access makes nearly no difference.

Wrapper for-loop:

for (i.value = 0; i.value <max.value; i.value++) {
    sum.value = sum.value + i.value;
}

Result:

MicroTime wrapper(max: =10000.0): 243
MicroTime wrapper(max: =100000.0): 2805
MicroTime wrapper(max: =1000000.0): 3409
MicroTime wrapper(max: =1.0E7): 28104
MicroTime wrapper(max: =1.0E8): 278432
MicroTime wrapper(max: =1.0E9): 2678322
MicroTime wrapper(max: =1.0E10): 26665540

Primitive for-loop:

for (i = 0; i < max; i++) {
    sum = sum + i;
}

Result:

MicroTime primitive(max: =10000.0): 149
MicroTime primitive(max: =100000.0): 1996
MicroTime primitive(max: =1000000.0): 2289
MicroTime primitive(max: =1.0E7): 27085
MicroTime primitive(max: =1.0E8): 279939
MicroTime primitive(max: =1.0E9): 2759133
MicroTime primitive(max: =1.0E10): 27369724

Solution

It's so easy to get fooled by hand-made microbenchmarks - you never know what they actually measure. That's why there are special tools like JMH. But let's analyze what happens with this hand-made benchmark:

static class HDouble {
    double value;
}

public static void main(String[] args) {
    primitive();
    wrapper();
}

public static void primitive() {
    long start = System.nanoTime();
    for (double d = 0; d < 1000000000; d++) {
    }
    long end = System.nanoTime();
    System.out.printf("Primitive: %.3f s\n", (end - start) / 1e9);
}

public static void wrapper() {
    HDouble d = new HDouble();
    long start = System.nanoTime();
    for (d.value = 0; d.value < 1000000000; d.value++) {
    }
    long end = System.nanoTime();
    System.out.printf("Wrapper:   %.3f s\n", (end - start) / 1e9);
}

The results are somewhat similar to yours:

Primitive: 3.618 s
Wrapper:   1.380 s

Now repeat the test several times:

public static void main(String[] args) {
    for (int i = 0; i < 5; i++) {
        primitive();
        wrapper();
    }
}

It gets more interesting:

Primitive: 3.661 s
Wrapper:   1.382 s
Primitive: 3.461 s
Wrapper:   1.380 s
Primitive: 1.376 s <-- starting from 3rd iteration
Wrapper:   1.381 s <-- the timings become equal
Primitive: 1.371 s
Wrapper:   1.372 s
Primitive: 1.379 s
Wrapper:   1.378 s

It looks like both methods finally got optimized. Run it once again, now logging JIT compiler activity: -XX:-TieredCompilation -XX:CompileOnly=Test -XX:+PrintCompilation

    136    1 %           Test::primitive @ 6 (53 bytes)
   3725    1 %           Test::primitive @ -2 (53 bytes)   made not entrant
Primitive: 3.589 s
   3748    2 %           Test::wrapper @ 17 (73 bytes)
   5122    2 %           Test::wrapper @ -2 (73 bytes)   made not entrant
Wrapper:   1.374 s
   5122    3             Test::primitive (53 bytes)
   5124    4 %           Test::primitive @ 6 (53 bytes)
Primitive: 3.421 s
   8544    5             Test::wrapper (73 bytes)
   8547    6 %           Test::wrapper @ 17 (73 bytes)
Wrapper:   1.378 s
Primitive: 1.372 s
Wrapper:   1.375 s
Primitive: 1.378 s
Wrapper:   1.373 s
Primitive: 1.375 s
Wrapper:   1.378 s

Note the % sign in the compilation log on the first iteration. It means that the methods were compiled in OSR (on-stack replacement) mode. During the second iteration the methods were recompiled in normal mode, and from the third iteration onward there was no difference between primitive and wrapper in execution speed.

What you've actually measured is the performance of the OSR stub. It is usually not related to the real performance of an application, and you shouldn't care much about it.
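A sketch of mine (not part of the original answer): one common way to keep the OSR stub out of the measurement is to move the measured loop into its own method and call that method repeatedly before the timed run, so that HotSpot compiles it through its normal entry point first.

```java
// Warm-up calls let HotSpot compile sumLoop() in normal (non-OSR) mode, so the
// timed run below executes the regularly compiled method, not an OSR stub.
public class WarmupDemo {
    // The work under test: the summing loop from the question, over primitives.
    static double sumLoop(double max) {
        double sum = 0;
        for (double i = 0; i < max; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Warm-up: enough invocations to trigger normal JIT compilation.
        double sink = 0;
        for (int i = 0; i < 20_000; i++) {
            sink += sumLoop(1_000);
        }
        // Timed run: measures the compiled method entry, not an OSR transition.
        long start = System.nanoTime();
        double result = sumLoop(100_000_000);
        long end = System.nanoTime();
        System.out.printf("sum=%.1f, time=%.3f s (sink=%.0f)%n",
                result, (end - start) / 1e9, sink);
    }
}
```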

But the question still remains: why is the OSR stub for a wrapper compiled better than for a primitive variable? To find out, we need to get down to the generated assembly code:
-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
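A practical note of mine, not part of the original answer: -XX:+PrintAssembly only produces output when the hsdis disassembler plugin is installed on the JVM's library path; without it, HotSpot prints a warning and disables the flag. Assuming the benchmark class is named Test, as in the compilation log above, the invocation would look like:

```shell
# Requires the hsdis disassembler plugin; flags as given in the answer.
java -XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Test
```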

I'll omit all irrelevant code, leaving only the compiled loop.

Primitive:

0x00000000023e90d0: vmovsd 0x28(%rsp),%xmm1      <-- load double from the stack
0x00000000023e90d6: vaddsd -0x7e(%rip),%xmm1,%xmm1
0x00000000023e90de: test   %eax,-0x21f90e4(%rip)
0x00000000023e90e4: vmovsd %xmm1,0x28(%rsp)      <-- store to the stack
0x00000000023e90ea: vucomisd 0x28(%rsp),%xmm0    <-- compare with the stack value
0x00000000023e90f0: ja     0x00000000023e90d0

Wrapper:

0x00000000023ebe90: vaddsd -0x78(%rip),%xmm0,%xmm0
0x00000000023ebe98: vmovsd %xmm0,0x10(%rbx)      <-- store to the object field
0x00000000023ebe9d: test   %eax,-0x21fbea3(%rip)
0x00000000023ebea3: vucomisd %xmm0,%xmm1         <-- compare registers
0x00000000023ebea7: ja     0x00000000023ebe90

As you can see, the 'primitive' case makes a number of loads and stores to a stack location, while the 'wrapper' case does mostly in-register operations. It is quite understandable why the OSR stub refers to the stack: in interpreted mode, local variables are stored on the stack, and the OSR stub is made compatible with that interpreted frame. In the 'wrapper' case the value is stored on the heap, and the reference to the object is already cached in a register.
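As a closing aside of mine: the JMH tool mentioned at the top of the answer handles warm-up, OSR, and dead-code elimination automatically. A minimal sketch of the same comparison might look like the following (names are my own; it requires the org.openjdk.jmh dependency and a JMH runner, so it is not runnable standalone):

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class LoopBench {
    static class HDouble {
        double value;
    }

    @Benchmark
    public double primitiveLoop() {
        double sum = 0;
        for (double i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        return sum; // returning the result keeps the JIT from removing the loop
    }

    @Benchmark
    public double wrapperLoop() {
        HDouble i = new HDouble();
        double sum = 0;
        for (i.value = 0; i.value < 1_000_000; i.value++) {
            sum += i.value;
        }
        return sum;
    }
}
```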
