Why aren't simple properties optimized to fields?


Question

sealed class A
{
    public int X;
    public int Y { get; set; }
}

If I create a new instance of A it takes me about 550ms to access Y 100,000,000 times, while it takes about 250ms to access X. I'm running it as a release build and it's still much slower for the property. Why doesn't .NET optimize Y to a field?

Edit:

    A t = new A();
    t.Y = 50;
    t.X = 50;

    Int64 y = 0;

    Stopwatch sw = new Stopwatch();
    sw.Start();

    for (int i = 0; i < 100000000; i++)
        y += t.Y;

    sw.Stop();

That's the code I'm using to test; I change t.Y to t.X to test X instead. I'm also using a release build.

Answer

for (int i = 0; i < 100000000; i++)
    y += t.X;

This is very difficult code to profile. You can see that when looking at the generated machine code with Debug + Windows + Disassembly. The x64 code looks like this:

0000005a  xor         r11d,r11d                           ; i = 0
0000005d  mov         eax,dword ptr [rbx+0Ch]             ; read t.X
00000060  add         r11d,4                              ; i += 4
00000064  cmp         r11d,5F5E100h                       ; test i < 100000000
0000006b  jl          0000000000000060                    ; for (;;)

This is heavily optimized code; note how the += operator has completely disappeared. You allowed this to happen because you made a mistake in your benchmark: you never use the computed value of y. The jitter knows this, so it simply removed the pointless addition. The increment by 4 needs an explanation as well: it is a side effect of a loop unrolling optimization. You'll see it used later.

So you must make a change to your benchmark to make it realistic, add this line at the end:

sw.Stop();
Console.WriteLine("{0} msec, {1}", sw.ElapsedMilliseconds, y);

This forces the value of y to be computed. The machine code now looks completely different:

0000005d  xor         ebp,ebp                             ; y = 0
0000005f  mov         eax,dword ptr [rbx+0Ch]          
00000062  movsxd      rdx,eax                             ; rdx = t.X
00000065  nop         word ptr [rax+rax+00000000h]        ; align branch target
00000070  lea         rax,[rdx+rbp]                       ; y += t.X
00000074  lea         rcx,[rax+rdx]                       ; y += t.X
00000078  lea         rax,[rcx+rdx]                       ; y += t.X
0000007c  lea         rbp,[rax+rdx]                       ; y += t.X
00000080  add         r11d,4                              ; i += 4
00000084  cmp         r11d,5F5E100h                       ; test i < 100000000
0000008b  jl          0000000000000070                    ; for (;;)

Still very optimized code. The weirdo NOP instruction ensures that the jump at address 008b is efficient: jumping to an address that's aligned to 16 helps the instruction decoder unit in the processor. The LEA instruction is a classic trick to let the address generation unit perform an addition, allowing the main ALUs to do other work at the same time. There is no other work to be done here, but there could be if the loop body was more involved. And the loop was unrolled 4 times so that fewer branch instructions are executed.
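To make the disassembly easier to follow, here is a rough C# rendering of what the unrolled loop does. This is only a hand-written sketch, not something the jitter produces as source: the names UnrollSketch, SumSimple and SumUnrolled are made up for illustration, and the unrolled version assumes the iteration count is a multiple of 4 (as 100,000,000 is).

```csharp
using System;

sealed class A
{
    public int X;
    public int Y { get; set; }
}

static class UnrollSketch
{
    // The loop as written in the question.
    public static long SumSimple(A t, int count)
    {
        long y = 0;
        for (int i = 0; i < count; i++)
            y += t.X;
        return y;
    }

    // Hand-written approximation of the unrolled machine code.
    // Assumption: count is a multiple of 4.
    public static long SumUnrolled(A t, int count)
    {
        long y = 0;
        long x = t.X;               // read once before the loop (mov + movsxd)
        for (int i = 0; i < count; i += 4)
        {
            y += x;                 // the four LEA instructions
            y += x;
            y += x;
            y += x;
        }
        return y;
    }

    static void Main()
    {
        var t = new A { X = 50 };
        // Both versions compute the same sum.
        Console.WriteLine(SumSimple(t, 100000000) == SumUnrolled(t, 100000000));
    }
}
```

Only one compare-and-branch is executed per four additions, which is where the i += 4 in the listing comes from.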

Anyhoo, now you are actually measuring real code, instead of removed code. Result on my machine, repeating the test 10 times (important!):

y += t.X: 125 msec
y += t.Y: 125 msec

Exactly the same amount of time. Of course, it should be that way. You don't pay for a property.
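It may help to see why the two can end up identical. An auto-property is, roughly, a private backing field plus two trivial accessor methods, and accessors this small are normally inlined by the jitter in an optimized build. The sketch below writes the property out by hand; the field name _y is an assumption for illustration (the real backing field has a compiler-generated name).

```csharp
using System;

sealed class A
{
    public int X;

    // Approximately what "public int Y { get; set; }" expands to.
    private int _y;                 // hypothetical name; the compiler picks its own
    public int Y
    {
        get { return _y; }          // small enough to be inlined into a plain read
        set { _y = value; }
    }
}

static class PropertySketch
{
    static void Main()
    {
        var t = new A();
        t.X = 50;
        t.Y = 50;
        // Once the getter is inlined, both reads are simple memory loads.
        Console.WriteLine(t.X + t.Y);
    }
}
```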

The jitter does an excellent job of generating quality machine code. If you get a strange result, then always check your test code first. It is the code most likely to contain a mistake, not the jitter, which has been thoroughly tested.
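Putting the advice together, a corrected benchmark could look something like the sketch below. It is one possible arrangement rather than the exact code used for the numbers above: the names PropertyBenchmark, SumField and SumProperty are made up, and the timings you see will depend on your machine and runtime. Build it in Release mode.

```csharp
using System;
using System.Diagnostics;

sealed class A
{
    public int X;
    public int Y { get; set; }
}

static class PropertyBenchmark
{
    const int Iterations = 100000000;

    public static long SumField(A t, int count)
    {
        long y = 0;
        for (int i = 0; i < count; i++)
            y += t.X;
        return y;                   // returning y keeps the addition alive
    }

    public static long SumProperty(A t, int count)
    {
        long y = 0;
        for (int i = 0; i < count; i++)
            y += t.Y;
        return y;
    }

    static void Main()
    {
        var t = new A { X = 50, Y = 50 };

        // Repeat the measurement; a single run is not trustworthy.
        for (int run = 0; run < 10; run++)
        {
            var sw = Stopwatch.StartNew();
            long yField = SumField(t, Iterations);
            sw.Stop();
            long fieldMs = sw.ElapsedMilliseconds;

            sw.Restart();
            long yProp = SumProperty(t, Iterations);
            sw.Stop();

            // Printing the sums means the results are actually used.
            Console.WriteLine("X: {0} msec ({1}), Y: {2} msec ({3})",
                fieldMs, yField, sw.ElapsedMilliseconds, yProp);
        }
    }
}
```

The first run or two may come out slower because they can include JIT compilation, which is one more reason to repeat the test and look at the later runs.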
