.net delegate without target slower than with target


Problem description



When I execute the following code in Release mode on my machine, the execution of a delegate with a non-null target is always slightly faster than when the delegate has a null target (I expected it to be equivalent or slower).

I'm really not looking for a micro-optimization, but I was wondering why this is the case?

static void Main(string[] args)
{
    // Warmup code

    long durationWithTarget = 
        MeasureDuration(() => new DelegatePerformanceTester(withTarget: true).Run());

    Console.WriteLine($"With target: {durationWithTarget}");

    long durationWithoutTarget = 
        MeasureDuration(() => new DelegatePerformanceTester(withTarget: false).Run());

    Console.WriteLine($"Without target: {durationWithoutTarget}");
}

/// <summary>
/// Measures the duration of an action.
/// </summary>
/// <param name="action">Action which duration has to be measured.</param>
/// <returns>The duration in milliseconds.</returns>
private static long MeasureDuration(Action action)
{
    Stopwatch stopwatch = Stopwatch.StartNew();

    action();

    return stopwatch.ElapsedMilliseconds;
}

class DelegatePerformanceTester
{
    public DelegatePerformanceTester(bool withTarget)
    {
        if (withTarget)
        {
            _func = AddNotStatic;
        }
        else
        {
            _func = AddStatic;
        }
    }
    private readonly Func<double, double, double> _func;

    private double AddNotStatic(double x, double y) => x + y;
    private static double AddStatic(double x, double y) => x + y;

    public void Run()
    {
        const int loops = 1000000000;
        for (int i = 0; i < loops; i++)
        {
            double funcResult = _func.Invoke(1d, 2d);
        }
    }
}

Solution

I'll write this one up; there is pretty decent programming advice behind it that ought to matter to any C# programmer who cares about writing fast code. In general I caution against micro-benchmarks: differences of 15% or less are usually not statistically significant, given how unpredictable code execution speed is on a modern CPU core. A good way to reduce the odds of measuring something that isn't there is to repeat each test at least 10 times, to wash out caching effects, and to swap the order of the tests, to eliminate code-alignment effects.
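One way to follow that advice with the code from the question is a measurement loop along these lines (a minimal sketch; it reuses MeasureDuration and DelegatePerformanceTester from the question, and the repetition count is arbitrary):

// Repeat and interleave both variants so caching and code-alignment
// effects average out instead of biasing a single measurement.
const int repetitions = 10;

for (int i = 0; i < repetitions; i++)
{
    long withTarget =
        MeasureDuration(() => new DelegatePerformanceTester(withTarget: true).Run());
    long withoutTarget =
        MeasureDuration(() => new DelegatePerformanceTester(withTarget: false).Run());

    Console.WriteLine($"Run {i}: with target = {withTarget} ms, without target = {withoutTarget} ms");
}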

But what you saw is real: delegates that invoke a static method are in fact slower. The effect is quite small in x86 code but noticeably worse in x64 code; be sure to tinker with the Project > Properties > Build tab > "Prefer 32-bit" and "Platform target" settings to try both.
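If you want the program itself to confirm which flavor of jitter it ended up running under, a one-line sanity check (plain BCL call, purely a sketch) does the job:

// A 32-bit process is compiled by the x86 jitter, a 64-bit process by the x64 jitter.
Console.WriteLine(Environment.Is64BitProcess ? "x64 jitter" : "x86 jitter");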

Knowing why it is slower requires looking at the machine code that the jitter generates. In the case of delegates, that code is very well hidden; you will not see it when you look at the code with Debug > Windows > Disassembly. And you can't even single-step through it: the managed debugger was written to hide it and flatly refuses to show it. So I'll have to describe a technique to put the "visual" back into Visual Studio.

I have to talk a bit about "stubs". A stub is a little sliver of machine code that the CLR creates dynamically, in addition to the code that the jitter generates. Stubs are used to implement interfaces: they provide the flexibility that the order of the methods in a class's method table does not have to match the order of the interface methods. And they matter for delegates, the subject of this question. Stubs also matter for just-in-time compilation: the initial code in a stub points at an entry point into the jitter, so the method gets compiled the first time it is invoked, after which the stub is replaced by one that calls the jitted target method. It is the stub that makes the static method call slower; the stub for a static method target is more elaborate than the stub for an instance method.
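For reference, whether a delegate has a target at all is what the question's title refers to, and it is visible through the delegate's Target property: a delegate bound to an instance method carries the instance as its target, a delegate bound to a static method carries null. A minimal sketch (the type and method names are made up for the illustration):

class TargetDemo
{
    public double AddInstance(double x, double y) => x + y;
    public static double AddStatic(double x, double y) => x + y;

    public static void Show()
    {
        Func<double, double, double> withTarget = new TargetDemo().AddInstance;
        Func<double, double, double> withoutTarget = AddStatic;

        Console.WriteLine(withTarget.Target == null);    // False: the TargetDemo instance
        Console.WriteLine(withoutTarget.Target == null);  // True: static method, no target
    }
}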


To see the stubs, you have to wrangle the debugger into showing their code. Some setup is required: first use Tools > Options > Debugging > General. Untick the "Just My Code" checkbox and untick the "Suppress JIT optimization" checkbox. If you use VS2015, also tick "Use Managed Compatibility Mode"; the VS2015 debugger is very buggy and gets seriously in the way for this kind of debugging, and this option provides a workaround by forcing the VS2010 managed debugger engine to be used. Switch to the Release configuration. Then, in Project > Properties > Debug, tick the "Enable native code debugging" checkbox. And in Project > Properties > Build, untick the "Prefer 32-bit" checkbox; "Platform target" should be AnyCPU.

Set a breakpoint on the Run() method; beware that breakpoints are not very accurate in optimized code, so setting it on the method header is best. Once it hits, use Debug > Windows > Disassembly to see the machine code that the jitter generated. The delegate invoke call looks like this on a Haswell core; it might not match what you see if you have an older processor that doesn't support AVX yet:

                funcResult += _func.Invoke(1d, 2d);
0000001a  mov         rax,qword ptr [rsi+8]               ; rax = _func              
0000001e  mov         rcx,qword ptr [rax+8]               ; rcx = _func._methodBase (?)
00000022  vmovsd      xmm2,qword ptr [0000000000000070h]  ; arg3 = 2d
0000002b  vmovsd      xmm1,qword ptr [0000000000000078h]  ; arg2 = 1d
00000034  call        qword ptr [rax+18h]                 ; call stub

A 64-bit method call passes the first four arguments in registers; any additional arguments are passed on the stack (not the case here). The XMM registers are used because the arguments are floating point. At this point the jitter cannot yet know whether the method is static or an instance method; that can only be discovered when this code actually executes. It is the job of the stub to hide the difference. It assumes it will be an instance method, which is why I annotated the arguments as arg2 and arg3.

Set a breakpoint on the CALL instruction; the second time it hits (so after the stub no longer points into the jitter) you can have a look at it. That has to be done by hand: use Debug > Windows > Registers and copy the value of the RAX register. Open Debug > Windows > Memory > Memory1, paste the value, put "0x" in front of it and add 0x18. Right-click that window, select "8-byte Integer" and copy the first displayed value. That is the address of the stub code.

Now the trick: at this point the managed debugging engine is still in use and will not let you look at the stub code. You have to force a mode switch so the unmanaged debugging engine takes control. Use Debug > Windows > Call Stack and double-click a method call at the bottom, like RtlUserThreadStart; that forces the debugger to switch engines. Now you are good to go: paste the address into the Address box, with "0x" in front of it, and out pops the stub code:

  00007FFCE66D0100  jmp         00007FFCE66D0E40  

A very simple one: a straight jump to the delegate target method. This will be fast code. The guess at an instance method was correct, and the delegate object already provided the this argument in the RCX register, so nothing special needs to be done.

Proceed to the second test and do the exact same thing to look at the stub for the static call. Now the stub is very different:

000001FE559F0850  mov         rax,rsp                 ; ?
000001FE559F0853  mov         r11,rcx                 ; r11 = _func (?)
000001FE559F0856  movaps      xmm0,xmm1               ; shuffle arg3 into right register
000001FE559F0859  movaps      xmm1,xmm2               ; shuffle arg2 into right register
000001FE559F085C  mov         r10,qword ptr [r11+20h] ; r10 = _func.Method 
000001FE559F0860  add         r11,20h                 ; ?
000001FE559F0864  jmp         r10                     ; jump to _func.Method

The code is a bit wonky and not optimal (Microsoft could probably do a better job here), and I'm not 100% sure I annotated it correctly. I guess the seemingly unnecessary mov rax,rsp instruction is only relevant for stubs to methods with more than 4 arguments, and I have no idea why the add instruction is necessary. The detail that matters most is the XMM register moves: the stub has to reshuffle them because the static method does not take a this argument. It is this reshuffling requirement that makes the code slower.

You can do the same exercise with the x86 jitter; the static method stub now looks like this:

04F905B4  mov         eax,ecx  
04F905B6  add         eax,10h  
04F905B9  jmp         dword ptr [eax]      ; jump to _func.Method

Much simpler than the 64-bit stub, which is why 32-bit code does not suffer from the slowdown nearly as much. One reason it is so different is that 32-bit code passes floating point arguments on the FPU stack, so they don't have to be reshuffled. This won't necessarily be faster when you use integral or object arguments.
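If you want to check that last claim yourself, the question's tester is easy to adapt to integer arguments; a hedged sketch (the class name is made up, everything else mirrors the original, and you would measure it the same way before drawing conclusions):

// Variant of the question's tester with int arguments instead of double,
// to compare static vs. instance delegate targets when no XMM reshuffling is involved.
class IntDelegatePerformanceTester
{
    private readonly Func<int, int, int> _func;

    public IntDelegatePerformanceTester(bool withTarget)
    {
        if (withTarget)
        {
            _func = AddNotStatic;
        }
        else
        {
            _func = AddStatic;
        }
    }

    private int AddNotStatic(int x, int y) => x + y;
    private static int AddStatic(int x, int y) => x + y;

    public void Run()
    {
        const int loops = 1000000000;
        for (int i = 0; i < loops; i++)
        {
            int funcResult = _func.Invoke(1, 2);
        }
    }
}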


Very arcane; I hope I didn't put everybody to sleep yet. Beware that I might have gotten some of the annotations wrong, I don't fully understand stubs and the way the CLR cooks delegate object members to make the code as fast as possible. But there is certainly decent programming advice here: you really should favor instance methods as delegate targets; making them static is not an optimization.
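One hedged way to act on that advice when the logic naturally lives in a static method is to bind the delegate through a lambda rather than the method group, which gives the delegate a target at the cost of one extra call. The sketch below uses a stand-in AddStatic method and is illustrative only; whether the extra call wins back the stub overhead is something to measure, not assume. Note the compiler detail in the comments: with the Roslyn compiler a non-capturing lambda is emitted as an instance method on a cached object, while older compilers emitted such lambdas as static methods.

class LambdaBindingDemo
{
    static double AddStatic(double x, double y) => x + y;   // stand-in static method

    public static void Show()
    {
        // Method-group binding: Target is null, so the more elaborate static-method stub is used.
        Func<double, double, double> direct = AddStatic;

        // Lambda binding: with the Roslyn compiler the non-capturing lambda is emitted as an
        // instance method on a cached object, so Target is non-null and AddStatic is reached
        // through one extra call; older compilers emitted such lambdas as static methods.
        Func<double, double, double> viaLambda = (x, y) => AddStatic(x, y);

        Console.WriteLine(direct.Target == null);     // True
        Console.WriteLine(viaLambda.Target == null);  // False with current compilers
    }
}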
