Assembly Performance Tuning


Problem description

I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example, I was told that on the Intel architecture, using any register other than EAX for math incurs a cost (presumably because the value is swapped into EAX to do the actual math). Here is at least one source that states the possibility (http://www.swansontec.com/sregisters.html).

I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:

#include "stdafx.h"
#include <intrin.h>
#include <iostream>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    __int64 startval;
    __int64 stopval;
    unsigned int value; // Keep the value so the computation is not optimized out

    startval = __rdtsc(); // Get the CPU Tick Counter using assembly RDTSC opcode

    // Simple Math: a = (a << 3) + 0x0054E9
    _asm {
        mov ebx, 0x1E532 // Seed
        shl ebx, 3
        add ebx, 0x0054E9
        mov value, ebx
    }

    stopval = __rdtsc();
    __int64 val = (stopval - startval);
    cout << "Result: " << value << " -> " << val << endl;

    int i;
    cin >> i;

    return 0;
}

I tried this code swapping eax and ebx but I'm not getting a "stable" number. I would hope that the test would be deterministic (the same number every time) because it's so short that it's unlikely a context switch is occurring during the test. As it stands there is no statistical difference but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples the number is still impossibly varied.

I'd also like to test xor eax, eax vs mov eax, 0, but have the same problem.

Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program Z80 for my TI-Calc I had a tool where I could select some assembly and it would tell me how many clock cycles it would take to execute the code -- can that not be done with our newfangled modern processors?

EDIT: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU is much more likely to context switch and the test becomes about everything but what I am testing.

Solution

To even have a hope of repeatable, deterministic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.

You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.

Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute -- the first couple of executions can be slower than others.

As such, the normal timing sequence for your code would be something like this:

XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID            ; Intel says by the third execution, the timing will be stable.
RDTSC            ; read the clock
push eax         ; save the start time (low dword)
push edx         ; save the start time (high dword)

    mov ebx, 0x1E532  ; seed -- execute test sequence
    shl ebx, 3
    add ebx, 0x0054E9
    mov value, ebx

XOR EAX, EAX      ; serialize
CPUID
rdtsc             ; get end time
pop ecx           ; get start time back (high dword)
pop esi           ; low dword; ESI is used instead of EBP because inline assembly
                  ; must not clobber the frame pointer
sub eax, esi      ; find end-start
sbb edx, ecx
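The same sequence can also be written with compiler intrinsics instead of inline assembly. What follows is a minimal sketch, not a definitive harness. The helper names `serialize`, `read_ticks`, `time_once`, and `min_ticks` are hypothetical. Repeating the measurement and keeping the smallest reading is an assumption added here, not part of the sequence above. The `std::chrono` fallback exists only so the sketch still compiles on non-x86 targets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>
  // CPUID leaf 0, executed only for its serializing side effect.
  static inline void serialize() { int regs[4]; __cpuid(regs, 0); }
  static inline std::uint64_t read_ticks() { return __rdtsc(); }
#elif defined(__i386__) || defined(__x86_64__)
  #include <x86intrin.h>
  // volatile + "memory" keep the compiler from deleting or reordering the CPUID.
  static inline void serialize() {
      unsigned a = 0, b, c = 0, d;
      __asm__ __volatile__("cpuid"
                           : "+a"(a), "=b"(b), "+c"(c), "=d"(d)
                           :
                           : "memory");
  }
  static inline std::uint64_t read_ticks() { return __rdtsc(); }
#else
  #include <chrono>
  // Assumption: non-x86 fallback so the sketch compiles; it is NOT cycle-accurate.
  static inline void serialize() {}
  static inline std::uint64_t read_ticks() {
      return static_cast<std::uint64_t>(
          std::chrono::steady_clock::now().time_since_epoch().count());
  }
#endif

// Time one call of `body`: warm up CPUID three times, read the clock,
// run the code, serialize again, read the clock again.
template <class F>
std::uint64_t time_once(F&& body) {
    serialize();
    serialize();
    serialize();
    const std::uint64_t start = read_ticks();
    body();
    serialize();
    const std::uint64_t stop = read_ticks();
    return stop - start;
}

// Assumption (not from the answer): take several samples and keep the
// smallest, which discards readings inflated by interrupts or cache misses.
template <class F>
std::uint64_t min_ticks(F&& body, int samples) {
    assert(samples > 0);
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (int i = 0; i < samples; ++i) {
        best = std::min(best, time_once(body));
    }
    return best;
}

// The test sequence from the question, expressed in C++: a = (a << 3) + 0x0054E9.
inline unsigned test_sequence(unsigned a) { return (a << 3) + 0x0054E9u; }
```

With a `volatile unsigned sink`, a call such as `min_ticks([&] { sink = test_sequence(0x1E532u); }, 1000)` times the snippet from the question. Each reading includes the cost of the trailing CPUID, so only the difference between two such readings is meaningful.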

We're starting to get close, but there's one last point that's difficult to deal with using inline code on most compilers: there can also be some effects from crossing cache lines, so you normally want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.
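To see why alignment matters, the arithmetic can be sketched as follows. This is an illustration only: the 64-byte cache line size is an assumption (common on x86, but not stated above), and `lines_touched` and `align_up` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of cache lines covered by `length` bytes of code starting at `address`.
// Assumption: 64-byte cache lines unless the caller says otherwise.
constexpr std::size_t lines_touched(std::uintptr_t address,
                                    std::size_t length,
                                    std::size_t line_size = 64) {
    return length == 0
               ? 0
               : (address + length - 1) / line_size - address / line_size + 1;
}

// Round `address` up to the next multiple of `alignment` (a power of two).
constexpr std::uintptr_t align_up(std::uintptr_t address,
                                  std::uintptr_t alignment = 16) {
    return (address + alignment - 1) & ~(alignment - 1);
}
```

A hypothetical 12-byte sequence placed at address 0x103A straddles two lines (`lines_touched(0x103A, 12) == 2`). Aligned up to 0x1040, it fits in one. Because 64 is a multiple of 16, a sequence of at most 16 bytes that starts on a 16-byte boundary can never straddle a 64-byte line.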

Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much -- the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so your putting a value in EBX vs. EAX has little effect on the register that the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.
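A toy model can make the renaming idea concrete. The sketch below is a deliberate simplification and not how any particular CPU is implemented: `ArchReg` and `RenameTable` are hypothetical names, the old physical register is recycled immediately (real hardware waits until the instruction retires), and the count of 40 used in the usage note simply echoes the rough figure above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>

// Architectural register names visible to the programmer (subset, for illustration).
enum class ArchReg { Eax, Ebx, Ecx, Edx, Count };

class RenameTable {
public:
    explicit RenameTable(int physical_count) {
        const int arch_count = static_cast<int>(ArchReg::Count);
        assert(physical_count >= arch_count);
        // Initially architectural register i lives in physical register i;
        // every remaining physical register starts on the free list.
        for (int r = 0; r < arch_count; ++r) map_[static_cast<std::size_t>(r)] = r;
        for (int p = arch_count; p < physical_count; ++p) free_.push_back(p);
    }

    // Physical register currently holding the value of `reg`.
    int lookup(ArchReg reg) const { return map_[static_cast<std::size_t>(reg)]; }

    // An instruction that writes `reg` gets the next free physical register.
    // Simplification: the previous one is returned to the free list at once.
    int rename_destination(ArchReg reg) {
        assert(!free_.empty());
        const int fresh = free_.front();
        free_.pop_front();
        int& slot = map_[static_cast<std::size_t>(reg)];
        free_.push_back(slot);
        slot = fresh;
        return fresh;
    }

private:
    std::array<int, static_cast<std::size_t>(ArchReg::Count)> map_{};
    std::deque<int> free_;
};
```

In this model, three consecutive writes to `ArchReg::Eax` on one `RenameTable(40)` and three writes to `ArchReg::Ebx` on another are assigned the same physical registers (4, 5, then 6). The choice depends on what is free, not on the architectural name.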
