Profiling _mm_setzero_ps and {0.0f,0.0f,0.0f,0.0f}

Question

As Cody Gray pointed out in his comment, profiling with optimization disabled is a complete waste of time. How, then, should I approach this test?

Microsoft's XMVectorZero uses _mm_setzero_ps when _XM_SSE_INTRINSICS_ is defined, and {0.0f,0.0f,0.0f,0.0f} otherwise. I decided to check how big the win is. So I used the following program in a Release x86 build with Configuration Properties > C/C++ > Optimization > Optimization set to Disabled (/Od).

#include <DirectXMath.h>    // XMVECTOR and the SSE intrinsic headers
using namespace DirectX;

constexpr __int64 loops = 1e9;
inline void fooSSE() {
    for (__int64 i = 0; i < loops; ++i) {
        XMVECTOR zero1 = _mm_setzero_ps();
        //XMVECTOR zero2 = _mm_setzero_ps();
        //XMVECTOR zero3 = _mm_setzero_ps();
        //XMVECTOR zero4 = _mm_setzero_ps();
    }
}
inline void fooNoIntrinsic() {
    for (__int64 i = 0; i < loops; ++i) {
        XMVECTOR zero1 = { 0.f,0.f,0.f,0.f };
        //XMVECTOR zero2 = { 0.f,0.f,0.f,0.f };
        //XMVECTOR zero3 = { 0.f,0.f,0.f,0.f };
        //XMVECTOR zero4 = { 0.f,0.f,0.f,0.f };
    }
}
int main() {
    fooNoIntrinsic();
    fooSSE();
}

I ran the program twice: first with only zero1, and a second time with all lines uncommented. In the first case the intrinsic loses; in the second, the intrinsic is the clear winner. So, my questions are:

  • Why doesn't the intrinsic always win?
  • Is the profiler I am using suitable for this kind of measurement?

Answer

Profiling with optimization disabled gives you meaningless results and is a complete waste of time. If you are disabling optimization because otherwise the optimizer notices that your benchmark actually does nothing useful and removes it entirely, then welcome to the difficulties of microbenchmarking!

It is often very difficult to concoct a test case that does enough real work that it will not be removed by a sufficiently smart optimizer, yet whose cost does not overwhelm and render your results meaningless. For example, many people's first instinct is to print out the incremental results using something like printf, but that is a non-starter because printf is incredibly slow and will absolutely ruin your benchmark. Making the variable that collects the intermediate values volatile will sometimes work, because it effectively disables load/store optimizations for that particular variable. Although this relies on ill-defined semantics, that is not important for a benchmark. Another option is to perform some pointless yet relatively cheap operation on the intermediate results, like adding them together. This relies on the optimizer not outsmarting you, and in order to verify that your benchmark results are meaningful, you will have to examine the object code emitted by the compiler and ensure that it is actually doing the work. There is no magic bullet for crafting a microbenchmark, unfortunately.
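
To make that concrete, here is a minimal sketch of the "cheap operation plus volatile sink" approach, assuming DirectXMath is available; benchLoop and sink are hypothetical names, and you would still need to inspect the generated code to confirm the loop survives optimization:

#include <DirectXMath.h>
using namespace DirectX;

// Hypothetical sink: storing one lane of the accumulator into a volatile
// variable gives the loop an observable side effect the optimizer must keep.
volatile float sink;

void benchLoop() {
    XMVECTOR acc = XMVectorZero();
    for (__int64 i = 0; i < 1000000000; ++i) {
        XMVECTOR zero = _mm_setzero_ps();   // or: XMVECTOR zero = { 0.f, 0.f, 0.f, 0.f };
        acc = XMVectorAdd(acc, zero);       // pointless but cheap work on the result
    }
    sink = XMVectorGetX(acc);               // consume the accumulator
}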

The best trick is usually to isolate the relevant portion of the code inside a function, parameterize it on one or more unpredictable input values, arrange for the result to be returned, and then put this function in an external module so that the optimizer can't get its grubby paws on it.
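
A minimal sketch of that structure, assuming two translation units; the file names, MakeVector, and the loop count are hypothetical, and note that whole-program optimization (/GL) can still inline across modules, so the disassembly should be checked either way:

// kernel.cpp -- built as a separate translation unit so the optimizer
// cannot see through the call from the timing loop.
#include <DirectXMath.h>

DirectX::XMVECTOR MakeVector(float x) {        // parameterized on an input value
    return DirectX::XMVectorSet(x, x, x, x);   // the code under test goes here
}

// bench.cpp -- times repeated calls to the externally defined function.
#include <chrono>
#include <cstdio>
#include <DirectXMath.h>

DirectX::XMVECTOR MakeVector(float x);         // defined in kernel.cpp

int main() {
    volatile float seed = 1.0f;                // input the optimizer cannot predict
    float sum = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) {
        sum += DirectX::XMVectorGetX(MakeVector(seed));   // consume the result
    }
    const auto stop = std::chrono::steady_clock::now();
    std::printf("%f computed in %lld ms\n", sum,
        static_cast<long long>(std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()));
}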

Since you will need to look at the disassembly anyway to confirm that your microbenchmark is suitable, this is often a good place to start. If you are sufficiently competent at reading assembly language, and you have sufficiently distilled the code in question, it may even be enough to let you make a judgment about the efficiency of the code. If you can't make heads or tails of it, then it is probably complicated enough that you can go ahead and benchmark it.

This is a good example of a case where a cursory examination of the generated object code is enough to answer the question without even needing to craft a benchmark.

Following my advice above, let's write a simple function to test out the intrinsic. In this case, we don't have any input to parameterize on because the code literally just sets a register to 0. So let's just return the zeroed structure from the function:

DirectX::XMVECTOR ZeroTest_Intrinsic()
{
    return _mm_setzero_ps();
}

And here is the other candidate, which performs the initialization in the seemingly naïve way:

DirectX::XMVECTOR ZeroTest_Naive()
{
    return { 0.0f, 0.0f, 0.0f, 0.0f };
}

Here is the object code generated by the compiler for these two functions (it doesn't matter which version, whether you compile for x86-32 or x86-64, or whether you optimize for size or speed; the results are the same):

ZeroTest_Intrinsic
    xorps  xmm0, xmm0
    ret

ZeroTest_Naive
    xorps  xmm0, xmm0
    ret

(If AVX or AVX2 instructions are supported, then both will be vxorps xmm0, xmm0, xmm0.)

That is pretty obvious, even to someone who cannot read assembly code: they are both identical! I'd say that pretty definitively answers the question of which one will be faster: they will be identical, because the optimizer recognizes the seemingly naïve initializer and translates it into a single, optimized assembly-language instruction for clearing a register.

Now, it is certainly possible that there are cases where this initialization is embedded deep within various complicated code constructs, preventing the optimizer from recognizing it and performing its magic. In other words, the "your test function is too simple!" objection. And that is most likely why the library's implementer chose to explicitly use the intrinsic whenever it is available. Its use guarantees that the code generator will emit the desired instruction, and therefore the code will be as optimized as possible.
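
For illustration, the shape of that dispatch looks roughly like the following. This is a simplified sketch, not the actual XMVectorZero source: MYLIB_USE_SSE_INTRINSICS is a hypothetical stand-in for _XM_SSE_INTRINSICS_, and plain __m128 stands in for XMVECTOR:

#include <xmmintrin.h>   // __m128 and _mm_setzero_ps

#if defined(MYLIB_USE_SSE_INTRINSICS)
inline __m128 MyVectorZero() {
    return _mm_setzero_ps();                 // the xorps/vxorps is guaranteed
}
#else
inline __m128 MyVectorZero() {
    __m128 zero = { 0.0f, 0.0f, 0.0f, 0.0f };
    return zero;                             // relies on the optimizer to emit xorps
}
#endif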

Another possible benefit of explicitly using the intrinsic is to ensure that you get the desired instruction even if the code is being compiled without SSE/SSE2 support. This isn't a particularly compelling use case, as I imagine it, because you wouldn't be compiling without SSE/SSE2 support if it were acceptable to use these instructions. And if you were explicitly trying to disable the generation of SSE/SSE2 instructions so that you could run on legacy systems, the intrinsic would ruin your day, because it would force an xorps instruction to be emitted, and the legacy system would throw an invalid-operation exception as soon as it hit that instruction.
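
If a single binary really did have to run on pre-SSE hardware, the usual answer is to gate the SSE path behind a runtime feature check instead of a compile-time choice. A minimal sketch, assuming Windows and its documented IsProcessorFeaturePresent API; ClearBuffer and its helpers are hypothetical, and the scalar fallback would itself need to be built without SSE code generation (/arch:IA32) to be safe on such machines:

#include <windows.h>
#include <xmmintrin.h>
#include <cstddef>

// Hypothetical scalar fallback for CPUs without SSE.
static void ClearBufferScalar(float* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] = 0.0f;
}

// SSE path: only reached after the runtime check below succeeds.
static void ClearBufferSSE(float* p, std::size_t n) {
    const __m128 zero = _mm_setzero_ps();          // xorps xmm0, xmm0
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(p + i, zero);
    for (; i < n; ++i)
        p[i] = 0.0f;
}

void ClearBuffer(float* p, std::size_t n) {
    // A real library would cache this query instead of asking on every call.
    if (IsProcessorFeaturePresent(PF_XMMI_INSTRUCTIONS_AVAILABLE))
        ClearBufferSSE(p, n);
    else
        ClearBufferScalar(p, n);
}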

I did see one interesting case, though. xorps is the single-precision version of this instruction and requires only SSE support. However, if I compile the functions shown above with only SSE support (no SSE2), I get the following:

ZeroTest_Intrinsic
    xorps  xmm0, xmm0
    ret

ZeroTest_Naive
    push   ebp
    mov    ebp, esp
    and    esp, -16
    sub    esp, 16

    mov    DWORD PTR [esp],    0
    mov    DWORD PTR [esp+4],  0
    mov    DWORD PTR [esp+8],  0
    mov    DWORD PTR [esp+12], 0
    movaps xmm0, XMMWORD PTR [esp]

    mov    esp, ebp
    pop    ebp
    ret

Clearly, for some reason, the optimizer is unable to apply the optimization to the initializer when SSE2 instruction support is not available, even though the xorps instruction it would be using requires only SSE, not SSE2! This is arguably a bug in the optimizer, but explicit use of the intrinsic works around it.
