SSE2 double multiplication slower than with standard multiplication

Question

I'm wondering why the following code with SSE2 instructions performs the multiplication slower than the standard C++ implementation. Here is the code:

m_win = (double*)_aligned_malloc(size*sizeof(double), 16);   // 16-byte aligned buffer for the window
__m128d* pData = (__m128d*)input().data;
__m128d* pWin  = (__m128d*)m_win;
__m128d* pOut  = (__m128d*)m_output.data;
__m128d tmp;
int i=0;
for(; i<m_size/2; i++)                          // each __m128d holds two doubles, hence m_size/2 iterations
    pOut[i] = _mm_mul_pd(pData[i], pWin[i]);    // multiply two pairs of doubles at once

The memory for m_output.data and input().data has been allocated with _aligned_malloc.

However, the time to execute this code for a 2^25-element array is identical to the time for this code (350 ms):

for(int i=0;i<m_size;i++)
    m_output.data[i] = input().data[i] * m_win[i];

How is that possible? It should theoretically take only 50% of the time, right? Or is the overhead for the memory transfer from SIMD registers to the m_output.data array so expensive?

If I replace the line from the first snippet

pOut[i] = _mm_mul_pd(pData[i], pWin[i]);

with

tmp = _mm_mul_pd(pData[i], pWin[i]);

where __m128d tmp; then the code executes blazingly fast, less than the resolution of my timer function. Is that because everything is just stored in the registers and not in memory?
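
One thing worth checking in the disassembly before reading too much into that timing: once the store to pOut is gone, tmp is never read, so a release-mode optimizer is free to drop the multiplication and even the whole loop, which would also explain a time below the timer's resolution. A minimal sketch of a variant that still avoids the per-iteration store but keeps the work observable -- the accumulator acc and the single trailing store are illustrative additions, not part of the original code:

__m128d acc = _mm_setzero_pd();                            // hypothetical accumulator
for(int i=0; i<m_size/2; i++)
    acc = _mm_add_pd(acc, _mm_mul_pd(pData[i], pWin[i]));  // the multiply still happens every iteration
_mm_store_pd(m_output.data, acc);                          // one visible store keeps the loop alive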

Even more surprisingly, if I compile in debug mode, the SSE code takes only 93 ms while the standard multiplication takes 309 ms.

  • DEBUG: 93 ms (SSE2) / 309 ms (standard multiplication)
  • RELEASE: 350 ms (SSE2) / 350 ms (standard multiplication)

What is going on here???

I'm using MSVC2008 with QtCreator 2.2.1 in release mode. Here are my compiler switches for RELEASE:

cl -c -nologo -Zm200 -Zc:wchar_t- -O2 -MD -GR -EHsc -W3 -w34100 -w34189

And these for DEBUG:

cl -c -nologo -Zm200 -Zc:wchar_t- -Zi -MDd -GR -EHsc -W3 -w34100 -w34189

EDIT Regarding the RELEASE vs DEBUG issue: I just wanted to note that I profiled the code, and the SSE code is in fact slower in release mode! That just confirms the hypothesis that VS2008 somehow can't handle intrinsics properly with the optimizer. Intel VTune gives me 289 ms for the SSE loop in DEBUG and 504 ms in RELEASE mode. Wow... just wow...

Answer

First of all, VS 2008 is a bad choice for intrinsics as it tends to add many more register moves than necessary and in general does not optimize very well (for instance, it has issues with loop induction variable analysis when SSE instructions are present).

So, my wild guess is that the compiler generates mulsd instructions which the CPU can trivially reorder and execute in parallel (no dependencies between the iterations), while the intrinsics result in lots of register moves/complex SSE code -- it might even blow the trace cache on modern CPUs. VS2008 is notorious for doing all its calculations in registers, and I guess there will be some hazards that the CPU cannot skip (e.g. xor reg, mov mem->reg, xor, mov mem->reg, mul, mov reg->mem, which is a dependency chain, whereas the scalar code might be mov mem->reg, mul with mem operand, mov). You should definitely look at the generated assembly or try VS 2010, which has much better support for intrinsics.
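
One way to act on that advice with MSVC is to ask the compiler for an assembly listing (cl /FAs writes a .asm file with the source interleaved) and compare the two loop bodies. For that comparison it can also help to write the intrinsics loop with explicit loads and stores, so every memory access is visible in the source; a sketch, assuming (as the question states) that input().data, m_win, and m_output.data are 16-byte-aligned double arrays and m_size is even:

const double* in = input().data;             // hoisted so the call is not repeated per iteration
for(int i=0; i<m_size; i+=2) {
    __m128d a = _mm_load_pd(in + i);         // aligned load of two doubles
    __m128d b = _mm_load_pd(m_win + i);
    _mm_store_pd(m_output.data + i, _mm_mul_pd(a, b));
}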

Finally, and most important: your code is not compute bound at all, so no amount of SSE will make it significantly faster. On each iteration you are reading four double values and writing two, which means FLOPs are not your problem. In that case you're at the mercy of the cache/memory subsystem, and that probably explains the variance you see. The debug multiplication shouldn't be faster than release; if you see it being faster, you should do more runs and check what else is going on (be careful if your CPU supports a turbo mode, which adds another 20% of variation). A context switch which empties the cache might be enough in this case.
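
A rough back-of-the-envelope check (figures assumed here, not from the original answer): 2^25 doubles are 256 MiB per array, so each pass reads two such arrays and writes a third, at least 3 x 256 MiB = 768 MiB of traffic (more once write-allocate reads for the output are counted). Moving that in 350 ms corresponds to roughly 2 GB/s sustained, which is in the range a single core of that era gets from DRAM -- consistent with both loops running at memory speed rather than at multiplication speed.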

So, overall, the test you made is pretty much meaningless and just shows that for memory-bound cases it makes no difference whether you use SSE or not. You should use SSE where the code is actually compute-dense and parallel, and even then I would spend a lot of time with a profiler to nail down the exact location to optimize. A simple element-wise product is not suitable for seeing any performance improvements with SSE.
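
To illustrate what "compute-dense" can look like (a hypothetical standalone example, not taken from the question): a kernel that performs several floating-point operations per element it loads, such as evaluating a cubic polynomial with Horner's scheme, gives the FPU enough work per byte of traffic that SSE2's two-doubles-per-instruction throughput can actually show up in the timings:

#include <emmintrin.h>

// Hypothetical example: y[i] = ((c3*x + c2)*x + c1)*x + c0, six FLOPs per element
// instead of one. Assumes x and y are 16-byte aligned and n is even.
void poly3(const double* x, double* y, int n)
{
    const __m128d c3 = _mm_set1_pd(1.1), c2 = _mm_set1_pd(2.2),
                  c1 = _mm_set1_pd(3.3), c0 = _mm_set1_pd(4.4);
    for(int i=0; i<n; i+=2) {
        __m128d v = _mm_load_pd(x + i);                 // two doubles per iteration
        __m128d r = _mm_add_pd(_mm_mul_pd(c3, v), c2);  // Horner's scheme
        r = _mm_add_pd(_mm_mul_pd(r, v), c1);
        r = _mm_add_pd(_mm_mul_pd(r, v), c0);
        _mm_store_pd(y + i, r);
    }
}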
