SSE slower than FPU?


Question

I have a large piece of code, part of which contains this expression:

result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);

which I have vectorized as follows (everything is already a float):

__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));
__declspec(align(16)) int asInt[4] = {
    _mm_extract_ps(r,0), _mm_extract_ps(r,1),
    _mm_extract_ps(r,2), _mm_extract_ps(r,3)
};
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);

The result is correct; however, my benchmarking shows that the vectorized version is slower:

  • The non-vectorized version takes 3750 ms
  • The vectorized version takes 4050 ms
  • Setting result to 0 directly (and removing this part of the code entirely) reduces the entire process to 2500 ms

Given that the vectorized version only contains one set of SSE multiplications (instead of four individual FPU multiplications), why is it slower? Is the FPU indeed faster than SSE, or is there a confounding variable here?

(I'm on a mobile Core i5.)

Recommended answer

You are spending a lot of time moving scalar values to/from SSE registers with _mm_set_ps and _mm_extract_ps - this is generating a lot of instructions, the execution time of which will far outweigh any benefit from using _mm_mul_ps. Take a look at the generated assembly output to see how much code is being generated in addition to the single MULPS instruction.

To vectorize this properly, you need to use 128-bit SSE loads and stores (_mm_load_ps/_mm_store_ps) and then use SSE shuffle instructions to move elements around within registers where needed.
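As a rough illustration of that advice, here is a minimal sketch that evaluates the expression for four inputs per iteration instead of packing one input's scalars into a register. It rests on assumptions the question does not confirm: that the expression runs in a loop over many (nx, ny) pairs, and that those values can be kept in two separate float arrays. The names shade_sse and shade_scalar, and the parameters Lx, Ly, Lz (standing in for m_Lx, m_Ly, m_Lz), are hypothetical. It uses the unaligned _mm_loadu_ps/_mm_storeu_ps so that it works with any buffers; with 16-byte-aligned data, _mm_load_ps/_mm_store_ps could be used instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical batch version: nx, ny and out each point to `count` floats
// (structure-of-arrays layout). Lx, Ly, Lz stand in for m_Lx, m_Ly, m_Lz.
void shade_sse(const float* nx, const float* ny, float* out, std::size_t count,
               float Lx, float Ly, float Lz)
{
    assert(count % 4 == 0);  // assumption: caller pads to a multiple of 4

    // Broadcast the constants once, outside the loop.
    const __m128 lx  = _mm_set1_ps(Lx);
    const __m128 ly  = _mm_set1_ps(Ly);
    const __m128 lz  = _mm_set1_ps(Lz);
    const __m128 one = _mm_set1_ps(1.0f);

    for (std::size_t i = 0; i < count; i += 4) {
        // One 128-bit load per input array, no per-lane inserts.
        const __m128 x = _mm_loadu_ps(nx + i);
        const __m128 y = _mm_loadu_ps(ny + i);

        // Numerator: nx * Lx + ny * Ly + Lz
        const __m128 num = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, lx), _mm_mul_ps(y, ly)), lz);

        // Denominator: sqrt(nx * nx + ny * ny + 1)
        const __m128 den = _mm_sqrt_ps(_mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)), one));

        // One 128-bit store writes four results, no per-lane extracts.
        _mm_storeu_ps(out + i, _mm_div_ps(num, den));
    }
}

// Scalar reference, the same expression as in the question.
float shade_scalar(float nx, float ny, float Lx, float Ly, float Lz)
{
    return (nx * Lx + ny * Ly + Lz) / std::sqrt(nx * nx + ny * ny + 1.0f);
}
```

With this layout every lane does useful work, including the square root and the division, and no shuffles are needed. If the inputs were instead interleaved as (nx, ny) pairs, shuffle instructions such as _mm_shuffle_ps would be one way to separate them after loading. Whether this is actually faster in the original program would have to be measured.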

One further point to note: modern CPUs such as the Core i5 and Core i7 have two scalar FPUs and can issue two floating-point multiplies per clock. The potential benefit from SSE for single-precision floating point is therefore only 2x at best. It's easy to lose most or all of this 2x benefit if you have excessive "housekeeping" instructions, as is the case here.
