SSE slower than FPU?

Views: 217 | Posted: 2020/5/21 20:52:39 | Tags: c++ optimization sse vectorization simd

Problem description

I have a large piece of code, part of whose body contains this expression:

result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);

which I have vectorized as follows (everything is already a float):

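// _mm_set_ps lists lanes from highest (3) to lowest (0), so after the multiply
// r = { nx*m_Lx, ny*m_Ly, nx*nx, ny*ny } in lanes 0..3.
__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));
// _mm_extract_ps (SSE4.1) returns the raw bits of one lane as an int.
__declspec(align(16)) int asInt[4] = {
    _mm_extract_ps(r,0), _mm_extract_ps(r,1),
    _mm_extract_ps(r,2), _mm_extract_ps(r,3)
};
// View the stored bit patterns as floats again.
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);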
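
For readers who want to reproduce the two variants side by side, here is a minimal, self-contained sketch. It makes several assumptions that are not in the original code: the members m_Lx, m_Ly and m_Lz are passed as plain parameters, the function names eval_scalar and eval_sse are hypothetical, and the lanes are read back with a single _mm_store_ps instead of four _mm_extract_ps calls so that the sketch needs only SSE and also compiles outside MSVC.

```cpp
#include <cmath>        // std::sqrt
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps, _mm_mul_ps, _mm_store_ps

// Scalar reference: the original expression, with the members passed in as
// parameters (an assumption made for this sketch).
float eval_scalar(float nx, float ny, float Lx, float Ly, float Lz) {
    return (nx * Lx + ny * Ly + Lz) / std::sqrt(nx * nx + ny * ny + 1.0f);
}

// SSE variant: one packed multiply, then one aligned store to read the lanes.
float eval_sse(float nx, float ny, float Lx, float Ly, float Lz) {
    // _mm_set_ps lists lanes from highest (3) to lowest (0).
    __m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                          _mm_set_ps(ny, nx, Ly, Lx));
    alignas(16) float res[4];
    _mm_store_ps(res, r);  // res = { nx*Lx, ny*Ly, nx*nx, ny*ny }
    return (res[0] + res[1] + Lz) / std::sqrt(res[2] + res[3] + 1.0f);
}
```

Calling both functions with the same arguments should give matching results up to floating-point rounding, which confirms the lane layout used in the question. Only one of the operations in the expression is packed here; the additions, the square root and the division remain scalar in both versions.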

The result is correct; however, my benchmarking shows that the vectorized version is slower:

The non-vectorized version takes 3750 ms
The vectorized version takes 4050 ms
Setting result to 0 directly (and removing this part of the code entirely) reduces the entire process to 2500 ms
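
Timings like these cover the whole process, so one way to look for a confounding variable is to time each variant in isolation. The following is only a sketch of such a harness, not the benchmark used in the question: time_ms is a hypothetical helper, and the accumulated sum is written to a caller-supplied sink so that the compiler cannot discard the loop as dead code.

```cpp
#include <chrono>   // std::chrono::steady_clock
#include <cstddef>  // std::size_t

// Hypothetical helper: calls f(i) for i in [0, iterations), accumulates the
// results, and returns the elapsed wall-clock time in milliseconds.
// The sum is stored in `sink` so the work has an observable effect.
template <typename F>
double time_ms(F f, std::size_t iterations, float& sink) {
    auto start = std::chrono::steady_clock::now();
    float acc = 0.0f;
    for (std::size_t i = 0; i < iterations; ++i) {
        acc += f(static_cast<float>(i));
    }
    auto stop = std::chrono::steady_clock::now();
    sink = acc;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each variant of the expression can be wrapped in a lambda that takes the loop value as one of its inputs and passed to time_ms with the same iteration count. Running each measurement several times and comparing the spread helps separate a real difference from noise.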

Given that the vectorized version only contains one set of SSE multiplications (instead of four individual FPU multiplications), why is it slower? Is the FPU indeed faster than SSE, or is there a confounding variable here?

(I'm on a mobile Core i5.)
