SSE normalization slower than simple approximation?

Views: 85 Posted: 2021/6/8 18:59:40 Tags: c++ normalization profile sse approximation


Problem description

I am trying to normalize a 4D vector.

My first approach was to use SSE intrinsics, which gave my vector arithmetic a 2x speed boost. Here is the basic code (v.v4 is the input, compiled with GCC, and all of this is inlined):

//find squares
v4sf s = __builtin_ia32_mulps(v.v4, v.v4);
//set t to square
v4sf t = s;
//add the 4 squares together
s = __builtin_ia32_shufps(s, s, 0x1B);
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x4e);
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x1B);
t = __builtin_ia32_addps(t, s);
//find 1/sqrt of t
t = __builtin_ia32_rsqrtps(t);
//multiply to get normal
return Vec4(__builtin_ia32_mulps(v.v4, t));

I checked the disassembly and it looks the way I would expect. I don't see any big problems there.

Anyway, I then tried it using an approximation (I got this from Google):

float x = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float xhalf = 0.5f*x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f3759df - (i>>1); // give initial guess y0
x = *(float*)&i; // convert bits back to float
x *= 1.5f - xhalf*x*x; // Newton step; repeating this step increases accuracy
//x *= 1.5f - xhalf*x*x;
return Vec4(v.w*x, v.x*x, v.y*x, v.z*x);

It runs slightly faster than the SSE version (about 5-10% faster)! Its results are also very accurate: I would say to within 0.001 when finding the length. But GCC is giving me that lame strict-aliasing warning because of the type punning.

So I modified it:

union {
    float fa;
    int ia;
};
fa = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float faHalf = 0.5f*fa;
ia = 0x5f3759df - (ia>>1);
fa *= 1.5f - faHalf*fa*fa;
//fa *= 1.5f - faHalf*fa*fa;
return Vec4(v.w*fa, v.x*fa, v.y*fa, v.z*fa);

And now the modified version (with no warnings) is running slower! It runs at almost 60% of the speed of the SSE version (but gives the same result). Why is this?

So here are my questions:

Is my SSE implementation correct?
Is SSE really slower than normal FPU operations?
Why is the third code so much slower?
