SSE normalization slower than simple approximation?


Question

I am trying to normalize a 4D vector.

My first approach was to use SSE intrinsics, which had already given my vector arithmetic a 2x speed boost. Here is the basic code (v.v4 is the input, the compiler is GCC, and all of this is inlined):

// square each component
v4sf s = __builtin_ia32_mulps(v.v4, v.v4);
// t accumulates the sum of the squares
v4sf t = s;
// permute s three times so that every lane of t ends up holding all 4 squares
s = __builtin_ia32_shufps(s, s, 0x1B);
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x4e);
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x1B);
t = __builtin_ia32_addps(t, s);
// approximate 1/sqrt of the sum in every lane
t = __builtin_ia32_rsqrtps(t);
// scale the input to get the normalized vector
return Vec4(__builtin_ia32_mulps(v.v4, t));

I checked the disassembly and it looks the way I would expect. I don't see any big problems there.
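For comparison, here is a minimal sketch of the same normalization written with the portable `<xmmintrin.h>` intrinsics instead of the GCC builtins. It is a variation, not the code from the question: it sums the four squares with two shuffles and two adds rather than three of each, and the function name `normalize4` is made up for illustration. It assumes an x86 target with SSE enabled.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical helper: returns v scaled by an approximate 1/|v|.
// _mm_rsqrt_ps is an approximation, so the result is not exact.
inline __m128 normalize4(__m128 v) {
    // sq = [a, b, c, d], the squared components
    __m128 sq = _mm_mul_ps(v, v);
    // swap neighbours: [b, a, d, c]
    __m128 swapped = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1));
    // pairs = [a+b, a+b, c+d, c+d]
    __m128 pairs = _mm_add_ps(sq, swapped);
    // swap halves: [c+d, c+d, a+b, a+b]
    __m128 halves = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 0, 3, 2));
    // every lane now holds a+b+c+d
    __m128 sum = _mm_add_ps(pairs, halves);
    // multiply by the approximate reciprocal square root
    return _mm_mul_ps(v, _mm_rsqrt_ps(sum));
}
```

Whether the shorter shuffle sequence is measurably faster depends on the CPU and on what the compiler does with the surrounding code, so it is worth timing both.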

Anyway, I then tried an approximation (I found this one via Google):

float x = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float xhalf = 0.5f*x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f3759df - (i>>1); // give initial guess y0
x = *(float*)&i; // convert bits back to float
x *= 1.5f - xhalf*x*x; // newton step, repeating this step
// increases accuracy
//x *= 1.5f - xhalf*x*x;
return Vec4(v.w*x, v.x*x, v.y*x, v.z*x);

It runs slightly faster than the SSE version (about 5-10% faster)! Its results are also very accurate: I would say to within 0.001 when finding the length. But GCC gives me that lame strict-aliasing warning because of the type punning.

So I modified it:

union {
    float fa;
    int ia;
};
fa = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float faHalf = 0.5f*fa;
ia = 0x5f3759df - (ia>>1);
fa *= 1.5f - faHalf*fa*fa;
//fa *= 1.5f - faHalf*fa*fa;
return Vec4(v.w*fa, v.x*fa, v.y*fa, v.z*fa);

Now the modified version (with no warnings) runs slower! It runs at almost 60% of the speed of the SSE version (but gives the same result). Why is this?

So here are my questions:

  1. Is my SSE implementation correct?
  2. Is SSE really slower than normal FPU operations?
  3. Why is the third version so slow?
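On the strict-aliasing warning mentioned above: a third way to reinterpret the bits, which the question does not try, is to copy them with `std::memcpy`. The sketch below is only an illustration; the name `fast_rsqrt` is hypothetical, and it assumes a 32-bit IEEE-754 `float` and a positive input. Compilers commonly turn a small fixed-size `memcpy` like this into a plain register move, but that is worth confirming in the disassembly rather than assuming.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: approximate 1/sqrt(x) for x > 0, using the same
// magic constant and single Newton step as the code in the question.
inline float fast_rsqrt(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this trick assumes a 32-bit float");
    const float xhalf = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);  // read the float's bits without aliasing issues
    i = 0x5f3759dfu - (i >> 1);     // initial guess y0
    float y;
    std::memcpy(&y, &i, sizeof y);  // convert the bits back to a float
    y *= 1.5f - xhalf * y * y;      // one Newton step
    return y;
}
```

With this helper, the normalization would multiply each component by `fast_rsqrt(w*w + x*x + y*y + z*z)`, just as the scalar versions above do with their own local variable.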

Recommended answer

I am a dope: I realized I had SETI@Home running while benchmarking. I'm guessing it was killing my SSE performance. I turned it off and got it running twice as fast.

I also tested it on an AMD Athlon and got the same result: SSE was faster.

At least I fixed the shuf bug!
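Since the surprising numbers came from another process competing for the CPU, one way to make such measurements more robust is to time several trials and keep the fastest one. The following is a rough sketch under that assumption, not the harness used in the question; the name `best_of_ns` and its parameters are invented for illustration.

```cpp
#include <chrono>

// Hypothetical helper: calls fn() `reps` times per trial, runs `trials`
// trials, and returns the duration of the fastest trial in nanoseconds.
// Returns -1 if trials <= 0. Taking the minimum rather than the mean makes
// the result less sensitive to interruptions by other processes.
template <typename Fn>
long long best_of_ns(Fn fn, int trials, int reps) {
    using clock = std::chrono::steady_clock;
    long long best = -1;
    for (int t = 0; t < trials; ++t) {
        const auto start = clock::now();
        for (int r = 0; r < reps; ++r) {
            fn();
        }
        const auto stop = clock::now();
        const long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (best < 0 || ns < best) {
            best = ns;
        }
    }
    return best;
}
```

The callable passed in should store its result somewhere the compiler cannot discard (for example a `volatile` variable), otherwise an optimizing build may remove the work being measured. Even with this, closing heavy background jobs before benchmarking remains the simpler fix.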
