为什么我简单的四元数乘法比SSE快? [英] Why is my straightforward quaternion multiplication faster than SSE?

查看:158
本文介绍了为什么我简单的四元数乘法比SSE快?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在研究几种不同的四元数乘法实现,但是我很惊讶地看到参考实现到目前为止是我最快的.这是有问题的实现:

I've been going through a few different quaternion multiplication implementations, but I've been rather surprised to see that the reference implementation is, so far, my fastest. This is the implementation in question:

inline static quat multiply(const quat& lhs, const quat& rhs)
{
    return quat((lhs.w * rhs.x) + (lhs.x * rhs.w) + (lhs.y * rhs.z) - (lhs.z * rhs.y),
                (lhs.w * rhs.y) + (lhs.y * rhs.w) + (lhs.z * rhs.x) - (lhs.x * rhs.z),
                (lhs.w * rhs.z) + (lhs.z * rhs.w) + (lhs.x * rhs.y) - (lhs.y * rhs.x),
                (lhs.w * rhs.w) - (lhs.x * rhs.x) - (lhs.y * rhs.y) - (lhs.z * rhs.z));
}

我已经尝试了其他一些实现,其中一些使用SSE,而有些则没有.这是一个这样的SSE实现示例,基本上是从Bullet Physics使用的库中复制的:

I've tried a few other implementations, some using SSE, some that don't. Here's an example of one such SSE implementation, basically copied from the library that Bullet Physics uses:

inline static __m128 multiplynew(__m128 lhs, __m128 rhs)
{
    __m128 qv, tmp0, tmp1, tmp2, tmp3;
    __m128 product, l_wxyz, r_wxyz, xy, qw;
    vec4 sw;

    tmp0 = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 0, 2, 1));
    tmp1 = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 1, 0, 2));
    tmp2 = _mm_shuffle_ps(lhs, lhs, _MM_SHUFFLE(3, 1, 0, 2));
    tmp3 = _mm_shuffle_ps(rhs, rhs, _MM_SHUFFLE(3, 0, 2, 1));
    qv = _mm_mul_ps(_mm_splat_ps(lhs, 3), rhs);
    qv = _mm_madd_ps(_mm_splat_ps(rhs, 3), lhs, qv);
    qv = _mm_madd_ps(tmp0, tmp1, qv);
    qv = _mm_nmsub_ps(tmp2, tmp3, qv);
    product = _mm_mul_ps(lhs, rhs);
    l_wxyz = _mm_sld_ps(lhs, lhs, 12);
    r_wxyz = _mm_sld_ps(rhs, rhs, 12);
    qw = _mm_nmsub_ps(l_wxyz, r_wxyz, product);
    xy = _mm_madd_ps(l_wxyz, r_wxyz, product);
    qw = _mm_sub_ps(qw, _mm_sld_ps(xy, xy, 8));

    sw.uiw = 0xffffffff;
    return _mm_sel_ps(qv, qw, sw);
}

在启用了优化的发布模式下,我的简单参考实现比bullet的SSE实现快70%-90%.在没有优化的调试模式下,它的运行速度快了3倍.

In release mode with optimizations turned on, my simple reference implementation runs 70%-90% faster than bullet's SSE implementation. In debug mode without optimizations, it runs as much as 3x faster.

我的第一个问题是,为什么会这样?

My first question is, why does this happen?

我的第二个问题是,我有什么方法可以优化四元数-四元数乘法例程?我不想处理汇编,但是我在其他地方经常使用sse内在函数.

My second question is, is there any way for me to optimize my quaternion-quaternion multiplication routine? I don't want to deal with assembly, but I use sse intrinsics quite a lot elsewhere.

(顺便说一句,我的四元数的数据存储定义为union { __m128 data; struct { float x, y, z, w; }; float f[4]; };)

(btw, if it matters, the data storage of my quaternion is defined as union { __m128 data; struct { float x, y, z, w; }; float f[4]; };)

我看了看反汇编.这是multiply(快速的非sse代码)的反汇编:

I looked at the dissassembly. Here's the disassembly for multiply (the fast non-sse one):

00EC9940  movaps      xmm3,xmmword ptr [esp+0D0h]  
00EC9948  movaps      xmm2,xmmword ptr [esp+0C0h]  
00EC9950  movaps      xmm4,xmm3  
00EC9953  mulss       xmm4,xmm5  
00EC9957  movaps      xmm0,xmm2  
00EC995A  mulss       xmm0,xmm6  
00EC995E  mulss       xmm3,xmm1  
00EC9962  addss       xmm4,xmm0  
00EC9966  movss       xmm0,dword ptr [esp+40h]  
00EC996C  mulss       xmm0,xmm1  
00EC9970  addss       xmm4,xmm0  
00EC9974  movss       xmm0,dword ptr [esp+0F0h]  
00EC997D  mulss       xmm0,xmm7  
00EC9981  subss       xmm4,xmm0  
00EC9985  movss       xmm0,dword ptr [esp+0F0h]  
00EC998E  mulss       xmm0,xmm6  
00EC9992  addss       xmm3,xmm0  
00EC9996  movaps      xmm0,xmm2  
00EC9999  movaps      xmm2,xmmword ptr [esp+40h]  
00EC999E  mulss       xmm0,xmm7  
00EC99A2  addss       xmm3,xmm0  
00EC99A6  movaps      xmm0,xmm2  
00EC99A9  mulss       xmm0,xmm5  
00EC99AD  mulss       xmm2,xmm6  
00EC99B1  subss       xmm3,xmm0  
00EC99B5  movss       xmm0,dword ptr [esp+0D0h]  
00EC99BE  mulss       xmm0,xmm7  
00EC99C2  addss       xmm2,xmm0  
00EC99C6  movss       xmm0,dword ptr [esp+0F0h]  
00EC99CF  mulss       xmm0,xmm5  
00EC99D3  addss       xmm2,xmm0  
00EC99D7  movss       xmm0,dword ptr [esp+0C0h]  
00EC99E0  mulss       xmm0,xmm1  
00EC99E4  movss       xmm1,dword ptr [esp+0D0h]  
00EC99ED  mulss       xmm1,xmm6  
00EC99F1  subss       xmm2,xmm0  
00EC99F5  movss       xmm0,dword ptr [esp+0C0h]  
00EC99FE  mulss       xmm0,xmm5  
00EC9A02  movaps      xmm5,xmmword ptr [esp+50h]  
00EC9A07  unpcklps    xmm4,xmm2  
00EC9A0A  subss       xmm1,xmm0  
00EC9A0E  movss       xmm0,dword ptr [esp+0F0h]  
00EC9A17  mulss       xmm0,xmm5  
00EC9A1B  subss       xmm1,xmm0  
00EC9A1F  movss       xmm0,dword ptr [esp+40h]  
00EC9A25  mulss       xmm0,xmm7  
00EC9A29  subss       xmm1,xmm0  
00EC9A2D  unpcklps    xmm3,xmm1  
00EC9A30  unpcklps    xmm4,xmm3  
00EC9A33  movaps      xmm5,xmm4  
00EC9A36  movaps      xmmword ptr [esp+30h],xmm5  
00EC9A3B  dec         eax  
00EC9A3C  je          SDL_main+58Ah (0EC9A5Ah)  

这是multiplynew(慢的sse之一)的反汇编:

And here's the disassembly for multiplynew (the slow sse one):

00329BF3  movaps      xmm6,xmm5  
00329BF6  mulps       xmm6,xmm1  
00329BF9  movaps      xmm0,xmm5  
00329BFC  mov         dword ptr [esp+6Ch],0FFFFFFFFh  
00329C04  shufps      xmm0,xmm5,93h  
00329C08  movaps      xmm1,xmm5  
00329C0B  mulps       xmm4,xmm0  
00329C0E  movaps      xmm0,xmmword ptr [esp+110h]  
00329C16  movaps      xmm3,xmm6  
00329C19  shufps      xmm1,xmm5,0FFh  
00329C1D  mulps       xmm1,xmmword ptr [esp+40h]  
00329C22  movaps      xmm7,xmmword ptr [esp+60h]  
00329C27  addps       xmm3,xmm4  
00329C2A  mulps       xmm0,xmm5  
00329C2D  subps       xmm6,xmm4  
00329C30  shufps      xmm3,xmm3,4Eh  
00329C34  addps       xmm1,xmm0  
00329C37  movaps      xmm0,xmm5  
00329C3A  shufps      xmm0,xmm5,0C9h  
00329C3E  subps       xmm6,xmm3  
00329C41  mulps       xmm0,xmmword ptr [esp+120h]  
00329C49  shufps      xmm5,xmm5,0D2h  
00329C4D  mulps       xmm5,xmmword ptr [esp+0C0h]  
00329C55  andps       xmm6,xmmword ptr [esp+60h]  
00329C5A  addps       xmm1,xmm0  
00329C5D  subps       xmm1,xmm5  
00329C60  andnps      xmm7,xmm1  

我测试速度的方法是使用:

The way I test the speed is using:

timer.update();
for (uint i = 0; i < 1000000; ++i)
{
    temp1 = quat::multiply(temp1, q1);
}
timer.update();
printf("1M calls to multiplyOld took %fs.\n", timer.getDeltaTime());

(timer.getDeltaTime()返回上次调用timer.update()与在此之前调用timer.update()的时间之间经过的时间(以秒为单位).

(timer.getDeltaTime() returns how much time has passed, in seconds, between the last time timer.update() was called and the time that timer.update() was called before that.)

为什么我的非SSE版本运行速度更快,尽管有更多说明.我是在阅读反汇编中的错误信息吗?

Why does my non-sse version run faster despite having more instructions..? Am I reading the disassembly wrong or something?

我发现在x64中进行编译时,sse版本的运行速度比非sse版本的运行速度快.

I've discovered that the sse version is running faster than the non-sse version when I compile in x64.

推荐答案

使用SIMD的最佳方法是使用SSE(AVX)一次乘以四个(八个)独立的四元数.但是,这通常是不切实际的.在这种情况下,我建议您查看Agner Fog的 vectorclass .在目录special中,他有一个文件quaterinon.h.我转换了乘法函数以匹配您的代码.这仅需要SSE2(#include <emmintrin.h>).

The best way to use SIMD would be to multiply four (eight) independent quaternions at once with SSE (AVX). However, this often is not practical. In that case I recommend checking out Agner Fog's vectorclass. In the directory special he has a file quaterinon.h. I converted the multiply function to match your code. This only needs SSE2 (#include <emmintrin.h>).

inline static __m128 multiplynew(__m128 a, __m128 b) {  
    __m128 a1123 = _mm_shuffle_ps(a,a,0xE5);
    __m128 a2231 = _mm_shuffle_ps(a,a,0x7A);
    __m128 b1000 = _mm_shuffle_ps(b,b,0x01);
    __m128 b2312 = _mm_shuffle_ps(b,b,0x9E);
    __m128 t1    = _mm_mul_ps(a1123, b1000);
    __m128 t2    = _mm_mul_ps(a2231, b2312);
    __m128 t12   = _mm_add_ps(t1, t2);
    const __m128i mask =_mm_set_epi32(0,0,0,0x80000000);
    __m128 t12m  = _mm_xor_ps(t12, _mm_castsi128_ps(mask)); // flip sign bits
    __m128 a3312 = _mm_shuffle_ps(a,a,0x9F);
    __m128 b3231 = _mm_shuffle_ps(b,b,0x7B);
    __m128 a0000 = _mm_shuffle_ps(a,a,0x00);
    __m128 t3    = _mm_mul_ps(a3312, b3231);
    __m128 t0    = _mm_mul_ps(a0000, b);
    __m128 t03   = _mm_sub_ps(t0, t3);
    return         _mm_add_ps(t03, t12m);
}

这是程序集输出(GCC 4.8 MASM样式):

Here is the assembly output (GCC 4.8 MASM style):

multiplynew(float __vector, float __vector):
movaps  xmm2, xmm0
movaps  xmm3, xmm0
movaps  xmm5, xmm1
movaps  xmm4, xmm1
shufps  xmm2, xmm0, 229
shufps  xmm4, xmm1, 158
shufps  xmm3, xmm0, 122
shufps  xmm5, xmm1, 1
mulps   xmm3, xmm4
movaps  xmm4, xmm1
mulps   xmm2, xmm5
shufps  xmm4, xmm1, 123
addps   xmm2, xmm3
movaps  xmm3, xmm0
shufps  xmm3, xmm0, 159
shufps  xmm0, xmm0, 0
xorps   xmm2, XMMWORD PTR .LC0[rip]
mulps   xmm3, xmm4
mulps   xmm0, xmm1
subps   xmm0, xmm3
addps   xmm0, xmm2
ret

这篇关于为什么我简单的四元数乘法比SSE快?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆