My SSE implementation of lookAt doesn't work


Question


So, I'm writing a math library using SSE intrinsics to use with my OpenGL application. Right now I'm implementing some of the more important functions like lookAt, using the glm library to check for correctness, but for some reason my implementation of lookAt isn't working as it should.

Here's the source code:

inline void lookAt(__m128 position, __m128 target, __m128 up)
{
    /* Get the target vector relative to the camera position */
    __m128 t = vec4::normalize3(_mm_sub_ps(target, position));
    __m128 u = vec4::normalize3(up);
    /* Get the right vector by crossing target and up. */
    __m128 r = vec4::normalize3(vec4::cross(t, u));
    /* Correct the up vector by crossing right and target. */
    u = vec4::cross(r, t);
    /* Negate the target vector. */
    t = _mm_sub_ps(_mm_setzero_ps(), t);

    /* Treat the right, up, and target vector as a matrix, and transpose it. */
    /* Conveniently, this also sets the w component of all four to 0.0f */
    _MM_TRANSPOSE4_PS(r, u, t, _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f));

    vec4 pos = _mm_sub_ps(_mm_setzero_ps(), position);
    pos.w = 1.0f;

    /* Multiply our matrix by the transposed vectors. */
    mat4 temp;
    temp.col0 = r;
    temp.col1 = u;
    temp.col2 = t;
    temp.col3 = _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f);

    multiply(temp);
    translate(pos);
}

My matrices are column-major, stored internally as "__m128 col0, col1, col2, col3;".

I wrote this after reading the man page for gluLookAt. Once I realized that the right, up, and target vectors looked an awful lot like a row-major matrix, it was simple to transpose them so I could assign them to the rotation matrix.

The code for normalize3, in case it helps:

inline static __m128 normalize3(const __m128& vec)
{
    __m128 v = _mm_mul_ps(vec, vec);
    v = _mm_add_ps(
        _mm_add_ps(
            _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)),
            _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))),
        _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)));

    return _mm_mul_ps(vec, _mm_rsqrt_ps(v));
}

It saves a couple of calls by ignoring the w component of the vector.

What am I doing wrong?

Here's some sample output. Using position(5.0, 5.0, 0.0), target(10.0, 20.0, 55.0), and up (0.0, 1.0, 0.0), I get:

From GLM:

  • [-0.9959] [ 0.0000] [ 0.0905] [ 4.9795]
  • [-0.0237] [ 0.9650] [-0.2610] [-4.7065]
  • [-0.0874] [-0.2621] [-0.9611] [ 1.7474]
  • [ 0.0000] [ 0.0000] [ 0.0000] [ 1.0000]

From my lookAt():

  • [-0.9959] [ 0.0000] [ 0.0905] [-5.0000]
  • [-0.0237] [ 0.9651] [-0.2610] [-5.0000]
  • [-0.0874] [-0.2621] [-0.9611] [ 0.0000]
  • [ 0.0000] [ 0.0000] [ 0.0000] [ 1.0000]

It seems that the only difference is in the fourth column (the translation), but I'm honestly not sure which of the two is correct. I'm inclined to say that GLM's is correct, since it was designed to be identical to the glu version.

EDIT: I just discovered something interesting. If I call "translate(pos);" before calling "multiply(temp);", my resulting matrix is exactly the same as glm's. Which is correct? According to the OpenGL man page on gluLookAt, this (and thus glm) is doing it backwards. Was I doing it right before, or is it correct now?
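For what it's worth, the sample numbers can be checked against a plain scalar reference. The sketch below is not the asker's library: `Vec3`, `Mat4`, and `lookAtReference` are hypothetical names chosen for illustration. It builds the view matrix as rotation times translate(-eye), which is the product the gluLookAt man page describes (glMultMatrix followed by glTranslate, each post-multiplying the current matrix), so the translation column is the negated eye position expressed in the camera's axes.

```cpp
#include <cmath>

// Hypothetical helper types, for illustration only.
struct Vec3 { float x, y, z; };
struct Mat4 { float m[4][4]; };  // m[row][col], laid out as it is printed above

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static Vec3 normalize(Vec3 v) {
    float inv = 1.0f / std::sqrt(dot(v, v));
    return {v.x * inv, v.y * inv, v.z * inv};
}

// View matrix = R * T(-eye): the rows of R are right, up, and -forward,
// and the last column is R applied to -eye.
Mat4 lookAtReference(Vec3 eye, Vec3 center, Vec3 up) {
    Vec3 f = normalize(sub(center, eye));  // forward (the "target" vector)
    Vec3 s = normalize(cross(f, up));      // right
    Vec3 u = cross(s, f);                  // corrected up
    Mat4 r = {{
        { s.x,  s.y,  s.z, -dot(s, eye)},
        { u.x,  u.y,  u.z, -dot(u, eye)},
        {-f.x, -f.y, -f.z,  dot(f, eye)},
        {0.0f, 0.0f, 0.0f, 1.0f},
    }};
    return r;
}
```

With position (5, 5, 0), target (10, 20, 55), and up (0, 1, 0), the last column of this matrix comes out at roughly (4.9795, -4.7065, 1.7474), in line with the GLM output. A last column of (-5, -5, 0) is what the opposite product, translate(-eye) times rotation, would give.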

Solution

One problem could be _mm_rsqrt_ps(v). It is only an approximation and is not very accurate. Replace it with _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(v)). If that fixes the problem, you might be able to win the speed back by polishing the rsqrt estimate with a Newton-Raphson step; see the question "Newton Raphson with SSE2 - can someone explain me these 3 lines".

Another suggestion: you can make your function more SIMD friendly by not doing horizontal operations (which you do in your normalization function). Instead of normalizing the vectors before you transpose, you can transpose first. This takes the vectors from (x,y,z,w) to (x,x,x,x), (y,y,y,y), (z,z,z,z), (w,w,w,w), that is, from an Array of Structs (AoS) to a Struct of Arrays (SoA). Then you only need to do 1.0f/sqrt(r*r+u*u+t*t) to normalize all three at once.

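__m128 t = _mm_sub_ps(target, position);
__m128 u = up;
__m128 r = vec4::cross(t, u);
u = vec4::cross(r, t);
t = _mm_sub_ps(_mm_setzero_ps(), t);
__m128 w = _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f);  // the macro writes to all four rows, so pass an lvalue
_MM_TRANSPOSE4_PS(r, u, t, w);  // AoS to SoA

// now normalize; lanes 0..2 of den hold |r|^2, |u|^2, |t|^2
__m128 den = _mm_add_ps(_mm_add_ps(_mm_mul_ps(r, r), _mm_mul_ps(u, u)), _mm_mul_ps(t, t));
// lane 3 of den is 0 when the inputs have w = 0; add 1 there to avoid 0 * inf = NaN
den = _mm_add_ps(den, _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f));
__m128 norm = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(den));
r = _mm_mul_ps(norm, r); u = _mm_mul_ps(norm, u); t = _mm_mul_ps(norm, t);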

norm is not a single scalar. It contains the four different normalizations (n1,n2,n3,n4), so norm*r = (n1*x1, n2*x2, n3*x3, n4*x4). For an efficient way to do matrix multiplication with SSE, see the question "Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?".
