SSE, row major vs column major performance issue


Question

For personal fun, I'm coding a geometry lib using SSE (4.1).

I spent the last 12 hours trying to understand a performance issue when dealing with row-major vs column-major matrix storage.

I know DirectX/OpenGL matrices are stored row-major, so it would be better for me to keep my matrices stored in row-major order so that no conversion is needed when storing/loading matrices to/from the GPU/shaders.

But I did some profiling, and I get faster results with column-major.

To transform a point with a transform matrix in row-major, it's P' = P * M; in column-major, it's P' = M * P. So in column-major it's simply 4 dot products, hence only 4 SSE4.1 instructions ( _mm_dp_ps ), while in row-major I must do those 4 dot products on the transposed matrix.
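The equivalence above can be checked with a scalar sketch (plain arrays and hypothetical helper names, not the questioner's Vec4/Mat4 classes): the same 16 floats in memory, read under either convention, produce the same transformed point, which is why the row-major path ends up dotting against the columns (i.e. the transposed rows) of M.

```cpp
// Scalar sketch of the two conventions.
// row-major:    P' = P * M  ->  out[c] = sum_r p[r] * M(r,c), M(r,c) at M[r*4 + c]
// column-major: P' = M * P  ->  out[r] = sum_c M(r,c) * p[c], M(r,c) at M[c*4 + r]
static void rowMajorTransform(const float p[4], const float M[16], float out[4]) {
    for (int c = 0; c < 4; ++c) {
        out[c] = 0.0f;
        for (int r = 0; r < 4; ++r)
            out[c] += p[r] * M[r*4 + c];   // walks a column of M (strided access)
    }
}

static void colMajorTransform(const float p[4], const float M[16], float out[4]) {
    for (int r = 0; r < 4; ++r) {
        out[r] = 0.0f;
        for (int c = 0; c < 4; ++c)
            out[r] += M[c*4 + r] * p[c];   // walks a row of M (strided access)
    }
}
```

Renaming the loop indices shows both loops compute the identical sum over the identical buffer, so the two conventions are just two readings of the same bytes.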

Performance result on 10M vectors

(30/05/2014@08:48:10) Log : [5] ( Vec.Mul.Matrix ) = 76.216653 ms ( row major transform )

(30/05/2014@08:48:10) Log : [6] ( Matrix.Mul.Vec ) = 61.554892 ms ( column major transform )

I tried several ways to do the Vec * Matrix operation, with and without _MM_TRANSPOSE, and the fastest way I found is this:

mssFloat    Vec4::operator|(const Vec4& v) const //-- Dot Product
{
    // _mm_dp_ps leaves the dot product in the low lane; extract it portably
    // (the MSVC-only .m128_f32[0] accessor replaced with _mm_cvtss_f32)
    return _mm_cvtss_f32(_mm_dp_ps(m_val, v.m_val, 0xFF));
}
inline Vec4 operator*(const Vec4& vec, const Mat4& m)
{
    // gather each column of the row-major matrix, then dot it with vec
    return Vec4(    Vec4( m[0][0],m[1][0],m[2][0],m[3][0]) | vec
        ,   Vec4( m[0][1],m[1][1],m[2][1],m[3][1]) | vec
        ,   Vec4( m[0][2],m[1][2],m[2][2],m[3][2]) | vec
        ,   Vec4( m[0][3],m[1][3],m[2][3],m[3][3]) | vec
                );
}

My class Vec4 is simply a __m128 m_val; with optimized C++ the vector construction is all done efficiently in SSE registers.

My first guess is that this multiplication is not optimal. I'm new to SSE, so I'm a bit puzzled about how to optimize this; my intuition tells me to use shuffle instructions, but I'd like to understand why that would be faster. Would loading 4 shuffled __m128 be faster than assigning ( __m128 m_val = _mm_set_ps(w, z, y, x); )?
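As a sketch of what the shuffle route could look like (plain __m128/float-array terms and a hypothetical function name, not the questioner's Vec4/Mat4 API): broadcast each vector component with _mm_shuffle_ps and accumulate against the contiguous rows of the row-major matrix, which avoids both the per-element column gathers and _mm_dp_ps.

```cpp
#include <xmmintrin.h>  // SSE1 only

// Hedged sketch: vec * mat for a row-major 4x4 matrix m (must be 16-byte
// aligned for _mm_load_ps). Each component of v is broadcast and multiplied
// by the matching row of m, so every matrix load is a contiguous 4-float load.
static inline __m128 vec4MulMat4(__m128 v, const float* m) {
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)); // {x,x,x,x}
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)); // {y,y,y,y}
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2)); // {z,z,z,z}
    __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)); // {w,w,w,w}
    return _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(x, _mm_load_ps(m + 0)),
                   _mm_mul_ps(y, _mm_load_ps(m + 4))),
        _mm_add_ps(_mm_mul_ps(z, _mm_load_ps(m + 8)),
                   _mm_mul_ps(w, _mm_load_ps(m + 12))));
}
```

Note this computes vec * M for a row-major M (equivalently M * vec for the same bytes read as column-major), using only vertical adds instead of horizontal dot products.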

From https://software.intel.com/sites/landingpage/IntrinsicsGuide/ I couldn't find performance info on _mm_set_ps.

EDIT : I double-checked the profiling method; each test is done in the same manner, so there are no memory-cache differences. To avoid local caching, I operate on a randomized big vector array; the seed is the same for each test. Only 1 test per execution, to avoid performance increases from the memory cache.

Solution

Don't use _mm_dp_ps for matrix multiplication! I already explained this in great detail at Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point? (incidentally this was my first post on SO).

You don't need anything more than SSE to do this efficiently (not even SSE2). Use this code to do 4x4 matrix multiplication efficiently. If the matrices are stored in row-major order then do gemm4x4_SSE(A,B,C). If the matrices are stored in column-major order then do gemm4x4_SSE(B,A,C).

void gemm4x4_SSE(float *A, float *B, float *C) {
    __m128 row[4], sum[4];
    // load the 4 rows of B (16-byte aligned)
    for(int i=0; i<4; i++)  row[i] = _mm_load_ps(&B[i*4]);
    for(int i=0; i<4; i++) {
        sum[i] = _mm_setzero_ps();
        // row i of C = sum over j of A[i][j] * (row j of B)
        for(int j=0; j<4; j++) {
            sum[i] = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(A[i*4+j]), row[j]), sum[i]);
        }
    }
    for(int i=0; i<4; i++) _mm_store_ps(&C[i*4], sum[i]);
}
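A quick way to sanity-check the routine (the definition is repeated here so the snippet compiles on its own; note that _mm_load_ps/_mm_store_ps require all three arrays to be 16-byte aligned):

```cpp
#include <xmmintrin.h>

// Same routine as above, repeated so this snippet is self-contained.
// A, B, C are row-major 4x4 matrices, each 16-byte aligned; computes C = A*B.
void gemm4x4_SSE(float *A, float *B, float *C) {
    __m128 row[4], sum[4];
    for (int i = 0; i < 4; i++) row[i] = _mm_load_ps(&B[i*4]);
    for (int i = 0; i < 4; i++) {
        sum[i] = _mm_setzero_ps();
        for (int j = 0; j < 4; j++)
            sum[i] = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(A[i*4+j]), row[j]), sum[i]);
    }
    for (int i = 0; i < 4; i++) _mm_store_ps(&C[i*4], sum[i]);
}
```

Multiplying any A by the identity should reproduce A, which makes a cheap correctness check before benchmarking.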
