Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?


Problem description


I am trying to find the most efficient implementation of 4x4 matrix (M) multiplication with a vector (u) using SSE. I mean Mu = v.

As far as I understand there are two primary ways to go about this:

    method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
    method 2) v = u1 col1 + u2 col2 + u3 col3 + u4 col4.
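As a plain-C reference (a minimal sketch; the function names are illustrative, not from the original), the two methods compute the same result:

```c
/* Scalar reference for the two methods above; m is stored row-major
 * as m[row][col], and both functions compute v = M u. */
static void mul_rows(const float m[4][4], const float u[4], float v[4]) {
  /* Method 1: one dot product per row. */
  for (int i = 0; i < 4; i++) {
    v[i] = 0.0f;
    for (int j = 0; j < 4; j++)
      v[i] += m[i][j] * u[j];
  }
}

static void mul_cols(const float m[4][4], const float u[4], float v[4]) {
  /* Method 2: accumulate u[j] times column j. */
  for (int i = 0; i < 4; i++)
    v[i] = 0.0f;
  for (int j = 0; j < 4; j++)
    for (int i = 0; i < 4; i++)
      v[i] += u[j] * m[i][j];
}
```

The SSE functions later in the question are vectorized forms of exactly these two loop orders.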

Method 2 is easy to implement in SSE2. Method 1 can be implemented with either the horizontal add instruction in SSE3 or the dot product instruction in SSE4. However, in all my tests method 2 always outperforms method 1.

One place where I thought method 1 would have an advantage is with a 3x4 matrix, for example in an affine transform, where the last dot product is unnecessary. But even in this case, method 2 on a 4x4 matrix is faster than method 1 on a 3x4 matrix. The only method I have found that is faster than method 2 on a 4x4 matrix is method 2 on a 4x3 matrix.

So what's the point of the horizontal add and dot product instructions? In fact, the dot product instruction gives the worst performance in this case. Maybe it has something to do with the format of the data? If one cannot control how the matrix is ordered, then a transpose is necessary, and in that case maybe method 1 would be better?
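For the transpose case, the rows can be flipped into columns in registers with the standard `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>` and then fed to the column method; a minimal sketch (the helper name is hypothetical):

```c
#include <xmmintrin.h>

/* Transpose four row vectors into column vectors so a row-major matrix
 * can be used with the column method. _MM_TRANSPOSE4_PS works in place,
 * so the rows are copied first. */
static void rows_to_cols(const __m128 rows[4], __m128 cols[4]) {
  cols[0] = rows[0];
  cols[1] = rows[1];
  cols[2] = rows[2];
  cols[3] = rows[3];
  _MM_TRANSPOSE4_PS(cols[0], cols[1], cols[2], cols[3]);
}
```

The shuffle cost of the transpose is paid once per matrix, so it amortizes when many vectors are multiplied by the same matrix.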

See below for some code.

__m128 m4x4v_colSSE(const __m128 cols[4], const __m128 v) {
  __m128 u1 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(0,0,0,0));
  __m128 u2 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(1,1,1,1));
  __m128 u3 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(2,2,2,2));
  __m128 u4 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(3,3,3,3));

  __m128 prod1 = _mm_mul_ps(u1, cols[0]);
  __m128 prod2 = _mm_mul_ps(u2, cols[1]);
  __m128 prod3 = _mm_mul_ps(u3, cols[2]);
  __m128 prod4 = _mm_mul_ps(u4, cols[3]);

  return _mm_add_ps(_mm_add_ps(prod1, prod2), _mm_add_ps(prod3, prod4));
}

__m128 m4x4v_rowSSE3(const __m128 rows[4], const __m128 v) {
  __m128 prod1 = _mm_mul_ps(rows[0], v);
  __m128 prod2 = _mm_mul_ps(rows[1], v);
  __m128 prod3 = _mm_mul_ps(rows[2], v);
  __m128 prod4 = _mm_mul_ps(rows[3], v);

  return _mm_hadd_ps(_mm_hadd_ps(prod1, prod2), _mm_hadd_ps(prod3, prod4));
}

__m128 m4x4v_rowSSE4(const __m128 rows[4], const __m128 v) {
  __m128 prod1 = _mm_dp_ps (rows[0], v, 0xFF);
  __m128 prod2 = _mm_dp_ps (rows[1], v, 0xFF);
  __m128 prod3 = _mm_dp_ps (rows[2], v, 0xFF);
  __m128 prod4 = _mm_dp_ps (rows[3], v, 0xFF);

  return _mm_shuffle_ps(_mm_movelh_ps(prod1, prod2), _mm_movelh_ps(prod3, prod4),  _MM_SHUFFLE(2, 0, 2, 0));
}
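The "method 2 on a 4x3 matrix" variant reported as fastest above might look like the following. This is a hedged sketch: the name `m4x3v_colSSE` and the convention that the fourth column holds the translation (with an implicit w = 1 in the input point) are assumptions, not from the original.

```c
#include <xmmintrin.h>

/* Affine (4x3) variant of the column method: with w == 1, the fourth
 * broadcast-and-multiply disappears and the translation column is
 * added directly, saving one shuffle and one multiply. */
__m128 m4x3v_colSSE(const __m128 cols[4], const __m128 v) {
  __m128 u1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0,0,0,0));
  __m128 u2 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1,1,1,1));
  __m128 u3 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2,2,2,2));

  __m128 prod1 = _mm_mul_ps(u1, cols[0]);
  __m128 prod2 = _mm_mul_ps(u2, cols[1]);
  __m128 prod3 = _mm_mul_ps(u3, cols[2]);

  /* cols[3] is the translation column times w == 1 */
  return _mm_add_ps(_mm_add_ps(prod1, prod2), _mm_add_ps(prod3, cols[3]));
}
```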

Solution

Horizontal add and dot product instructions are complex: they are decomposed into multiple simpler microoperations which are executed by the processor just like simple instructions. The exact decomposition of horizontal add and dot product instructions into microoperations is processor-specific, but on recent Intel processors horizontal add decomposes into 2 SHUFFLE + 1 ADD microoperations, and dot product decomposes into 1 MUL + 1 SHUFFLE + 2 ADD microoperations. Besides the larger number of microoperations, these instructions also stress the instruction decoder in the processor pipeline: Intel processors can decode only one such complex instruction per cycle (compared to four simple instructions). On AMD Bulldozer the relative cost of these complex instructions is even higher.
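To make the decomposition concrete, here is a sketch of the SSE3 row method's final reduction spelled out with plain SSE1 shuffles and adds, roughly the kind of sequence the three haddps instructions expand into (the function name is illustrative):

```c
#include <xmmintrin.h>

/* Reduce four product vectors to {sum(p1), sum(p2), sum(p3), sum(p4)},
 * the same result as the nested _mm_hadd_ps calls, using only SSE1
 * unpack/move and add operations. */
static __m128 hsum4(__m128 p1, __m128 p2, __m128 p3, __m128 p4) {
  __m128 t1 = _mm_unpacklo_ps(p1, p2);  /* a0 b0 a1 b1 */
  __m128 t2 = _mm_unpackhi_ps(p1, p2);  /* a2 b2 a3 b3 */
  __m128 t3 = _mm_unpacklo_ps(p3, p4);  /* c0 d0 c1 d1 */
  __m128 t4 = _mm_unpackhi_ps(p3, p4);  /* c2 d2 c3 d3 */
  __m128 s1 = _mm_add_ps(t1, t2);       /* a0+a2 b0+b2 a1+a3 b1+b3 */
  __m128 s2 = _mm_add_ps(t3, t4);       /* c0+c2 d0+d2 c1+c3 d1+d3 */
  __m128 lo = _mm_movelh_ps(s1, s2);    /* a0+a2 b0+b2 c0+c2 d0+d2 */
  __m128 hi = _mm_movehl_ps(s2, s1);    /* a1+a3 b1+b3 c1+c3 d1+d3 */
  return _mm_add_ps(lo, hi);
}
```

The count here, six shuffle-class instructions and three adds, matches the 2 SHUFFLE + 1 ADD estimate per haddps given above.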

