Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

Views: 33  Published: 2021/12/20 15:50:14  Tags: c performance optimization sse matrix-multiplication

Question

I am trying to find the most efficient SSE implementation of multiplying a 4x4 matrix (M) by a vector (u), that is, Mu = v.

As far as I understand, there are two primary ways to go about this:

    method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
    method 2) v = u1 col1 + u2 col2 + u3 col3 + u4 col4.
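
For reference, the two formulations can be written out in scalar C. This is only a sketch to make the index patterns explicit. The function names and the row-major `M[r][c]` layout are assumptions for illustration, not part of the original question.

```c
#include <assert.h>

/* Method 1: each output element is the dot product of one row of M with u.
   M is assumed row-major here: M[r][c]. */
static void mat4_mul_vec_rows(const float M[4][4], const float u[4], float v[4]) {
    for (int r = 0; r < 4; ++r) {
        v[r] = M[r][0] * u[0] + M[r][1] * u[1] + M[r][2] * u[2] + M[r][3] * u[3];
    }
}

/* Method 2: v is a linear combination of the columns of M, weighted by u. */
static void mat4_mul_vec_cols(const float M[4][4], const float u[4], float v[4]) {
    for (int r = 0; r < 4; ++r) {
        v[r] = 0.0f;
    }
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            v[r] += u[c] * M[r][c];  /* add u[c] times column c */
        }
    }
}

/* Self-check with integer-valued floats, so the comparison is exact. */
static void check_methods_agree(void) {
    const float M[4][4] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}};
    const float u[4] = {1, 2, 3, 4};
    float a[4], b[4];
    mat4_mul_vec_rows(M, u, a);
    mat4_mul_vec_cols(M, u, b);
    for (int i = 0; i < 4; ++i) {
        assert(a[i] == b[i]);
    }
}
```

Both functions compute the same v. They differ only in whether the inner accumulation runs across a row (method 1) or down a column (method 2), and that difference decides whether a SIMD version needs a horizontal reduction.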

Method 2 is easy to implement in SSE2. Method 1 can be implemented with either the horizontal add instruction in SSE3 or the dot product instruction in SSE4. However, in all my tests method 2 always outperforms method 1.

One place where I thought method 1 would have an advantage is a 3x4 matrix, for example for an affine transform. In this case the last dot product is unnecessary. But even then, method 2 on a 4x4 matrix is faster than method 1 on a 3x4 matrix. The only method I have found that is faster than method 2 on a 4x4 matrix is method 2 on a 4x3 matrix.
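
A minimal sketch of what "method 2 on a 4x3 matrix" might look like, assuming 4x3 means three columns of four elements each, so that only three components of the vector contribute. The name `m4x3v_colSSE` is hypothetical, and the question does not show the author's actual 4x3 code.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _mm_mul_ps, _mm_add_ps */

/* Hypothetical 4x3 variant of the column method: cols[0..2] hold the three
   columns; lane 3 of u is ignored. */
static __m128 m4x3v_colSSE(const __m128 cols[3], const __m128 u) {
    /* Broadcast u1, u2, u3 across all four lanes. */
    __m128 u1 = _mm_shuffle_ps(u, u, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 u2 = _mm_shuffle_ps(u, u, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 u3 = _mm_shuffle_ps(u, u, _MM_SHUFFLE(2, 2, 2, 2));

    __m128 prod1 = _mm_mul_ps(u1, cols[0]);
    __m128 prod2 = _mm_mul_ps(u2, cols[1]);
    __m128 prod3 = _mm_mul_ps(u3, cols[2]);

    return _mm_add_ps(_mm_add_ps(prod1, prod2), prod3);
}

/* Self-check with integer-valued floats, so the comparison is exact. */
static void check_m4x3v(void) {
    const __m128 cols[3] = {
        _mm_setr_ps(1, 2, 3, 4), _mm_setr_ps(5, 6, 7, 8), _mm_setr_ps(9, 10, 11, 12)
    };
    float out[4];
    _mm_storeu_ps(out, m4x3v_colSSE(cols, _mm_setr_ps(1, 2, 3, 99)));
    assert(out[0] == 38 && out[1] == 44 && out[2] == 50 && out[3] == 56);
}
```

Compared with the 4x4 column version, this drops one shuffle, one multiply and one add, which is consistent with it being the only variant reported as faster.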

So what's the point of the horizontal add and dot product instructions? In fact, the dot product instruction gives the worst performance in this case. Maybe it has something to do with the format of the data? If one can't choose how the matrix is ordered, then a transpose is necessary, and in that case maybe method 1 would be better?
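
To make the transpose scenario concrete, here is a sketch of applying method 2 to a matrix that arrives as rows, using the standard `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>`. The function name is hypothetical. Whether this beats method 1 on the same data is exactly the open question and would need to be measured.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _MM_TRANSPOSE4_PS, _mm_shuffle_ps, ... */

/* Hypothetical helper: the matrix is given as rows, so transpose local
   copies into columns first, then apply the column method. */
static __m128 m4x4v_rows_via_transpose(const __m128 rows[4], const __m128 u) {
    __m128 c0 = rows[0], c1 = rows[1], c2 = rows[2], c3 = rows[3];
    _MM_TRANSPOSE4_PS(c0, c1, c2, c3);  /* in place: c0..c3 now hold columns */

    __m128 u1 = _mm_shuffle_ps(u, u, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 u2 = _mm_shuffle_ps(u, u, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 u3 = _mm_shuffle_ps(u, u, _MM_SHUFFLE(2, 2, 2, 2));
    __m128 u4 = _mm_shuffle_ps(u, u, _MM_SHUFFLE(3, 3, 3, 3));

    return _mm_add_ps(_mm_add_ps(_mm_mul_ps(u1, c0), _mm_mul_ps(u2, c1)),
                      _mm_add_ps(_mm_mul_ps(u3, c2), _mm_mul_ps(u4, c3)));
}

/* Self-check with integer-valued floats, so the comparison is exact. */
static void check_rows_via_transpose(void) {
    const __m128 rows[4] = {
        _mm_setr_ps(1, 2, 3, 4), _mm_setr_ps(5, 6, 7, 8),
        _mm_setr_ps(9, 10, 11, 12), _mm_setr_ps(13, 14, 15, 16)
    };
    float out[4];
    _mm_storeu_ps(out, m4x4v_rows_via_transpose(rows, _mm_setr_ps(1, 2, 3, 4)));
    assert(out[0] == 30 && out[1] == 70 && out[2] == 110 && out[3] == 150);
}
```

If the same matrix multiplies many vectors, the transpose can be done once outside the loop, so its cost only matters when each matrix is used for a single product.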

See below for some code.

#include <immintrin.h>  /* SSE through SSE4.1 intrinsics */

/* Method 2 (SSE): v = u1*col1 + u2*col2 + u3*col3 + u4*col4. */
__m128 m4x4v_colSSE(const __m128 cols[4], const __m128 v) {
  /* Broadcast each component of v across all four lanes. */
  __m128 u1 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(0,0,0,0));
  __m128 u2 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(1,1,1,1));
  __m128 u3 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(2,2,2,2));
  __m128 u4 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(3,3,3,3));

  __m128 prod1 = _mm_mul_ps(u1, cols[0]);
  __m128 prod2 = _mm_mul_ps(u2, cols[1]);
  __m128 prod3 = _mm_mul_ps(u3, cols[2]);
  __m128 prod4 = _mm_mul_ps(u4, cols[3]);

  return _mm_add_ps(_mm_add_ps(prod1, prod2), _mm_add_ps(prod3, prod4));
}

/* Method 1 (SSE3): elementwise row*v products, reduced with horizontal adds. */
__m128 m4x4v_rowSSE3(const __m128 rows[4], const __m128 v) {
  __m128 prod1 = _mm_mul_ps(rows[0], v);
  __m128 prod2 = _mm_mul_ps(rows[1], v);
  __m128 prod3 = _mm_mul_ps(rows[2], v);
  __m128 prod4 = _mm_mul_ps(rows[3], v);

  /* Two levels of hadd leave the four row sums in lanes 0..3. */
  return _mm_hadd_ps(_mm_hadd_ps(prod1, prod2), _mm_hadd_ps(prod3, prod4));
}

/* Method 1 (SSE4.1): one dot product instruction per row. */
__m128 m4x4v_rowSSE4(const __m128 rows[4], const __m128 v) {
  /* Mask 0xFF: multiply all four lanes, broadcast the sum to all four lanes. */
  __m128 prod1 = _mm_dp_ps (rows[0], v, 0xFF);
  __m128 prod2 = _mm_dp_ps (rows[1], v, 0xFF);
  __m128 prod3 = _mm_dp_ps (rows[2], v, 0xFF);
  __m128 prod4 = _mm_dp_ps (rows[3], v, 0xFF);

  /* movelh gives {p1,p1,p2,p2} and {p3,p3,p4,p4}; pick lanes 0 and 2 of each. */
  return _mm_shuffle_ps(_mm_movelh_ps(prod1, prod2), _mm_movelh_ps(prod3, prod4),  _MM_SHUFFLE(2, 0, 2, 0));
}

Recommended answer

Horizontal add and dot product instructions are complex: they are decomposed into multiple simpler micro-operations, which the processor executes just like simple instructions. The exact decomposition is processor-specific, but on recent Intel processors horizontal add is decomposed into 2 SHUFFLE + 1 ADD micro-operations, and dot product into 1 MUL + 1 SHUFFLE + 2 ADD micro-operations. Besides the larger number of micro-operations, these instructions also stress the instruction decoder in the processor pipeline: Intel processors can decode only one such complex instruction per cycle (compared to 4 simple instructions). On AMD Bulldozer the relative cost of these complex instructions is even higher.
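
The "2 SHUFFLE + 1 ADD" decomposition can be illustrated by writing a horizontal add by hand with plain SSE intrinsics. This is a sketch of the data flow only. The helper name is hypothetical, and it is not a claim about how any particular processor implements `_mm_hadd_ps` internally.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _mm_add_ps */

/* Produces the same result as SSE3 _mm_hadd_ps(a, b):
   {a0+a1, a2+a3, b0+b1, b2+b3}, using two shuffles and one add. */
static __m128 hadd_ps_emulated(__m128 a, __m128 b) {
    __m128 even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* a0 a2 b0 b2 */
    __m128 odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* a1 a3 b1 b3 */
    return _mm_add_ps(even, odd);
}

/* Self-check with integer-valued floats, so the comparison is exact. */
static void check_hadd_emulated(void) {
    float out[4];
    _mm_storeu_ps(out, hadd_ps_emulated(_mm_setr_ps(1, 2, 3, 4),
                                        _mm_setr_ps(10, 20, 30, 40)));
    assert(out[0] == 3 && out[1] == 7 && out[2] == 30 && out[3] == 70);
}
```

Taking the answer's figures at face value, and assuming each simple shuffle, multiply and add counts as one micro-operation, the three functions in the question come to roughly 11 micro-operations for the column method (4 shuffles + 4 multiplies + 3 adds), 13 for the horizontal add version (4 multiplies + 3 × 3), and 19 for the dot product version (4 × 4, plus 2 `_mm_movelh_ps` and 1 shuffle). That ordering matches the timings reported in the question, though real counts vary by microarchitecture.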
