How do I perform 8 x 8 matrix operation using SSE?

Views: 116 Posted: 2016/10/23 23:13:35 c++ sse intrinsics


Problem description

My initial attempt looked like this (suppose we want to multiply):

  __m128 mat[n]; /* rows */
  __m128 vec[n] = {1,1,1,1};
  float outvector[n];
   for (int row=0;row<n;row++) {
       for(int k =3; k < 8; k = k+ 4)
       {
           __m128 mrow = mat[k];
           __m128 v = vec[row];
           __m128 sum = _mm_mul_ps(mrow,v);
           sum= _mm_hadd_ps(sum,sum); /* adds adjacent-two floats */
       }
           _mm_store_ss(&outvector[row],_mm_hadd_ps(sum,sum));
 }

But this clearly doesn't work. How do I approach this?

I should load 4 at a time....

The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?

Recommended answer

OK... I'll use a row-major matrix convention. Each row of [m] requires two __m128 elements to hold 8 floats. The 8x1 vector v is a column vector. Since you're using the haddps instruction, I'll assume SSE3 is available. To find r = [m] * v:

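#include <pmmintrin.h> /* SSE3: _mm_hadd_ps */

/* r = [m] * v. m[i][0] holds columns 0-3 of row i, m[i][1] holds columns 4-7. */
void mul (__m128 r[2], const __m128 m[8][2], const __m128 v[2])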
{
    __m128 t0, t1, t2, t3, r0, r1, r2, r3;

    t0 = _mm_mul_ps(m[0][0], v[0]);
    t1 = _mm_mul_ps(m[1][0], v[0]);
    t2 = _mm_mul_ps(m[2][0], v[0]);
    t3 = _mm_mul_ps(m[3][0], v[0]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r0 = _mm_hadd_ps(t0, t2);

    t0 = _mm_mul_ps(m[0][1], v[1]);
    t1 = _mm_mul_ps(m[1][1], v[1]);
    t2 = _mm_mul_ps(m[2][1], v[1]);
    t3 = _mm_mul_ps(m[3][1], v[1]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r1 = _mm_hadd_ps(t0, t2);

    t0 = _mm_mul_ps(m[4][0], v[0]);
    t1 = _mm_mul_ps(m[5][0], v[0]);
    t2 = _mm_mul_ps(m[6][0], v[0]);
    t3 = _mm_mul_ps(m[7][0], v[0]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r2 = _mm_hadd_ps(t0, t2);

    t0 = _mm_mul_ps(m[4][1], v[1]);
    t1 = _mm_mul_ps(m[5][1], v[1]);
    t2 = _mm_mul_ps(m[6][1], v[1]);
    t3 = _mm_mul_ps(m[7][1], v[1]);

    t0 = _mm_hadd_ps(t0, t1);
    t2 = _mm_hadd_ps(t2, t3);
    r3 = _mm_hadd_ps(t0, t2);

    r[0] = _mm_add_ps(r0, r1);
    r[1] = _mm_add_ps(r2, r3);
}
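
To make the data layout concrete, here is a minimal sketch of how plain float arrays might be loaded into the m[8][2] / v[2] form and checked against a scalar loop. The names mul8x8, max_error_vs_scalar and the SSE3_FN macro are illustrative assumptions, not part of the answer; mul8x8 is just a loop form of the unrolled function above so that the example is self-contained.

```cpp
#include <pmmintrin.h>  // SSE3 intrinsics (_mm_hadd_ps)
#include <cmath>

// Assumption: on GCC/Clang this attribute enables SSE3 for a single function,
// so the file builds without a global -msse3 flag. Elsewhere it expands to nothing.
#if defined(__GNUC__)
#define SSE3_FN __attribute__((target("sse3")))
#else
#define SSE3_FN
#endif

// Loop form of the unrolled mul(): r = [m] * v for a row-major 8 x 8 matrix.
// m[i][0] holds columns 0-3 of row i, m[i][1] holds columns 4-7.
SSE3_FN void mul8x8(__m128 r[2], const __m128 m[8][2], const __m128 v[2])
{
    for (int half = 0; half < 2; ++half) {        // rows 0-3, then rows 4-7
        const __m128 (*row)[2] = m + 4 * half;
        __m128 acc = _mm_setzero_ps();
        for (int c = 0; c < 2; ++c) {             // columns 0-3, then columns 4-7
            __m128 t0 = _mm_mul_ps(row[0][c], v[c]);
            __m128 t1 = _mm_mul_ps(row[1][c], v[c]);
            __m128 t2 = _mm_mul_ps(row[2][c], v[c]);
            __m128 t3 = _mm_mul_ps(row[3][c], v[c]);
            // Two rounds of haddps leave one partial dot product per lane.
            acc = _mm_add_ps(acc, _mm_hadd_ps(_mm_hadd_ps(t0, t1),
                                              _mm_hadd_ps(t2, t3)));
        }
        r[half] = acc;
    }
}

// Hypothetical helper: multiplies plain float arrays with mul8x8 and returns
// the largest absolute difference from a scalar reference loop.
float max_error_vs_scalar(const float a[8][8], const float x[8])
{
    __m128 m[8][2], v[2], r[2];
    for (int i = 0; i < 8; ++i) {
        m[i][0] = _mm_loadu_ps(&a[i][0]);  // unaligned loads: no alignment assumed
        m[i][1] = _mm_loadu_ps(&a[i][4]);
    }
    v[0] = _mm_loadu_ps(&x[0]);
    v[1] = _mm_loadu_ps(&x[4]);

    mul8x8(r, m, v);

    float out[8];
    _mm_storeu_ps(&out[0], r[0]);  // rows 0-3 of the result
    _mm_storeu_ps(&out[4], r[1]);  // rows 4-7 of the result

    float worst = 0.0f;
    for (int i = 0; i < 8; ++i) {
        float ref = 0.0f;
        for (int j = 0; j < 8; ++j) ref += a[i][j] * x[j];
        worst = std::fmax(worst, std::fabs(out[i] - ref));
    }
    return worst;
}
```

The SIMD version adds the products in a different order than the scalar loop, so for general inputs a small rounding difference is to be expected; with small integer-valued inputs both results are exact.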

As for alignment, a variable of type __m128 should be automatically aligned on the stack. With dynamic memory, this is not a safe assumption. Some malloc / new implementations may only return memory guaranteed to be 8-byte aligned.
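
As a rough illustration of the stack case, the sketch below checks the alignment of __m128 at compile time and uses C++11 alignas to give a plain float buffer the same guarantee. The helper names is_aligned16 and sum8_from_aligned_stack are hypothetical, and alignas is an addition that the answer itself does not mention.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdint>

// On the compilers assumed here, __m128 carries a 16-byte alignment requirement.
static_assert(alignof(__m128) == 16, "__m128 is expected to be 16-byte aligned");

// Hypothetical helper: true if p sits on a 16-byte boundary.
bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Sums 8 floats from a 16-byte-aligned stack buffer using aligned loads.
float sum8_from_aligned_stack()
{
    alignas(16) float buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    __m128 lo = _mm_load_ps(buf);      // aligned load: needs a 16-byte boundary
    __m128 hi = _mm_load_ps(buf + 4);  // 4 floats further = 16 bytes, still aligned
    alignas(16) float s[4];
    _mm_store_ps(s, _mm_add_ps(lo, hi));
    return s[0] + s[1] + s[2] + s[3];
}
```

Without alignas, a plain float array on the stack is only guaranteed 4-byte alignment, in which case _mm_loadu_ps would be the safe choice.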

The intrinsics header provides _mm_malloc and _mm_free. The align parameter should be 16 in this case.
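
For a large array such as the n = 1000 case in the question, an allocation along these lines should work. This is a sketch only: fill_and_sum_aligned is a hypothetical function, and it assumes that n is a multiple of 4 so that the buffer divides evenly into __m128-sized chunks.

```cpp
#include <xmmintrin.h>  // _mm_malloc / _mm_free and SSE intrinsics
#include <cstddef>

// Allocates n floats on a 16-byte boundary, fills them with 1.0f using aligned
// stores, sums them back with aligned loads, and releases the buffer.
// Assumes n is a multiple of 4. Returns -1.0f if the allocation fails.
float fill_and_sum_aligned(std::size_t n)
{
    float* data = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
    if (data == nullptr) return -1.0f;

    const __m128 ones = _mm_set1_ps(1.0f);
    for (std::size_t i = 0; i < n; i += 4)
        _mm_store_ps(data + i, ones);                    // aligned store

    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(data + i));    // aligned load

    float lanes[4];
    _mm_storeu_ps(lanes, acc);

    _mm_free(data);  // memory from _mm_malloc must go back through _mm_free
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Because the base pointer is 16-byte aligned and each step advances by 4 floats (16 bytes), every load and store in the loops stays aligned. Memory obtained from _mm_malloc should not be passed to free or delete.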
