Vectorizing a loop over float x,y,z arrays calculating length and differences using SSE Intrinsics

Question

I am trying to convert a loop I have into SSE intrinsics. I seem to have made fairly good progress, and by that I mean it's in the correct direction, but I appear to have done some of the translation wrong somewhere, as I am not getting the same "correct" answer that the non-SSE code produces.

My original loop, which I unrolled by a factor of 4, looks like this:

int unroll_n = (N/4)*4;

for (int j = 0; j < unroll_n; j++) {
    for (int i = 0; i < unroll_n; i += 4) {
        float rx = x[j] - x[i];
        float ry = y[j] - y[i];
        float rz = z[j] - z[i];
        float r2 = rx*rx + ry*ry + rz*rz + eps;
        float r2inv = 1.0f / sqrt(r2);
        float r6inv = r2inv * r2inv * r2inv;
        float s = m[j] * r6inv;
        ax[i] += s * rx;
        ay[i] += s * ry;
        az[i] += s * rz;
        // unroll i 2
        rx = x[j] - x[i+1];
        ry = y[j] - y[i+1];
        rz = z[j] - z[i+1];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+1] += s * rx;
        ay[i+1] += s * ry;
        az[i+1] += s * rz;
        // unroll i 3
        rx = x[j] - x[i+2];
        ry = y[j] - y[i+2];
        rz = z[j] - z[i+2];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+2] += s * rx;
        ay[i+2] += s * ry;
        az[i+2] += s * rz;
        // unroll i 4
        rx = x[j] - x[i+3];
        ry = y[j] - y[i+3];
        rz = z[j] - z[i+3];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+3] += s * rx;
        ay[i+3] += s * ry;
        az[i+3] += s * rz;
    }
}

I then essentially went line by line through the top section and converted it into SSE intrinsics. The code is below. I'm not totally sure whether the top three lines are needed, but I understand that my data needs to be 16-byte aligned for this to work correctly and optimally.

float *x = malloc(sizeof(float) * N);
float *y = malloc(sizeof(float) * N);
float *z = malloc(sizeof(float) * N); 
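
One caveat: plain malloc doesn't guarantee the 16-byte alignment that _mm_load_ps requires. A minimal sketch using C11 aligned_alloc instead (assuming N is a multiple of 4 so the allocation size is a multiple of the alignment; _mm_malloc/_mm_free would also work):

#include <stdlib.h>  // aligned_alloc (C11)

float *x = aligned_alloc(16, sizeof(float) * N);
float *y = aligned_alloc(16, sizeof(float) * N);
float *z = aligned_alloc(16, sizeof(float) * N);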

for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i+=4) {
        __m128 xj_v = _mm_set1_ps(x[j]);
        __m128 xi_v = _mm_load_ps(&x[i]);
        __m128 rx_v = _mm_sub_ps(xj_v, xi_v);

        __m128 yj_v = _mm_set1_ps(y[j]);
        __m128 yi_v = _mm_load_ps(&y[i]);
        __m128 ry_v = _mm_sub_ps(yj_v, yi_v);

        __m128 zj_v = _mm_set1_ps(z[j]);
        __m128 zi_v = _mm_load_ps(&z[i]);
        __m128 rz_v = _mm_sub_ps(zj_v, zi_v);

        __m128 r2_v = _mm_mul_ps(rx_v, rx_v) + _mm_mul_ps(ry_v, ry_v) + _mm_mul_ps(rz_v, rz_v) + _mm_set1_ps(eps);

        __m128 r2inv_v = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(r2_v));

        __m128 r6inv_1v = _mm_mul_ps(r2inv_v, r2inv_v);
        __m128 r6inv_v = _mm_mul_ps(r6inv_1v, r2inv_v);

        __m128 mj_v = _mm_set1_ps(m[j]);
        __m128 s_v = _mm_mul_ps(mj_v, r6inv_v);

        __m128 axi_v = _mm_load_ps(&ax[i]);
        __m128 ayi_v = _mm_load_ps(&ay[i]);
        __m128 azi_v = _mm_load_ps(&az[i]);

        __m128 srx_v = _mm_mul_ps(s_v, rx_v);
        __m128 sry_v = _mm_mul_ps(s_v, ry_v);
        __m128 srz_v = _mm_mul_ps(s_v, rz_v);

        axi_v = _mm_add_ps(axi_v, srx_v);
        ayi_v = _mm_add_ps(ayi_v, srx_v);
        azi_v = _mm_add_ps(azi_v, srx_v);

        _mm_store_ps(ax, axi_v);
        _mm_store_ps(ay, ayi_v);
        _mm_store_ps(az, azi_v);
    }
}

I feel the main idea is correct, but there is an error (or several) somewhere, as the resulting answer is incorrect.

Answer

I think your only bugs are simple typos, not logic errors; see below.

Can't you just use clang's auto-vectorization? Or do you need to use gcc for this code? Auto-vectorization would let you make SSE, AVX, and (in future) AVX512 versions from the same source with no modifications. Intrinsics aren't scalable to different vector sizes, unfortunately.

Based on your start at vectorizing, I made an optimized version. You should try it out, I'm curious to hear if it's faster than your version with bugfixes, or clang's auto-vectorized version. :)

_mm_store_ps(ax, axi_v);
_mm_store_ps(ay, ayi_v);
_mm_store_ps(az, azi_v);

You loaded from ax[i], but now you're storing to ax[0].
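
That is, the stores need the same index as the loads:

_mm_store_ps(&ax[i], axi_v);
_mm_store_ps(&ay[i], ayi_v);
_mm_store_ps(&az[i], azi_v);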

Also, clang's warning about an unused variable found this bug:

axi_v = _mm_add_ps(axi_v, srx_v);
ayi_v = _mm_add_ps(ayi_v, srx_v);  // should be sry_v
azi_v = _mm_add_ps(azi_v, srx_v);  // should be srz_v


Like I said in my answer on your previous question, you should probably interchange the loops, so the same ax[i+0..3], ay[i+0..3], and az[i+0..3] are used, avoiding that load/store.

Also, if you're not going to use rsqrtps + Newton-Raphson, you should use the transformation I pointed out in my last answer: divide m[j] by sqrt(r2)^3. There's no point dividing 1.0f by something using divps and then multiplying only once later.
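
In intrinsics, that transformation is only three instructions; the same sequence appears in the optimized version further down:

__m128 r2sqrt = _mm_sqrt_ps(r2_v);
__m128 r3_v   = _mm_mul_ps(r2_v, r2sqrt);  // r2^(3/2) = sqrt(r2)^3
__m128 s_v    = _mm_div_ps(mj_v, r3_v);    // m[j] / sqrt(r2)^3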

rsqrt might not actually be a win, because total uop throughput might be more of a bottleneck than div / sqrt throughput or latency. three multiplies + FMA + rsqrtps is significantly more uops than sqrtps + divps. rsqrt is more helpful with AVX 256b vectors, because the divide / sqrt unit isn't full-width on SnB to Broadwell. Skylake has 12c latency sqrtps ymm, same as for xmm, but throughput is still better for xmm (one per 3c instead of one per 6c).

clang and gcc were both using rsqrtps / rsqrtss when compiling your code with -ffast-math. (only clang using the packed version, of course.)

If you don't interchange the loops, you should manually hoist everything that only depends on j out of the inner loop. Compilers are generally good at this, but it still seems like a good idea to make the source reflect what you expect the compiler to be able to do. This helps with "seeing" what the loop is actually doing.
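
A minimal sketch of that hoisting, keeping the original j-outer loop order:

for (int j = 0; j < N; j++) {
    __m128 xj_v = _mm_set1_ps(x[j]);   // these depend only on j: broadcast once per j
    __m128 yj_v = _mm_set1_ps(y[j]);
    __m128 zj_v = _mm_set1_ps(z[j]);
    __m128 mj_v = _mm_set1_ps(m[j]);
    for (int i = 0; i < N; i += 4) {
        // ... same body as before, using the hoisted xj_v / yj_v / zj_v / mj_v
    }
}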

Here's a version with some optimizations over your original:

  • To get gcc/clang to fuse the mul/adds into FMA, I used -ffp-contract=fast. This gets FMA instructions for high throughput without using -ffast-math. (There is a lot of parallelism with the three separate accumulators, so the increased latency of FMA compared to addps shouldn't hurt at all. I expect port0/1 throughput is the limiting factor here.) I thought gcc did this automatically, but it seems it doesn't here without -ffast-math.
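
For reference, that contraction is equivalent to writing the FMA intrinsics by hand, roughly like this sketch (assuming an FMA-capable target such as -march=haswell; the code below relies on the compiler contracting, not on hand-written FMA):

__m128 r2_v = _mm_set1_ps(eps);
r2_v = _mm_fmadd_ps(rx_v, rx_v, r2_v);   // r2 = eps + rx*rx
r2_v = _mm_fmadd_ps(ry_v, ry_v, r2_v);   // r2 += ry*ry
r2_v = _mm_fmadd_ps(rz_v, rz_v, r2_v);   // r2 += rz*rz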

Notice that v^(3/2) = sqrt(v)^3 = sqrt(v)*v. This has lower latency and fewer instructions.

Interchanged the loops, and use broadcast-loads in the inner loop to improve locality (cut bandwidth requirement by 4, or 8 with AVX). Each iteration of the inner loop only reads 4B of new data from each of the four source arrays. (x,y,z, and m). So it makes a lot of use of each cache line while it's hot.

Using broadcast-loads in the inner loop also means we accumulate ax[i + 0..3] in parallel, avoiding the need for a horizontal sum, which takes extra code. (See a previous version of this answer for code with the loops interchanged, but that used vector loads in the inner loop, with stride = 16B.)

It compiles nicely for Haswell with gcc, using FMA. (Still only 128b vector size though, unlike clang's auto-vectorized 256b version). The inner loop is only 20 instructions, and only 13 of those are FPU ALU instructions that need port 0/1 on Intel SnB-family. It makes decent code even with baseline SSE2: no FMA, and needs shufps for the broadcast-loads, but those don't compete for execution units with add/mul.

#include <immintrin.h>

void ffunc_sse128(float *restrict ax, float *restrict ay, float *restrict az,
           const float *x, const float *y, const float *z,
           int N, float eps, const float *restrict m)
{
  for (int i = 0; i < N; i+=4) {
    __m128 xi_v = _mm_load_ps(&x[i]);
    __m128 yi_v = _mm_load_ps(&y[i]);
    __m128 zi_v = _mm_load_ps(&z[i]);

    // vector accumulators for ax[i + 0..3] etc.
    __m128 axi_v = _mm_setzero_ps();
    __m128 ayi_v = _mm_setzero_ps();
    __m128 azi_v = _mm_setzero_ps();

    // AVX broadcast-loads are as cheap as normal loads,
    // and data-reuse meant that stand-alone load instructions were used anyway.
    // so we're not even losing out on folding loads into other insns
    // An inner-loop stride of only 4B is a huge win if memory / cache bandwidth is a bottleneck
    // even without AVX, the shufps instructions are cheap,
    // and don't compete with add/mul for execution units on Intel

    for (int j = 0; j < N; j++) {
      __m128 xj_v = _mm_set1_ps(x[j]);
      __m128 rx_v = _mm_sub_ps(xj_v, xi_v);

      __m128 yj_v = _mm_set1_ps(y[j]);
      __m128 ry_v = _mm_sub_ps(yj_v, yi_v);

      __m128 zj_v = _mm_set1_ps(z[j]);
      __m128 rz_v = _mm_sub_ps(zj_v, zi_v);

      __m128 mj_v = _mm_set1_ps(m[j]);   // do the load early

      // sum of squared differences
      __m128 r2_v = _mm_set1_ps(eps) + rx_v*rx_v + ry_v*ry_v + rz_v*rz_v;   // GNU extension
      /* __m128 r2_v = _mm_add_ps(_mm_set1_ps(eps), _mm_mul_ps(rx_v, rx_v));
      r2_v = _mm_add_ps(r2_v, _mm_mul_ps(ry_v, ry_v));
      r2_v = _mm_add_ps(r2_v, _mm_mul_ps(rz_v, rz_v));
      */

      // rsqrt and a Newton-Raphson iteration might have lower latency
      // but there's enough surrounding instructions and cross-iteration parallelism
      // that the single-uop sqrtps and divps instructions probably aren't a bottleneck
      __m128 r2sqrt = _mm_sqrt_ps(r2_v);
      __m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt);  // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
      __m128 s_v = _mm_div_ps(mj_v, r6sqrt);

      __m128 srx_v = _mm_mul_ps(s_v, rx_v);
      __m128 sry_v = _mm_mul_ps(s_v, ry_v);
      __m128 srz_v = _mm_mul_ps(s_v, rz_v);

      axi_v = _mm_add_ps(axi_v, srx_v);
      ayi_v = _mm_add_ps(ayi_v, sry_v);
      azi_v = _mm_add_ps(azi_v, srz_v);
    }
    _mm_store_ps(&ax[i], axi_v);
    _mm_store_ps(&ay[i], ayi_v);
    _mm_store_ps(&az[i], azi_v);
  }
}

I also tried a version with rcpps, but IDK if it will be faster. Note that with -ffast-math, gcc and clang will convert the division into rcpps + a Newton iteration. (They for some reason don't convert 1.0f/sqrtf(x) into rsqrt + Newton, even in a stand-alone function). clang does a better job, using FMA for the iteration step.

#define USE_RSQRT
#ifndef USE_RSQRT
      // even with -mrecip=vec-sqrt after -ffast-math, this still does sqrt(v)*v, then rcpps
      __m128 r2sqrt = _mm_sqrt_ps(r2_v);
      __m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt);  // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
      __m128 s_v = _mm_div_ps(mj_v, r6sqrt);
#else
      __m128 r2isqrt = rsqrt_float4_single(r2_v);
      // can't use the sqrt(v)*v trick, unless we either do normal sqrt first then rcpps
      // or rsqrtps and rcpps.  Maybe it's possible to do a Newton-Raphson iteration on that product
      // instead of refining them both separately?
      __m128 r6isqrt = r2isqrt * r2isqrt * r2isqrt;
      __m128 s_v = _mm_mul_ps(mj_v, r6isqrt);
#endif
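
rsqrt_float4_single isn't defined in this excerpt; a plausible implementation is the usual rsqrtps estimate refined by one Newton-Raphson iteration (my reconstruction, not necessarily the exact helper referenced above):

// x1 = x0 * (1.5 - 0.5 * a * x0*x0), refining the ~12-bit rsqrtps estimate
static inline __m128 rsqrt_float4_single(__m128 a) {
    __m128 x0   = _mm_rsqrt_ps(a);
    __m128 ax02 = _mm_mul_ps(_mm_mul_ps(a, x0), x0);          // a * x0 * x0
    return _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x0),
                      _mm_sub_ps(_mm_set1_ps(3.0f), ax02));   // 0.5 * x0 * (3 - a*x0^2)
}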
