Optimisation using SSE Intrinsics

Problem Description

I am trying to convert a loop I have into SSE intrinsics. I seem to have made fairly good progress, and by that I mean it's in the correct direction; however, I appear to have done some of the translation wrong somewhere, as I am not getting the same "correct" answer that results from the non-SSE code.

My original loop, which I unrolled by a factor of 4, looks like this:

int unroll_n = (N/4)*4;

for (int j = 0; j < unroll_n; j++) {
    for (int i = 0; i < unroll_n; i+=4) {
        float rx = x[j] - x[i];
        float ry = y[j] - y[i];
        float rz = z[j] - z[i];
        float r2 = rx*rx + ry*ry + rz*rz + eps;
        float r2inv = 1.0f / sqrt(r2);
        float r6inv = r2inv * r2inv * r2inv;
        float s = m[j] * r6inv;
        ax[i] += s * rx;
        ay[i] += s * ry;
        az[i] += s * rz;
        //unroll i 2
        rx = x[j] - x[i+1];
        ry = y[j] - y[i+1];
        rz = z[j] - z[i+1];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+1] += s * rx;
        ay[i+1] += s * ry;
        az[i+1] += s * rz;
        //unroll i 3
        rx = x[j] - x[i+2];
        ry = y[j] - y[i+2];
        rz = z[j] - z[i+2];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+2] += s * rx;
        ay[i+2] += s * ry;
        az[i+2] += s * rz;
        //unroll i 4
        rx = x[j] - x[i+3];
        ry = y[j] - y[i+3];
        rz = z[j] - z[i+3];
        r2 = rx*rx + ry*ry + rz*rz + eps;
        r2inv = 1.0f / sqrt(r2);
        r6inv = r2inv * r2inv * r2inv;
        s = m[j] * r6inv;
        ax[i+3] += s * rx;
        ay[i+3] += s * ry;
        az[i+3] += s * rz;
    }
}

I essentially then went line by line through the top section and converted it into SSE intrinsics. The code is below. I'm not totally sure whether the top three lines are needed; however, I understand that my data needs to be 16-byte aligned for this to work correctly and optimally.

float *x = malloc(sizeof(float) * N);
float *y = malloc(sizeof(float) * N);
float *z = malloc(sizeof(float) * N); 
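
Note that plain malloc is not guaranteed to return 16-byte-aligned pointers on every platform, and _mm_load_ps / _mm_store_ps require 16-byte alignment. A minimal sketch of an aligned allocation, assuming C11 aligned_alloc is available (its size argument must be a multiple of the alignment, which holds when N is a multiple of 4):

#include <stdlib.h>  // aligned_alloc (C11), free

float *x = aligned_alloc(16, sizeof(float) * N);  // 16-byte aligned, safe for _mm_load_ps
float *y = aligned_alloc(16, sizeof(float) * N);
float *z = aligned_alloc(16, sizeof(float) * N);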

for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i+=4) {
        __m128 xj_v = _mm_set1_ps(x[j]);
        __m128 xi_v = _mm_load_ps(&x[i]);
        __m128 rx_v = _mm_sub_ps(xj_v, xi_v);

        __m128 yj_v = _mm_set1_ps(y[j]);
        __m128 yi_v = _mm_load_ps(&y[i]);
        __m128 ry_v = _mm_sub_ps(yj_v, yi_v);

        __m128 zj_v = _mm_set1_ps(z[j]);
        __m128 zi_v = _mm_load_ps(&z[i]);
        __m128 rz_v = _mm_sub_ps(zj_v, zi_v);

        __m128 r2_v = _mm_mul_ps(rx_v, rx_v) + _mm_mul_ps(ry_v, ry_v) + _mm_mul_ps(rz_v, rz_v) + _mm_set1_ps(eps);

        __m128 r2inv_v = _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(r2_v));

        __m128 r6inv_1v = _mm_mul_ps(r2inv_v, r2inv_v);
        __m128 r6inv_v = _mm_mul_ps(r6inv_1v, r2inv_v);

        __m128 mj_v = _mm_set1_ps(m[j]);
        __m128 s_v = _mm_mul_ps(mj_v, r6inv_v);

        __m128 axi_v = _mm_load_ps(&ax[i]);
        __m128 ayi_v = _mm_load_ps(&ay[i]);
        __m128 azi_v = _mm_load_ps(&az[i]);

        __m128 srx_v = _mm_mul_ps(s_v, rx_v);
        __m128 sry_v = _mm_mul_ps(s_v, ry_v);
        __m128 srz_v = _mm_mul_ps(s_v, rz_v);

        axi_v = _mm_add_ps(axi_v, srx_v);
        ayi_v = _mm_add_ps(ayi_v, srx_v);
        azi_v = _mm_add_ps(azi_v, srx_v);

        _mm_store_ps(ax, axi_v);
        _mm_store_ps(ay, ayi_v);
        _mm_store_ps(az, azi_v);
    }
}

I feel the main idea is correct; however, there is an error (or errors) somewhere, as the resulting answer is incorrect.

Thanks in advance.

Answer

I think your only bugs are simple typos, not logic errors; see below.

Can't you just use clang's auto-vectorization? Or do you need to use gcc for this code? Auto-vectorization would let you make SSE, AVX, and (in future) AVX512 versions from the same source with no modifications. Intrinsics aren't scalable to different vector sizes, unfortunately.

Based on your start at vectorizing, I made an optimized version. You should try it out; I'm curious to hear whether it's faster than your version with the bugfixes, or than clang's auto-vectorized version. :)

_mm_store_ps(ax, axi_v);
_mm_store_ps(ay, ayi_v);
_mm_store_ps(az, azi_v);

You loaded from ax[i], but now you're storing to ax[0].
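
The fixed stores (matching what the full version at the end of this answer does):

_mm_store_ps(&ax[i], axi_v);
_mm_store_ps(&ay[i], ayi_v);
_mm_store_ps(&az[i], azi_v);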

Also, clang's unused-variable warning found this bug:

axi_v = _mm_add_ps(axi_v, srx_v);
ayi_v = _mm_add_ps(ayi_v, srx_v);  // should be sry_v
azi_v = _mm_add_ps(azi_v, srx_v);  // should be srz_v


Like I said in my answer on your previous question, you should probably interchange the loops, so the same ax[i+0..3], ay[i+0..3], and az[i+0..3] are used, avoiding that load/store.

Also, if you're not going to use rsqrtps + Newton-Raphson, you should use the transformation I pointed out in my last answer: divide m[j] by sqrt(r2)^3. There's no point dividing 1.0f by something using a divps and then multiplying only once later.
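
In scalar terms, the transformation looks like this (a sketch reusing the question's variable names; r2inv cubed equals 1 / (r2 * sqrt(r2)), so both forms compute the same s):

// before: divide 1.0f by sqrt(r2), cube the result, then multiply by m[j]
float r2inv    = 1.0f / sqrtf(r2);
float s_before = m[j] * r2inv * r2inv * r2inv;

// after: one sqrt and one divide, no 1.0f/x intermediate
float s_after  = m[j] / (r2 * sqrtf(r2));   // m[j] / sqrt(r2)^3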

rsqrt might not actually be a win, because total uop throughput might be more of a bottleneck than div/sqrt throughput or latency. Three multiplies + FMA + rsqrtps is significantly more uops than sqrtps + divps. rsqrt is more helpful with AVX 256b vectors, because the divide/sqrt unit isn't full-width on SnB through Broadwell. Skylake has 12c latency sqrtps ymm, the same as for xmm, but throughput is still better for xmm (one per 3c instead of one per 6c).

clang and gcc were both using rsqrtps / rsqrtss when compiling your code with -ffast-math. (Only clang used the packed version, of course.)

If you don't interchange the loops, you should manually hoist everything that only depends on j out of the inner loop. Compilers are generally good at this, but it still seems like a good idea to make the source reflect what you expect the compiler to be able to do. This helps with "seeing" what the loop is actually doing.
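
For example, keeping your original loop order, here is a sketch with everything j-dependent hoisted out of the inner loop (this version also has the store/typo bugs fixed, and folds in the m[j] / sqrt(r2)^3 transformation; same variables as your code):

for (int j = 0; j < N; j++) {
    // broadcasts that depend only on j, hoisted out of the i loop
    __m128 xj_v = _mm_set1_ps(x[j]);
    __m128 yj_v = _mm_set1_ps(y[j]);
    __m128 zj_v = _mm_set1_ps(z[j]);
    __m128 mj_v = _mm_set1_ps(m[j]);
    for (int i = 0; i < N; i += 4) {
        __m128 rx_v = _mm_sub_ps(xj_v, _mm_load_ps(&x[i]));
        __m128 ry_v = _mm_sub_ps(yj_v, _mm_load_ps(&y[i]));
        __m128 rz_v = _mm_sub_ps(zj_v, _mm_load_ps(&z[i]));
        __m128 r2_v = _mm_add_ps(_mm_add_ps(_mm_mul_ps(rx_v, rx_v), _mm_mul_ps(ry_v, ry_v)),
                                 _mm_add_ps(_mm_mul_ps(rz_v, rz_v), _mm_set1_ps(eps)));
        // s = m[j] / sqrt(r2)^3
        __m128 s_v = _mm_div_ps(mj_v, _mm_mul_ps(r2_v, _mm_sqrt_ps(r2_v)));
        _mm_store_ps(&ax[i], _mm_add_ps(_mm_load_ps(&ax[i]), _mm_mul_ps(s_v, rx_v)));
        _mm_store_ps(&ay[i], _mm_add_ps(_mm_load_ps(&ay[i]), _mm_mul_ps(s_v, ry_v)));
        _mm_store_ps(&az[i], _mm_add_ps(_mm_load_ps(&az[i]), _mm_mul_ps(s_v, rz_v)));
    }
}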

Here's a version with some optimizations over your original:


  • To get gcc/clang to fuse the mul/adds into FMA, I used -ffp-contract=fast. This gets FMA instructions for high throughput without using -ffast-math. (There is a lot of parallelism with the three separate accumulators, so the increased latency of FMA compared to addps shouldn't hurt at all. I expect port 0/1 throughput to be the limiting factor here.) I thought gcc did this automatically, but it seems it doesn't here without -ffast-math.

  • Notice that v^(3/2) = sqrt(v)^3 = sqrt(v)*v. This has lower latency and fewer instructions.

  • Interchanged the loops, and used broadcast-loads in the inner loop to improve locality (cutting the bandwidth requirement by a factor of 4, or 8 with AVX). Each iteration of the inner loop reads only 4B of new data from each of the four source arrays (x, y, z, and m), so it makes a lot of use of each cache line while it's hot.

  • Using broadcast-loads in the inner loop also means we accumulate ax[i + 0..3] in parallel, avoiding the need for a horizontal sum, which takes extra code (see the hsum sketch below). (See a previous version of this answer for code with the loops interchanged, but using vector loads in the inner loop, with stride = 16B.)
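
For reference, the horizontal sum that the broadcast-load layout avoids would look something like this (a standard SSE hsum sketch, not part of the final code below):

#include <immintrin.h>

// Sum the four floats of v into one scalar (needed if you accumulate one
// particle's ax in a vector and must collapse it at the end)
static inline float hsum_ps_sse(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // swap element pairs
    __m128 sums = _mm_add_ps(v, shuf);                           // [a+b, a+b, c+d, c+d]
    shuf = _mm_movehl_ps(shuf, sums);                            // move [c+d, c+d] to the low half
    sums = _mm_add_ss(sums, shuf);                               // (a+b) + (c+d) in element 0
    return _mm_cvtss_f32(sums);
}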

It compiles nicely for Haswell with gcc, using FMA. (Still only 128b vector size, though, unlike clang's auto-vectorized 256b version.) The inner loop is only 20 instructions, and only 13 of those are FPU ALU instructions that need port 0/1 on Intel SnB-family. It makes decent code even with baseline SSE2: no FMA, and it needs shufps for the broadcast-loads, but those don't compete with add/mul for execution units.

#include <immintrin.h>

void ffunc_sse128(float *restrict ax, float *restrict ay, float *restrict az,
           const float *x, const float *y, const float *z,
           int N, float eps, const float *restrict m)
{
  for (int i = 0; i < N; i+=4) {
    __m128 xi_v = _mm_load_ps(&x[i]);
    __m128 yi_v = _mm_load_ps(&y[i]);
    __m128 zi_v = _mm_load_ps(&z[i]);

    // vector accumulators for ax[i + 0..3] etc.
    __m128 axi_v = _mm_setzero_ps();
    __m128 ayi_v = _mm_setzero_ps();
    __m128 azi_v = _mm_setzero_ps();

    // AVX broadcast-loads are as cheap as normal loads,
    // and data-reuse meant that stand-alone load instructions were used anyway.
    // so we're not even losing out on folding loads into other insns
    // An inner-loop stride of only 4B is a huge win if memory / cache bandwidth is a bottleneck
    // even without AVX, the shufps instructions are cheap,
    // and don't compete with add/mul for execution units on Intel

    for (int j = 0; j < N; j++) {
      __m128 xj_v = _mm_set1_ps(x[j]);
      __m128 rx_v = _mm_sub_ps(xj_v, xi_v);

      __m128 yj_v = _mm_set1_ps(y[j]);
      __m128 ry_v = _mm_sub_ps(yj_v, yi_v);

      __m128 zj_v = _mm_set1_ps(z[j]);
      __m128 rz_v = _mm_sub_ps(zj_v, zi_v);

      __m128 mj_v = _mm_set1_ps(m[j]);   // do the load early

      // sum of squared differences
      __m128 r2_v = _mm_set1_ps(eps) + rx_v*rx_v + ry_v*ry_v + rz_v*rz_v;   // GNU extension
      /* __m128 r2_v = _mm_add_ps(_mm_set1_ps(eps), _mm_mul_ps(rx_v, rx_v));
      r2_v = _mm_add_ps(r2_v, _mm_mul_ps(ry_v, ry_v));
      r2_v = _mm_add_ps(r2_v, _mm_mul_ps(rz_v, rz_v));
      */

      // rsqrt and a Newton-Raphson iteration might have lower latency
      // but there's enough surrounding instructions and cross-iteration parallelism
      // that the single-uop sqrtps and divps instructions probably aren't a bottleneck
      __m128 r2sqrt = _mm_sqrt_ps(r2_v);
      __m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt);  // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
      __m128 s_v = _mm_div_ps(mj_v, r6sqrt);

      __m128 srx_v = _mm_mul_ps(s_v, rx_v);
      __m128 sry_v = _mm_mul_ps(s_v, ry_v);
      __m128 srz_v = _mm_mul_ps(s_v, rz_v);

      axi_v = _mm_add_ps(axi_v, srx_v);
      ayi_v = _mm_add_ps(ayi_v, sry_v);
      azi_v = _mm_add_ps(azi_v, srz_v);
    }
    _mm_store_ps(&ax[i], axi_v);
    _mm_store_ps(&ay[i], ayi_v);
    _mm_store_ps(&az[i], azi_v);
  }
}
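
A minimal usage sketch (assumptions: N is a multiple of 4, and the arrays come from a 16-byte-aligned allocator such as C11 aligned_alloc; ffunc_sse128 zeroes the accumulators itself):

#include <stdlib.h>

int main(void) {
    int N = 1024;                                   // must be a multiple of 4
    float eps = 1e-9f;
    float *x  = aligned_alloc(16, sizeof(float) * N);
    float *y  = aligned_alloc(16, sizeof(float) * N);
    float *z  = aligned_alloc(16, sizeof(float) * N);
    float *m  = aligned_alloc(16, sizeof(float) * N);
    float *ax = aligned_alloc(16, sizeof(float) * N);
    float *ay = aligned_alloc(16, sizeof(float) * N);
    float *az = aligned_alloc(16, sizeof(float) * N);
    for (int i = 0; i < N; i++) {                   // arbitrary test data
        x[i] = i * 0.5f;  y[i] = i * 0.25f;  z[i] = i * 0.125f;  m[i] = 1.0f;
    }
    ffunc_sse128(ax, ay, az, x, y, z, N, eps, m);
    free(x); free(y); free(z); free(m);
    free(ax); free(ay); free(az);
    return 0;
}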

I also tried a version with rcpps, but IDK if it will be faster. Note that with -ffast-math, gcc and clang will convert the division into rcpps + a Newton iteration. (For some reason, they don't convert 1.0f/sqrtf(x) into rsqrt + Newton, even in a stand-alone function.) clang does a better job, using FMA for the iteration step.

#define USE_RSQRT
#ifndef USE_RSQRT
      // even with -mrecip=vec-sqrt after -ffast-math, this still does sqrt(v)*v, then rcpps
      __m128 r2sqrt = _mm_sqrt_ps(r2_v);
      __m128 r6sqrt = _mm_mul_ps(r2_v, r2sqrt);  // v^(3/2) = sqrt(v)^3 = sqrt(v)*v
      __m128 s_v = _mm_div_ps(mj_v, r6sqrt);
#else
      __m128 r2isqrt = rsqrt_float4_single(r2_v);
      // can't use the sqrt(v)*v trick, unless we either do normal sqrt first then rcpps
      // or rsqrtps and rcpps.  Maybe it's possible to do a Newton-Raphson iteration on that product
      // instead of refining them both separately?
      __m128 r6isqrt = r2isqrt * r2isqrt * r2isqrt;
      __m128 s_v = _mm_mul_ps(mj_v, r6isqrt);
#endif
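
rsqrt_float4_single here is the refined-rsqrt helper from the answer linked above; it isn't defined in this snippet. A sketch of what such a helper typically looks like, assuming one Newton-Raphson refinement step on _mm_rsqrt_ps (not necessarily the exact version from that answer):

// Approximate 1/sqrt(x): rsqrtps gives ~12 bits of precision; one
// Newton-Raphson step (y1 = y0 * (3 - x*y0^2) / 2) brings it to ~23 bits.
static inline __m128 rsqrt_float4_single(__m128 x) {
    __m128 y0  = _mm_rsqrt_ps(x);
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y0), y0);          // x * y0^2, close to 1.0
    return _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), y0),
                      _mm_sub_ps(_mm_set1_ps(3.0f), xyy));   // y0 * (3 - x*y0^2) / 2
}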
