Handling zeroes in _mm256_rsqrt_ps()

Problem description

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:

_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
              eightFloats);

Is the way to go for that extra bit of performance and avoiding a pipeline stall.

Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?

_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
                                            _mm256_set1_ps(0.1))),
              eightFloats);

My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:

float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
    int thisSqrt = (int) floor(sqrt(vals[i]));
    sum += min(thisSqrt, 0x3F);
}

All the values of vals should be integer values. (Why everything isn't just int is a different question...)

Answer

tl;dr: See the end for code that compiles and should work.

To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
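
In intrinsics, the andnps variant might look something like this (a sketch with an invented helper name, not clang's actual output):

#include <immintrin.h>

// Lanes where x == 0.0 produce 0 * +Inf = NaN, so compare against zero and
// clear exactly those lanes with andnot instead of blending.
static inline __m256 sqrt_zero_safe(__m256 x) {
    __m256 nr    = _mm256_rsqrt_ps(x);                  // ~1/sqrt(x); +Inf when x == 0
    __m256 sq    = _mm256_mul_ps(x, nr);                // x*rsqrt(x) ~= sqrt(x); NaN when x == 0
    __m256 zmask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_EQ_OQ);  // all-ones where x == 0.0
    return _mm256_andnot_ps(zmask, sq);                 // (~zmask) & sq: zero the NaN lanes
}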

It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa (http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x#comment15698502_1528751). rsqrt(0) = +Inf, and recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.
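
A sketch of that double-approximation (invented helper name); zero inputs fall out naturally, but note the compounded error:

#include <immintrin.h>

// rsqrt(0) = +Inf and recip(+Inf) = 0, so x == 0 gives 0 with no masking.
// Both approximations have ~1.5*2^-12 relative error, which compounds.
static inline __m256 sqrt_via_rcp(__m256 x) {
    return _mm256_rcp_ps(_mm256_rsqrt_ps(x));
}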

Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.

Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).

See also:

  • Newton Raphson formula explained
  • Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.

It might be possible to kill two birds with one stone by adding a small constant offset before computing (x+offset)*approx_rsqrt(x+offset) and then truncating to integer: large enough to overcome the max relative error of 1.5 * 2^-12, but small enough not to bump sqrt_approx(63*63 - 1 + offset) up to 63 (the most sensitive case).

63*1.5*2^(-12)         ==  0.023071...
approx_sqrt(63*63-1)   == 62.99206... +/- 0.023068..

Actually, we're screwed without a Newton iteration even without adding anything: approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where sqrt(n*n-1) plus the max relative error still stays below n. GNU Calc:

define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140

Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?

You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63), to see if the error in practice is small enough on Intel Haswell. But that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5 * 2^-12 requires more instructions. I don't see any tricks that avoid a NR iteration.
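
If you did want to run that brute-force check, a sketch might look like this (invented helper name; remember that x*rsqrt(x) is NaN at x == 0, and that a pass only validates the CPU it ran on):

#include <immintrin.h>
#include <cmath>
#include <cstdio>

static bool check_rsqrt_domain(void) {
    for (int n = 1; n <= 63*63; n++) {                        // skip 0: 0 * +Inf = NaN
        __m256 v = _mm256_set1_ps((float)n);
        __m256 approx = _mm256_mul_ps(v, _mm256_rsqrt_ps(v)); // no NR refinement
        float got = _mm256_cvtss_f32(approx);
        int want = (int)std::sqrt((double)n);
        if ((int)got != want) {
            std::printf("n=%d: truncates to %d, want %d\n", n, (int)got, want);
            return false;
        }
    }
    return true;
}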

If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.

; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087   # (x^2+0.5)/sqrt(x^2+0.5)
 ~0.00732879495710064718  ~0.00731963968187500662   # relative error

Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.

The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).

FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result (http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x/1528751#1528751).

I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).

sum_fp doesn't auto-vectorize, even with the full -ffast-math.

However, clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX^2; otherwise converting back to integer will give the "indefinite" result of INT_MIN, i.e. sign bit set, all other bits cleared. This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.)
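
A tiny demo of that "indefinite" conversion result (illustrative only, invented function name):

#include <immintrin.h>
#include <cstdio>

// A float above INT_MAX converts to INT_MIN = 0x80000000 with cvttps2dq.
static void demo_indefinite(void) {
    __m256  too_big = _mm256_set1_ps(3e9f);            // 3e9 > INT_MAX
    __m256i indef   = _mm256_cvttps_epi32(too_big);    // every lane becomes INT_MIN
    std::printf("%d\n", _mm_cvtsi128_si32(_mm256_castsi256_si128(indef)));  // -2147483648
}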

See the asm output on godbolt for:

#include <cmath>      // sqrtf
#include <algorithm>  // std::min
#define MAX 4096

// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant
// (removed because clang doesn't optimize away the roundps).
int sum_int(float vals[]){
  int sum = 0;
  for (int i = 0; i < MAX; i++) {
    int thisSqrt = (int) sqrtf(vals[i]);
    sum += std::min(thisSqrt, 0x3F);
  }
  return sum;
}

To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.

I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).

As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Each element adds at most 0x3F = 63 to its lane, so a lane would need upwards of 2^31 / 63 ≈ 34 million elements before it could overflow.) (Without having benchmarked anything or even tested it.)

I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division). And because the compounded error is larger.

The horizontal sum is from this answer:

#include <immintrin.h>
#define MAX 4096

// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
    // clang / gcc usually use -3.0 and -0.5.  We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
    __m256 three = _mm256_set1_ps(3.0f);
    //__m256 half = _mm256_set1_ps(0.5f);  // we omit the *0.5 step

    __m256 nr  = _mm256_rsqrt_ps( x );  // initial approximation for Newton-Raphson
    //           1/sqrt(x) ~=    nr  * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
    // sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
    // 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
    __m256 xnr = _mm256_mul_ps( x, nr );

    __m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three );  // -(xnr*nr) + 3
    return _mm256_mul_ps( xnr, three_minus_muls );
}

// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
    __m256 offset = _mm256_set1_ps(0.5f);    // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
    __m256 xoff = _mm256_add_ps(x, offset);  // avoids 0*Inf = NaN, and rounding error before truncation
    __m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
    __m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
    return _mm256_srli_epi32(i2sqrtx, 1);     // divide by 2 with truncation
    // alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}

__m256i isqrt256_ps_simple_exact(__m256 x) {
    __m256 sqrt_x = _mm256_sqrt_ps(x);
    __m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
    return isqrtx;
}

int hsum_epi32_avx(__m256i x256){
    __m128i xhi = _mm256_extracti128_si256(x256, 1);
    __m128i xlo = _mm256_castsi256_si128(x256);
    __m128i x  = _mm_add_epi32(xlo, xhi);
    __m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
    hl  = _mm_add_epi32(hl, x);
    x   = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
    hl  = _mm_add_epi32(hl, x);
    return _mm_cvtsi128_si32(hl);
}

int sum_int_avx(float vals[]){
  __m256i sum = _mm256_setzero_si256();
  __m256i upperlimit = _mm256_set1_epi32(0x3F);

  for (int i = 0; i < MAX; i+=8) {
    __m256 v = _mm256_loadu_ps(vals+i);
    __m256i visqrt = isqrt256_ps(v);
    // assert visqrt == isqrt256_ps_simple_exact(v) or something
    visqrt = _mm256_min_epi32(visqrt, upperlimit);
    sum = _mm256_add_epi32(sum, visqrt);
  }
  return hsum_epi32_avx(sum);
}

Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code than gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.
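
Since it's untested, here's a minimal sketch of a correctness harness (invented names; it assumes it's appended to the listing above so MAX, sum_int_avx, etc. are in scope), comparing against the exact scalar version:

#include <cmath>
#include <cstdio>

// Scalar reference using exact sqrt, for checking sum_int_avx.
static int sum_int_scalar(float vals[]) {
    int sum = 0;
    for (int i = 0; i < MAX; i++) {
        int thisSqrt = (int) std::sqrt(vals[i]);
        sum += thisSqrt < 0x3F ? thisSqrt : 0x3F;
    }
    return sum;
}

int main(void) {
    static float vals[MAX];
    for (int i = 0; i < MAX; i++)
        vals[i] = (float)(i % (63*63 + 2));   // 0..3970 covers the sensitive cases
    int ref = sum_int_scalar(vals);
    int avx = sum_int_avx(vals);
    std::printf("scalar=%d  avx=%d  %s\n", ref, avx, (ref == avx) ? "OK" : "MISMATCH");
    return ref != avx;
}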

Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. The compiler probably didn't notice the x*initial_guess(x) common subexpression that occurs in the Newton-Raphson iteration formula when you're multiplying the result by x, or something like that. It also does an extra mulps instead of a psrld, because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, and there's also the cmpps/blendvps.

sum_int_avx(float*):
    vpxor   ymm3, ymm3, ymm3
    xor     eax, eax
    vbroadcastss    ymm0, dword ptr [rip + .LCPI4_0]  ; set1(0.5)
    vbroadcastss    ymm1, dword ptr [rip + .LCPI4_1]  ; set1(3.0)
    vpbroadcastd    ymm2, dword ptr [rip + .LCPI4_2]  ; set1(63)
LBB4_1:                                               ; latencies
    vaddps      ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
    vrsqrtps    ymm5, ymm4                            ; 7c
    vmulps      ymm4, ymm4, ymm5   ; x*nr             ; 5c
    vfnmadd213ps ymm5, ymm4, ymm1                     ; 5c
    vmulps      ymm4, ymm4, ymm5                      ; 5c
    vcvttps2dq  ymm4, ymm4                            ; 3c
    vpsrld      ymm4, ymm4, 1      ; 1c this would be a mulps (but not on the critical path) if we did this in the FP domain
    vpminsd     ymm4, ymm4, ymm2                      ; 1c
    vpaddd      ymm3, ymm4, ymm3                      ; 1c
    ; ... (those 9 insns repeated: loop unrolling)
    add     rax, 16
    cmp     rax, 4096
    jl      .LBB4_1
    ;... horizontal sum

IACA thinks that with no unroll, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (with truncation to even numbers using _mm256_and_si256), and only divide by two outside the loop.
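
A sketch of that variant (untested), reusing approx_2sqrt_ps256 and hsum_epi32_avx from the listing above:

// Keep everything in units of 2*sqrt: truncate to even with vpand (which,
// unlike vpsrld, isn't limited to port0 on Haswell), clamp against 2*0x3F,
// and divide by 2 once after the horizontal sum.
int sum_int_avx_2x(float vals[]) {
    __m256i sum          = _mm256_setzero_si256();
    __m256i upperlimit2x = _mm256_set1_epi32(0x3F << 1);   // 126 = 2*63
    __m256i even_mask    = _mm256_set1_epi32(~1);
    __m256  offset       = _mm256_set1_ps(0.5f);

    for (int i = 0; i < MAX; i += 8) {
        __m256 xoff = _mm256_add_ps(_mm256_loadu_ps(vals + i), offset);
        __m256i i2sqrt = _mm256_cvttps_epi32(approx_2sqrt_ps256(xoff));
        i2sqrt = _mm256_and_si256(i2sqrt, even_mask);      // 2*floor(sqrt): truncate to even
        i2sqrt = _mm256_min_epi32(i2sqrt, upperlimit2x);   // clamp in the 2x domain
        sum = _mm256_add_epi32(sum, i2sqrt);
    }
    return hsum_epi32_avx(sum) >> 1;   // the single divide-by-2, outside the loop
}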

Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so probably it's including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-uops (or less with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close-ish to saturating ports 0 and 1. Unless data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.

Interleaving operations from multiple dep chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
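
At the source level, a 2-way interleaving sketch might look like this (reusing the helpers above; the compiler's scheduler still has the final say on instruction order):

int sum_int_avx_2acc(float vals[]) {
    __m256i sum0 = _mm256_setzero_si256();
    __m256i sum1 = _mm256_setzero_si256();
    __m256i upperlimit = _mm256_set1_epi32(0x3F);

    for (int i = 0; i < MAX; i += 16) {   // two independent dep chains per trip
        __m256i a = isqrt256_ps(_mm256_loadu_ps(vals + i));
        __m256i b = isqrt256_ps(_mm256_loadu_ps(vals + i + 8));
        sum0 = _mm256_add_epi32(sum0, _mm256_min_epi32(a, upperlimit));
        sum1 = _mm256_add_epi32(sum1, _mm256_min_epi32(b, upperlimit));
    }
    return hsum_epi32_avx(_mm256_add_epi32(sum0, sum1));
}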

It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.
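
One way to nudge the compiler toward that (a sketch, not a guarantee; the compiler may still rewrite the loop) is a pointer-increment loop, so the load's address is a plain [reg] that can stay micro-fused as a memory operand of vaddps:

int sum_int_avx_ptr(float vals[]) {
    __m256i sum = _mm256_setzero_si256();
    __m256i upperlimit = _mm256_set1_epi32(0x3F);
    for (const float *p = vals, *end = vals + MAX; p != end; p += 8) {
        __m256i visqrt = _mm256_min_epi32(isqrt256_ps(_mm256_loadu_ps(p)), upperlimit);
        sum = _mm256_add_epi32(sum, visqrt);
    }
    return hsum_epi32_avx(sum);
}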
