Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision


QUESTION

Suppose that it is necessary to compute reciprocal or reciprocal square root for packed floating point data. Both can easily be done by:

__m128 recip_float4_ieee(__m128 x) { return _mm_div_ps(_mm_set1_ps(1.0f), x); }
__m128 rsqrt_float4_ieee(__m128 x) { return _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x)); }

This works perfectly well but is slow: according to the guide, they take 14 and 28 cycles (throughput) on Sandy Bridge. The corresponding AVX versions take almost the same time on Haswell.

On the other hand, the following versions can be used instead:

__m128 recip_float4_half(__m128 x) { return _mm_rcp_ps(x); }
__m128 rsqrt_float4_half(__m128 x) { return _mm_rsqrt_ps(x); }

They take only one or two cycles (throughput), giving a major performance boost. However, they are VERY approximate: they produce results with relative error less than 1.5 * 2^-12. Given that the machine epsilon of single-precision float numbers is 2^-24, we can say that this approximation has roughly half precision.

It seems that a Newton-Raphson iteration can be added to produce a result with single precision (perhaps not as exact as the IEEE standard requires, though); see GCC, ICC, and discussions at LLVM. Theoretically, the same method can be used for double-precision values, producing half, single, or double precision.
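
For reference, these are the standard Newton-Raphson updates that such code implements; they follow from applying Newton's method to f(r) = 1/r - x (reciprocal) and f(r) = 1/r^2 - x (reciprocal square root), and each step roughly doubles the number of correct bits:

recip: r_new = r * (2 - x*r)
rsqrt: r_new = 0.5 * r * (3 - x*r*r)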

I'm interested in working implementations of this approach for both float and double data types and for all (half, single, double) precisions. Handling special cases (division by zero, sqrt(-1), inf/nan and the like) is not necessary. Also, it is not clear to me which of these routines would be faster than the trivial IEEE-compliant solutions, and which would be slower.

Here are a few minor constraints on answers, please:

  1. Use intrinsics in your code samples. Assembly is compiler-dependent, so it is less useful.
  2. Use a similar naming convention for functions.
  3. Implement routines taking a single SSE/AVX register containing densely packed float/double values as input. If there is a considerable performance boost, you can also post routines taking several registers as input (two regs may be viable).
  4. Do not post SSE and AVX versions that are identical up to changing _mm to _mm256 and vice versa.

Any performance estimates, measurements, discussions are welcome.

SUMMARY

Here are the versions for single-precision float numbers with one NR iteration:

__m128 recip_float4_single(__m128 x) {
  // ~12-bit estimate, then one Newton-Raphson step: res = res*(2 - x*res)
  __m128 res = _mm_rcp_ps(x);
  __m128 muls = _mm_mul_ps(x, _mm_mul_ps(res, res));   // x*res^2
  return _mm_sub_ps(_mm_add_ps(res, res), muls);       // 2*res - x*res^2
}
__m128 rsqrt_float4_single(__m128 x) {
  // ~12-bit estimate, then one Newton-Raphson step: res = 0.5*res*(3 - x*res^2)
  __m128 three = _mm_set1_ps(3.0f), half = _mm_set1_ps(0.5f);
  __m128 res = _mm_rsqrt_ps(x);
  __m128 muls = _mm_mul_ps(_mm_mul_ps(x, res), res);   // x*res^2
  return _mm_mul_ps(_mm_mul_ps(half, res), _mm_sub_ps(three, muls));
}
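
For a quick sanity check of the accuracy, something like the following can be used (a minimal sketch; the test values are arbitrary and the functions above are assumed to be in scope):

#include <xmmintrin.h>
#include <math.h>
#include <stdio.h>

int main(void) {
  float in[4] = { 0.25f, 2.0f, 3.0f, 100.0f };
  float out[4];
  _mm_storeu_ps(out, rsqrt_float4_single(_mm_loadu_ps(in)));
  for (int i = 0; i < 4; i++) {
    float exact = 1.0f / sqrtf(in[i]);
    printf("x=%g approx=%.9g exact=%.9g rel.err=%g\n",
           in[i], out[i], exact, (out[i] - exact) / exact);
  }
  return 0;
}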

The answer given by Peter Cordes explains how to create other versions and contains a thorough theoretical performance analysis.

You can find all the implemented solutions, with benchmarks, here: recip_rsqrt_benchmark.

The obtained throughput results on Ivy Bridge are presented below. Only single-register SSE implementations have been benchmarked. Time spent is given in cycles per call. The first number is for half precision (no NR), the second is for single precision (1 NR iteration), the third is for 2 NR iterations; the number after "versus" is for the trivial IEEE-compliant solution.

  1. recip on float takes 1, 4 cycles versus 7 cycles.
  2. rsqrt on float takes 1, 6 cycles versus 14 cycles.
  3. recip on double takes 3, 6, 9 cycles versus 14 cycles.
  4. rsqrt on double takes 3, 8, 13 cycles versus 28 cycles.

Warning: I had to round raw results creatively...

SOLUTION

There are lots of examples of the algorithm in practice. For example:

The question "Newton Raphson with SSE2 - can someone explain me these 3 lines" has an answer explaining the iteration used by one of Intel's examples.

For a perf analysis on, let's say, Haswell (which can do FP mul on two execution ports, unlike previous designs), I'll reproduce the code here (with one op per line). See http://agner.org/optimize/ for tables of instruction throughput and latency, and for docs giving more background on how to interpret them.

// execute (aka dispatch) on cycle 1, results ready on cycle 6
nr = _mm_rsqrt_ps( x );

// both of these execute on cycle 6, results ready on cycle 11
xnr = _mm_mul_ps( x, nr );         // dep on nr
half_nr = _mm_mul_ps( half, nr );  // dep on nr

// can execute on cycle 11, result ready on cycle 16
muls = _mm_mul_ps( xnr , nr );     // dep on xnr

// can execute on cycle 16, result ready on cycle 19
three_minus_muls = _mm_sub_ps( three, muls );  // dep on muls

// can execute on cycle 19, result ready on cycle 24
result = _mm_mul_ps( half_nr, three_minus_muls ); // dep on three_minus_muls
// result is an approximation of 1/sqrt(x), with ~22 to 23 bits of precision in the mantissa.

There is a LOT of room for other computation to overlap here, if it's not part of the dependency chain. However, if the data for each iteration of your code is dependent on data from the previous one, you might be better off with the 11-cycle latency of sqrtps. The same is true if each loop iteration is so long that out-of-order execution can't hide it all by overlapping independent iterations.

To get sqrt(x) instead of 1/sqrt(x), multiply by x (and fix it up if x can be zero, e.g. by masking (_mm_andnot_ps) with the result of a CMPPS against 0.0), as sketched below. The optimal way is to replace half_nr with half_xnr = _mm_mul_ps( half, xnr );. That doesn't lengthen the dep chain at all, because half_xnr can start on cycle 11 but isn't needed until the end (cycle 19). The same holds with FMA available: no increase in total latency.
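
Here is a sketch of that (the function name and the exact fixup are mine; rsqrtps(0) gives +Inf, so without the mask 0 * Inf would produce NaN instead of the desired sqrt(0) = 0):

__m128 sqrt_float4_single(__m128 x) {
  __m128 three = _mm_set1_ps(3.0f), half = _mm_set1_ps(0.5f);
  __m128 nr = _mm_rsqrt_ps(x);                  // ~12-bit 1/sqrt(x)
  __m128 xnr = _mm_mul_ps(x, nr);               // ~sqrt(x)
  __m128 half_xnr = _mm_mul_ps(half, xnr);      // off the critical path
  __m128 muls = _mm_mul_ps(xnr, nr);            // ~1.0
  __m128 res = _mm_mul_ps(half_xnr, _mm_sub_ps(three, muls)); // ~22-23-bit sqrt(x)
  __m128 zero_mask = _mm_cmpeq_ps(x, _mm_setzero_ps());
  return _mm_andnot_ps(zero_mask, res);         // force 0.0 where x == 0.0
}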

If 11 bits of precision are enough (no Newton iteration), Intel's optimization manual (section 11.12.3) suggests using rcpps(rsqrt(x)), which is worse than multiplying by the original x, at least with AVX. It may save a movdqa instruction with 128-bit SSE, but 256b rcpps is slower than 256b mul or fma. (And multiplying by x lets you add the sqrt result to something for free with FMA instead of mul for the last step.)
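
For illustration, the two ~11-bit alternatives side by side (a sketch; variable names are mine, and neither form handles x == 0):

__m128 r = _mm_rsqrt_ps(x);
__m128 sqrt_a = _mm_rcp_ps(r);      // Intel's rcpps(rsqrt(x)) suggestion
__m128 sqrt_b = _mm_mul_ps(x, r);   // multiply by the original x (see above)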


The SSE version of this loop, not accounting for any move instructions, is 6 uops. This means it should have a throughput on Haswell of one per 3 cycles (given that two execution ports can handle FP mul, and rsqrt is on the opposite port from FP add/sub). On SnB/IvB (and probably Nehalem), it should have a throughput of one per 5 cycles, since mulps and rsqrtps compete for port 0. subps is on port1, which isn't the bottleneck.

For Haswell, we can use FMA to combine the subtract with the mul. However, the dividers / sqrt unit isn't 256b wide, so unlike everything else, divps / sqrtps / rsqrtps / rcpps on ymm regs takes extra uops and extra cycles.

// vrsqrtps ymm has higher latency
// execute on cycle 1, results ready on cycle 8
nr = _mm256_rsqrt_ps( x );

// both of these can execute on cycle 8, results ready on cycle 13
xnr = _mm256_mul_ps( x, nr );         // dep on nr
half_nr = _mm256_mul_ps( half, nr );  // dep on nr

// can execute on cycle 13, result ready on cycle 18
three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three );  // -(xnr*nr) + 3

// can execute on cycle 18, result ready on cycle 23
result = _mm256_mul_ps( half_nr, three_minus_muls ); // dep on three_minus_muls

We save 3 cycles with FMA. This is offset by using the 2-cycle-slower 256b rsqrt, for a net gain of 1 cycle less latency (pretty good for twice the width). SnB/IvB AVX would be 24c latency for 128b, 26c latency for 256b.
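
For convenience, here is the FMA version above packaged as a self-contained function (a sketch; the name is mine; assumes AVX + FMA3, e.g. compile with -mavx -mfma):

#include <immintrin.h>

__m256 rsqrt_float8_single_fma(__m256 x) {
  __m256 three = _mm256_set1_ps(3.0f), half = _mm256_set1_ps(0.5f);
  __m256 nr = _mm256_rsqrt_ps(x);
  __m256 xnr = _mm256_mul_ps(x, nr);
  __m256 half_nr = _mm256_mul_ps(half, nr);
  __m256 three_minus_muls = _mm256_fnmadd_ps(xnr, nr, three);  // 3 - x*nr^2
  return _mm256_mul_ps(half_nr, three_minus_muls);
}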

The 256b FMA version uses 7 uops total. (VRSQRTPS is 3 uops: 2 for p0 and one for p1/5.) 256b mulps and fma are both single-uop instructions, and both can run on port 0 or port 1 (p0 only on pre-Haswell). So throughput should be one per 3c, if the OOO engine dispatches uops to optimal execution ports (i.e. the shuffle uop from rsqrt always goes to p5, never to p1 where it would take up mul/fma bandwidth). As far as overlapping with other computation goes (not just independent execution of itself), again it's pretty lightweight. Only 7 uops with a 23-cycle dep chain leaves lots of room for other stuff to happen while those uops sit in the re-order buffer.

If this is a step in a giant dep chain with nothing else going on (not even an independent next iteration), then 256b vsqrtps is 19-cycle latency, with a throughput of one per 14 cycles (Haswell). If you still really need the reciprocal, then 256b vdivps also has 18-21c latency, with one per 14c throughput. So for a normal sqrt, the instruction has lower latency. For recip sqrt, an iteration of the approximation has lower latency. (And FAR more throughput, if it's overlapping with itself. If it's overlapping with other stuff that doesn't use the divide unit, sqrtps isn't a problem.)

sqrtps can be a throughput win vs. rsqrt + Newton iteration, if it's part of a loop body with enough other non-divide and non-sqrt work going on that the divide unit isn't saturated.

This is especially true if you need sqrt(x), not 1/sqrt(x). E.g. on Haswell with AVX2, a copy+arcsinh loop over an array of floats that fits in L3 cache, implemented as fastlog(v + sqrt(v*v + 1)), runs at about the same throughput with real VSQRTPS or with VRSQRTPS + a Newton-Raphson iteration. (Even with a very fast approximation for log(), so the total loop body is about 9 FMA/add/mul/convert operations, and 2 boolean, plus the VSQRTPS ymm.) There is a speedup from using just fastlog(v2_plus_1 * rsqrt(v2_plus_1) + v2_plus_1), so it's not bottlenecked on memory bandwidth, but it might be bottlenecked on latency (so out-of-order execution can't exploit all the parallelism of independent iterations).

Other precisions

For half-precision, there are no instructions to do computation on half floats. You're meant to convert on the fly as you load/store, using the conversion instructions.
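
For example, with F16C the pattern looks something like this (a sketch; the function name is mine; compile with e.g. -mf16c; no Newton iteration is needed here, since the ~12-bit rsqrtps estimate already exceeds the 11-bit significand precision of half floats):

#include <immintrin.h>
#include <stdint.h>

void rsqrt_half8(const uint16_t *src, uint16_t *dst) {
  __m128i h = _mm_loadu_si128((const __m128i *)src);   // 8 packed half floats
  __m256 f = _mm256_cvtph_ps(h);                       // widen to 8 singles
  f = _mm256_rsqrt_ps(f);
  _mm_storeu_si128((__m128i *)dst,
                   _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT));
}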

For double-precision, there is no rsqrtpd. Presumably, the thinking is that if you don't need full precision, you should be using float in the first place. So you could convert to float and back, then do the exact same algorithm, but with pd instead of ps intrinsics. Or, you could keep your data as float for a while. e.g. convert two ymm registers of doubles into one ymm register of singles.
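
A sketch of the convert-to-float route for two doubles in an xmm register (the name and the iteration count are my choice; two NR steps in double take the ~12-bit float estimate to ~24 and then ~48 bits, close to double precision):

#include <emmintrin.h>

__m128d rsqrt_double2(__m128d x) {
  __m128d three = _mm_set1_pd(3.0), half = _mm_set1_pd(0.5);
  // ~12-bit estimate computed in float; the unused upper float lanes are discarded
  __m128d nr = _mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(x)));
  for (int i = 0; i < 2; i++) {                 // 2 Newton-Raphson steps in double
    __m128d muls = _mm_mul_pd(_mm_mul_pd(x, nr), nr);
    nr = _mm_mul_pd(_mm_mul_pd(half, nr), _mm_sub_pd(three, muls));
  }
  return nr;
}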

Unfortunately there isn't a single instruction that takes two ymm registers of doubles and outputs a ymm of singles. You have to go ymm->xmm twice, then _mm256_insertf128_ps one xmm to the high 128 of the other. But then you can feed that 256b ymm vector to the same algorithm.
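
That packing step looks something like this (a sketch; the name is mine):

__m256 pack_doubles_to_singles(__m256d lo4, __m256d hi4) {
  __m128 lo = _mm256_cvtpd_ps(lo4);   // 4 doubles -> 4 singles in an xmm
  __m128 hi = _mm256_cvtpd_ps(hi4);
  return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}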

If you are going to convert back to double right after, it may make sense to do the rsqrt + Newton-Raphson iteration on the two 128b registers of singles separately. The extra 2 uops to insert / extract, and the extra 2 uops for 256b rsqrt, start to add up, not to mention the 3-cycle latency of vinsertf128 / vextractf128. Computation will overlap, and both results will be ready a couple cycles apart. (Or 1 cycle apart, if you have a special version of your code for interleaving operations on 2 inputs at once).

Remember that single-precision has a smaller range of exponents than double, so the conversion can overflow to +Inf or underflow to 0.0. If you're not sure, definitely just use the normal _mm_sqrt_pd.


With AVX512F, there's _mm512_rsqrt14_pd( __m512d a). With AVX512ER (KNL but not SKX or Cannonlake), there's _mm512_rsqrt28_pd. _ps versions of course also exist. 14 bits of mantissa precision may be enough to use without a Newton iteration in more cases.
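
If 14 bits are not enough, one Newton step on top of the AVX512F estimate gives roughly 28 good bits (a sketch; the name is mine):

__m512d rsqrt14_nr1_pd(__m512d x) {
  __m512d three = _mm512_set1_pd(3.0), half = _mm512_set1_pd(0.5);
  __m512d nr = _mm512_rsqrt14_pd(x);
  __m512d muls = _mm512_mul_pd(_mm512_mul_pd(x, nr), nr);
  return _mm512_mul_pd(_mm512_mul_pd(half, nr), _mm512_sub_pd(three, muls));
}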

A Newton iteration still won't give you a correctly rounded single-precision float the way regular sqrt will.
