Performance penalty: denormalized numbers versus branch mis-predictions


Question


For those that have already measured or have deep knowledge about this kind of considerations, assume that you have to do the following (just to pick any for the example) floating-point operator:

float calc(float y, float z)
{ return sqrt(y * y + z * z) / 100; }

Where y and z could be denormal numbers, let's assume two possible situations where just y, just z, or maybe both, in a totally random manner, can be denormal numbers

  • 50% of the time
  • <1% of the time

And now assume I want to avoid the performance penalty of dealing with denormal numbers and I just want to treat them as 0, and I change that piece of code by:

float calc(float y, float z)
{
   bool yzero = y < 1e-37;
   bool zzero = z < 1e-37;
   bool all_zero = yzero and zzero;
   bool some_zero = yzero != zzero;

   if (all_zero)
      return 0f;

   float ret;

   if (!some_zero) ret = sqrt(y * y + z * z);
   else if (yzero) ret = z;
   else if (zzero) ret = y;

   return ret / 100;
}

Which will be worse: the performance penalty for branch misprediction (for the 50% or <1% cases), or the performance penalty for working with denormal numbers?

To properly interpret which operations can be normal or denormal in the previous piece of code, I would also like some one-line (but totally optional) answers about the following closely related questions:

float x = 0f; // Will x be just 0 or maybe some number like 1e-40;
float y = 0.; // I assume the conversion is just thin-air here and the compiler will see just a 0.
0; // Is "exact zero" a normal or a denormal number?
float z = x / 1; // Will this "no-op" (x == 0) cause z to be something like 1e-40 and thus denormal?
float zz = x / c; // What about a "no-op" operating against any compile-time constant?
bool yzero = y < 1e-37; // Do comparisons have a performance penalty when y is denormal, or not?

Solution

There's HW support for this for free in many ISAs including x86, see below re: FTZ / DAZ. Most compilers set those flags during startup when you compile with -ffast-math or equivalent.

Also note that your code fails to avoid the penalty (on HW where there is any) in some cases: y * y or z * z can be subnormal for small but normalized y or z. (Good catch, @chtz). The exponent of y*y is twice the exponent of y, more negative or more positive. With 23 explicit mantissa bits in a float, that's about 12 exponent values that are the square roots of subnormal values, and wouldn't underflow all the way to 0.

Squaring a subnormal always gives underflow to 0; subnormal input may be less likely to have a penalty than subnormal output for a multiply, I don't know. Having a subnormal penalty or not can vary by operation within one microarchitecture, like add/sub vs. multiply vs. divide.

Also, any negative y or z gets treated as 0, which is probably a bug unless your inputs are known non-negative.
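If a magnitude test is what was intended, a minimal sketch of the fix is to compare the absolute value instead (the `effectively_zero` name is mine; the 1e-37f cutoff is just the question's threshold, written as a float literal to avoid promotion to double):

```cpp
#include <cmath>

// Sketch: treat anything with magnitude below the cutoff as zero, so
// negative inputs are no longer silently flushed to 0 by the comparison.
bool effectively_zero(float v)
{
    return std::fabs(v) < 1e-37f;  // float literal: no cvtss2sd needed
}
```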

if results can vary so widely, x86 microarchitectures will be my main use case

Yes, penalties (or lack thereof) vary greatly.

Historically (P6-family) Intel used to always take a very slow microcode assist for subnormal results and subnormal inputs, including for compares. Modern Intel CPUs (Sandybridge-family) handle some but not all FP operations on subnormal operands without needing a microcode assist. (perf event fp_assists.any)

The microcode assist is like an exception and flushes the out-of-order pipeline, and takes over 160 cycles on SnB-family, vs. ~10 to 20 for a branch miss. And branch misses have "fast recovery" on modern CPUs. True branch-miss penalty depends on surrounding code; e.g. if the branch condition is really late to be ready it can result in discarding a lot of later independent work. But a microcode assist is still probably worse if you expect it to happen frequently.

Note that you can check for a subnormal using integer ops: just check the exponent field for all zero (and the mantissa for non-zero: the all-zero encoding for 0.0 is technically a special case of a subnormal). So you could manually flush to zero with integer SIMD operations like andps/pcmpeqd/andps.
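A sketch of both ideas, assuming IEEE binary32 floats (the function names are illustrative, and the SIMD version maps -0.0 and subnormals alike to +0.0):

```cpp
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2 for pcmpeqd

// Scalar sketch: subnormal iff the exponent field is all zero and the
// mantissa is non-zero (the all-zero pattern encodes 0.0 itself).
bool is_subnormal(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // safe type-pun
    return (bits & 0x7F800000u) == 0u      // exponent == 0
        && (bits & 0x007FFFFFu) != 0u;     // mantissa != 0
}

// SIMD sketch of the andps/pcmpeqd idea: zero every lane whose exponent
// field is all zero.
__m128 flush_subnormals(__m128 v)
{
    const __m128i exp_mask = _mm_set1_epi32(0x7F800000);
    __m128i exp  = _mm_and_si128(_mm_castps_si128(v), exp_mask); // andps
    __m128i zexp = _mm_cmpeq_epi32(exp, _mm_setzero_si128());    // pcmpeqd
    return _mm_andnot_ps(_mm_castsi128_ps(zexp), v);             // andnps
}
```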

Agner Fog's microarch PDF has some info; he mentions this in general without a fully detailed breakdown for each uarch. I don't think https://uops.info/ tests for normal vs. subnormal unfortunately.

Knights Landing (KNL) only has subnormal penalties for division, not add / mul. Like GPUs, they took an approach that favoured throughput over latency and have enough pipeline stages in their FPU to handle subnormals in the hardware equivalent of branchlessly. Even though this might mean higher latency for every FP operation.

AMD Bulldozer / Piledriver have a ~175 cycle penalty for results that are "subnormal or underflow", unless FTZ is set. Agner doesn't mention subnormal inputs. Steamroller/Excavator don't have any penalties.

AMD Ryzen (from Agner Fog's microarch pdf)

Floating point operations that give a subnormal result take a few clock cycles extra. The same is the case when a multiplication or division underflows to zero. This is far less than the high penalty on the Bulldozer and Piledriver. There is no penalty when flush-to-zero mode and denormals-are-zero mode are both on.

By contrast, Intel Sandybridge-family (at least Skylake) doesn't have penalties for results that underflow all the way to 0.0.

Intel Silvermont (Atom) from Agner Fog's microarch pdf

Operations that have subnormal numbers as input or output or generate underflow take approximately 160 clock cycles unless the flush-to-zero mode and denormals-are-zero mode are both used.

This would include compares.


I don't know the details for any non-x86 microarchitectures, like ARM cortex-a76 or any RISC-V to pick a couple random examples that might also be relevant. Mispredict penalties vary wildly as well, across simple in-order pipelines vs. deep OoO exec CPUs like modern x86. True mispredict penalty also depends on surrounding code.


And now assume I want to avoid the performance penalty of dealing with denormal numbers and I just want to treat them as 0

Then you should set your FPU to do that for you for free, removing all possibility of penalties from subnormals.

Some / most(?) modern FPUs (including x86 SSE but not legacy x87) let you treat subnormals (aka denormals) as zero for free, so this problem only occurs if you want this behaviour for some functions but not all within the same thread, with switching too fine-grained to be worth changing the FP control register to FTZ and back.

Or it could be relevant if you wanted to write fully portable code that is terrible nowhere, even if that meant ignoring HW support and thus being slower than it could be.

Some x86 CPUs do even rename MXCSR so changing the rounding mode or FTZ/DAZ might not have to drain the out-of-order back-end. It's still not cheap and you'd want to avoid doing it every few FP instructions.

ARM also supports a similar feature: subnormal IEEE 754 floating point numbers support on iOS ARM devices (iPhone 4) - but apparently the default setting for ARM VFP / NEON is to treat subnormals as zero, favouring performance over strict IEEE compliance.

See also flush-to-zero behavior in floating-point arithmetic about cross-platform availability of this.


On x86 the specific mechanism is that you set the DAZ and FTZ bits in the MXCSR register (SSE FP math control register; also has bits for FP rounding mode, FP exception masks, and sticky FP masked-exception status bits). https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz shows the layout and also discusses some performance effects on older Intel CPUs. Lots of good background / introduction.

Compiling with -ffast-math will link in some extra startup code that sets FTZ/DAZ before calling main. IIRC, threads inherit the MXCSR settings from the main thread on most OSes.

  • DAZ = Denormals Are Zero, treats input subnormals as zero. This affects compares (whether or not they would have experienced a slowdown) making it impossible to even tell the difference between 0 and a subnormal other than using integer stuff on the bit-pattern.
  • FTZ = Flush To Zero, subnormal outputs from calculations are just underflowed to zero, i.e. gradual underflow is disabled. (Note that multiplying two small normal numbers can underflow. I think add/sub of normal numbers whose mantissas cancel out except for the low few bits could produce a subnormal as well.)

Usually you simply set both or neither. If you're processing input data from another thread or process, or compile-time constants, you could still have subnormal inputs even if all results you produce are normalized or 0.
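In C++ with intrinsics, enabling both modes for the current thread can be sketched like this (the helper name is mine; MXCSR is per-thread, so every worker thread that wants this behaviour must call it too):

```cpp
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (FTZ)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (DAZ)

// Sketch: do once per thread what the -ffast-math startup code does.
void enable_ftz_daz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // FTZ: outputs
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // DAZ: inputs
}
```

After this, a multiply whose exact result falls in the subnormal range (e.g. 1e-20f * 1e-20f) returns 0.0f instead of a subnormal, and subnormal inputs compare equal to 0.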


Specific random questions:

float x = 0f; // Will x be just 0 or maybe some number like 1e-40;

This is a syntax error. Presumably you mean 0.f or 0.0f.

0.0f is exactly representable (with the bit-pattern 0x00000000) as an IEEE binary32 float, so that's definitely what you will get on any platform that uses IEEE FP. You won't randomly get subnormals that you didn't write.
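That claim is easy to sanity-check with std::fpclassify (C++11), which distinguishes exact zero from a subnormal; the `float_bits` helper is mine:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Sketch: 0.0f is the all-zero bit pattern and classifies as FP_ZERO,
// while a genuinely tiny value like 1e-40f classifies as FP_SUBNORMAL.
uint32_t float_bits(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```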

float z = x / 1; // Will this "no-op" (x == 0) cause z to be something like 1e-40 and thus denormal?

No, IEEE754 doesn't allow 0.0 / 1.0 to give anything other than 0.0.

Again, subnormals don't appear out of thin air. Rounding "error" only happens when the exact result can't be represented as a float or double. The max allowed error for the IEEE "basic" operations (* / + - and sqrt) is 0.5 ulp, i.e. the exact result must be correctly rounded to the nearest representable FP value, right down to the last digit of the mantissa.

 bool yzero = y < 1e-37; // Do comparisons have a performance penalty when y is denormal, or not?

Maybe, maybe not. No penalty on recent AMD or Intel, but is slow on Core 2 for example.

Note that 1e-37 has type double and will cause promotion of y to double. You might hope that this would actually avoid subnormal penalties vs. using 1e-37f. Subnormal float->int has no penalty on Core 2, but unfortunately cvtss2sd does still have the large penalty on Core 2. (GCC/clang don't optimize away the conversion even with -ffast-math, although I think they could, because 1e-37 is exactly representable as a float, and every subnormal float can be exactly represented as a normalized double. So the promotion to double is always exact and can't change the result.)
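The exactness part can be sanity-checked: double's exponent range easily covers float's subnormal range, so the conversion loses nothing and round-trips. A sketch (the helper name is mine):

```cpp
#include <cmath>

// Sketch: converting a subnormal float to double is exact and yields a
// normal (or zero) double, so the promotion can't change a comparison.
bool promotion_is_exact(float s)
{
    double d = s;                                // cvtss2sd
    return std::fpclassify(d) != FP_SUBNORMAL    // no longer subnormal
        && static_cast<float>(d) == s;           // and round-trips exactly
}
```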

On Intel Skylake, comparing two subnormals with vcmplt_oqpd doesn't result in any slowdown, and not with ucomisd into integer FLAGS either. But on Core 2, both are slow.

Comparison, if done like subtraction, does have to shift the inputs to line up their binary place-values, and the implied leading digit of the mantissa is a 0 instead of 1 so subnormals are a special case. So hardware might choose to not handle that on the fast path and instead take a microcode assist. Older x86 hardware might handle this slower.

It could be done differently if you built a special compare ALU separate from the normal add/sub unit. Float bit-patterns can be compared as sign/magnitude integers (with a special case for NaN) because the IEEE exponent bias is chosen to make that work. (i.e. nextafter is just integer ++ or -- on the bit pattern). But this apparently isn't what hardware does.
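The bias trick can be sketched for positive finite floats, where nextafter towards +infinity really is integer ++ on the bit pattern (the `next_up` name is mine, and the sketch deliberately ignores negatives, NaN, and infinity):

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Sketch: for positive finite floats, the bit pattern viewed as an
// unsigned integer orders the same way as the float values, so the
// next representable float is just bits + 1.
float next_up(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    ++bits;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```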


FP conversion to integer is fast even on Core 2, though. cvt[t]ps2dq or the pd equivalent convert packed float/double to int32 with truncation or the current rounding mode. So for example this recent proposed LLVM optimization is safe on Skylake and Core 2, according to my testing.

Also on Skylake, squaring a subnormal (producing a 0) has no penalty. But it does have a huge penalty on Conroe (P6-family).

But multiplying normal numbers to produce a subnormal result has a penalty even on Skylake (~150x slower).
