What is the instruction that gives branchless FP min and max on x86?


Problem Description


To quote (thanks to the author for developing and sharing the algorithm!):

https://tavianator.com/fast-branchless-raybounding-box-intersections/

Since modern floating-point instruction sets can compute min and max without branches

Corresponding code by the author is just

dmnsn_min(double a, double b)
{
  return a < b ? a : b;
}

I'm familiar with e.g. _mm_max_ps, but that's a vector instruction. The code above obviously is meant to be used in a scalar form.

Question:

  • What is the scalar branchless minmax instruction on x86? Is it a sequence of instructions?
  • Is it safe to assume it's going to be applied, or how do I call it?
  • Does it make sense to bother about branchless-ness of min/max? From what I understand, for a raytracer and / or other viz software, given a ray - box intersection routine, there is no reliable pattern for the branch predictor to pick up, hence it does make sense to eliminate the branch. Am I right about this?
  • Most importantly, the algorithm discussed is built around comparing against (+/-) INFINITY. Is this reliable w.r.t the (unknown) instruction we're discussing and the floating-point standard?

Just in case: I'm familiar with Use of min and max functions in C++, believe it's related but not quite my question.

Solution

Warning: Beware of compilers treating _mm_min_ps / _mm_max_ps (and _pd) intrinsics as commutative even in strict FP (not fast-math) mode; even though the asm instruction isn't. GCC specifically seems to have this bug: PR72867 which was fixed in GCC7 but may be back or never fixed for _mm_min_ss etc. scalar intrinsics (_mm_max_ss has different behavior between clang and gcc, GCC bugzilla PR99497).

GCC knows how the asm instructions themselves work, and doesn't have this problem when using them to implement strict FP semantics in plain scalar code, only with the C/C++ intrinsics.

Unfortunately there isn't a single instruction that implements fmin(a,b) (with guaranteed NaN propagation), so you have to choose between easy detection of problems vs. higher performance.


Most vector FP instructions have scalar equivalents. MINSS / MAXSS / MINSD / MAXSD are what you want. They handle +/-Infinity the way you'd expect.

MINSS a,b exactly implements (a<b) ? a : b according to IEEE rules, with everything that implies about signed-zero, NaN, and Infinities. (i.e. it keeps the source operand, b, on unordered.) This means C++ compilers can use them for std::min(b,a) and std::max(b,a), because those functions are based on the same expression. Note the b,a operand order for the std:: functions, opposite Intel-syntax for x86 asm, but matching AT&T syntax.

MAXSS a,b exactly implements (b<a) ? a : b, again keeping the source operand (b) on unordered. Like std::max(b,a).

Looping over an array with x = std::min(arr[i], x); (i.e. minss or maxss xmm0, [rsi]) will take a NaN from memory if one is present, and then take whatever non-NaN element is next because that compare will be unordered. So you'll get the min or max of the elements following the last NaN. You normally don't want this, so it's only good for arrays that don't contain NaN. But it means you can start with float v = NAN; outside a loop, instead of the first element or FLT_MAX or +Infinity, and might simplify handling possibly-empty lists. It's also convenient in asm, allowing init with pcmpeqd xmm0,xmm0 to generate an all-ones bit-pattern (a negative QNAN), but unfortunately GCC's NAN uses a different bit-pattern.

Demo/proof on the Godbolt compiler explorer, including showing that v = std::min(v, arr[i]); (or max) ignores NaNs in the array, at the cost of having to load into a register and then minss into that register.

(Note that min of an array should use vectors, not scalar; preferably with multiple accumulators to hide FP latency. At the end, reduce to one vector then do horizontal min of it, just like summing an array or doing a dot product.)


Don't try to use _mm_min_ss on scalar floats; the intrinsic is only available with __m128 operands, and Intel's intrinsics don't provide any way to get a scalar float into the low element of a __m128 without zeroing the high elements or somehow doing extra work. Most compilers will actually emit the useless instructions to do that even if the final result doesn't depend on anything in the upper elements. (Clang can often avoid it, though, applying the as-if rule to the contents of dead vector elements.) There's nothing like __m256 _mm256_castps128_ps256 (__m128 a) to just cast a float to a __m128 with garbage in the upper elements. I consider this a design flaw. :/

But fortunately you don't need to do this manually, compilers know how to use SSE/SSE2 min/max for you. Just write your C such that they can. The function in your question is ideal: as shown below (Godbolt link):

// can and does inline to a single MINSD instruction, and can auto-vectorize easily
static inline double
dmnsn_min(double a, double b) {
  return a < b ? a : b;
}


Note their asymmetric behaviour with NaN: if the operands are unordered, dest=src (i.e. it takes the second operand if either operand is NaN). This can be useful for SIMD conditional updates, see below.

(a and b are unordered if either of them is NaN. That means a<b, a==b, and a>b are all false. See Bruce Dawson's series of articles on floating point for lots of FP gotchas.)

The corresponding _mm_min_ss / _mm_min_ps intrinsics may or may not have this behaviour, depending on the compiler.

I think the intrinsics are supposed to have the same operand-order semantics as the asm instructions, but gcc has treated the operands to _mm_min_ps as commutative even without -ffast-math for a long time, gcc4.4 or maybe earlier. GCC 7 finally changed it to match ICC and clang.

Intel's online intrinsics finder doesn't document that behaviour for the function, but it's maybe not supposed to be exhaustive. The asm insn ref manual doesn't say the intrinsic doesn't have that property; it just lists _mm_min_ss as the intrinsic for MINSS.

When I googled on "_mm_min_ps" NaN, I found this real code and some other discussion of using the intrinsic to handle NaNs, so clearly many people expect the intrinsic to behave like the asm instruction. (This came up for some code I was writing yesterday, and I was already thinking of writing this up as a self-answered Q&A.)

Given the existence of this longstanding gcc bug, portable code that wants to take advantage of MINPS's NaN handling needs to take precautions. The standard gcc version on many existing Linux distros will mis-compile your code if it depends on the order of operands to _mm_min_ps. So you probably need an #ifdef to detect actual gcc (not clang etc), and an alternative. Or just do it differently in the first place :/ Perhaps with a _mm_cmplt_ps and boolean AND/ANDNOT/OR.

Enabling -ffast-math also makes _mm_min_ps commutative on all compilers.


As usual, compilers know how to use the instruction set to implement C semantics correctly. MINSS and MAXSS are faster than anything you could do with a branch anyway, so just write code that can compile to one of those.

The commutative-_mm_min_ps issue applies to only the intrinsic: gcc knows exactly how MINSS/MINPS work, and uses them to correctly implement strict FP semantics (when you don't use -ffast-math).

You don't usually need to do anything special to get decent scalar code out of a compiler. But if you are going to spend time caring about what instructions the compiler uses, you should probably start by manually vectorizing your code if the compiler isn't doing that.

(There may be rare cases where a branch is best, if the condition almost always goes one way and latency is more important than throughput. MINPS latency is ~3 cycles, but a perfectly predicted branch adds 0 cycles to the dependency chain of the critical path.)


In C++, use std::min and std::max, which are defined in terms of > or <, and don't have the same requirements on NaN behaviour that fmin and fmax do. Avoid fmin and fmax for performance unless you need their NaN behaviour.

In C, I think just write your own min and max functions (or macros if you do it safely).


C & asm on the Godbolt compiler explorer

float minfloat(float a, float b) {
  return (a<b) ? a : b;
}
# any decent compiler (gcc, clang, icc), without any -ffast-math or anything:
    minss   xmm0, xmm1
    ret

// C++
float minfloat_std(float a, float b) { return std::min(a,b); }
  # This implementation of std::min uses (b<a) ? b : a;
  # So it can produce the result only in the register that b was in
  # This isn't worse (when inlined), just opposite
    minss   xmm1, xmm0
    movaps  xmm0, xmm1
    ret


float minfloat_fmin(float a, float b) { return fminf(a, b); }

# clang inlines fmin; other compilers just tailcall it.
minfloat_fmin(float, float):
    movaps  xmm2, xmm0
    cmpunordss      xmm2, xmm2
    movaps  xmm3, xmm2
    andps   xmm3, xmm1
    minss   xmm1, xmm0
    andnps  xmm2, xmm1
    orps    xmm2, xmm3
    movaps  xmm0, xmm2
    ret
   # Obviously you don't want this if you don't need it.


If you want to use _mm_min_ss / _mm_min_ps yourself, write code that lets the compiler make good asm even without -ffast-math.

If you don't expect NaNs, or want to handle them specially, write stuff like

lowest = _mm_min_ps(lowest, some_loop_variable);

so the register holding lowest can be updated in-place (even without AVX).


Taking advantage of MINPS's NaN behaviour:

Say your scalar code is something like

if(some condition)
    lowest = min(lowest, x);

Assume the condition can be vectorized with CMPPS, so you have a vector of elements with the bits all set or all clear. (Or maybe you can get away with ANDPS/ORPS/XORPS on floats directly, if you just care about their sign and don't care about negative zero. This creates a truth value in the sign bit, with garbage elsewhere. BLENDVPS looks at only the sign bit, so this can be super useful. Or you can broadcast the sign bit with PSRAD xmm, 31.)

The straight-forward way to implement this would be to blend x with +Inf based on the condition mask. Or do newval = min(lowest, x); and blend newval into lowest. (either BLENDVPS or AND/ANDNOT/OR).

But the trick is that all-one-bits is a NaN, and a bitwise OR will propagate it. So:

__m128 inverse_condition = _mm_cmplt_ps(foo, bar);
__m128 x = whatever;


x = _mm_or_ps(x, inverse_condition);   // turn elements into NaN where the mask is all-ones
lowest = _mm_min_ps(x, lowest);  // NaN elements in x mean no change in lowest
//  REQUIRES NON-COMMUTATIVE _mm_min_ps: no -ffast-math
//  AND DOESN'T WORK AT ALL WITH MOST GCC VERSIONS.

So with only SSE2, we've done a conditional MINPS in just two extra instructions (ORPS and MOVAPS, unless loop unrolling allows the MOVAPS to disappear).

The alternative without SSE4.1 BLENDVPS is ANDPS/ANDNPS/ORPS to blend, plus an extra MOVAPS. ORPS is more efficient than BLENDVPS anyway (it's 2 uops on most CPUs).
