如何实现浮点值的 totalOrder 谓词? [英] How to implement the totalOrder predicate for floating point values?

查看:25
本文介绍了如何实现浮点值的 totalOrder 谓词?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

IEEE 754 规范在 §5.10 中定义了一个总顺序,我想在汇编中实现.

维基百科描述,听起来很像这样可以实现了无分支,或几乎无分支,但我还没有想出一个像样的方法;我在主要编程语言中找不到任何现有的符合规范的实现

<块引用>

当比较两个浮点数时,它作为≤操作,除了totalOrder(−0, +0) ∧ ¬ totalOrder(+0, -0),同一个浮点数的不同表示是按其指数乘以符号位排序.然后通过排序 -qNaN < 将排序扩展到 NaN.-sNaN <数字<+sNaN <+qNaN,同一类中的两个 NaN 之间的排序基于整数有效负载乘以这些数据的符号位.

首先检查 NaN 然后跳转到浮点比较或处理 NaN 情况是否有意义,还是将浮点值移动到整数寄存器并在那里执行所有操作更有意义?

(至少从阅读描述来看,感觉规范作者努力允许使用整数指令直接实现.)

在 x86-64 处理器上实现浮点全序的最佳"方法是什么?

解决方案

如果您将 FP 位模式作为符号/大小整数进行比较,这一切都有效,包括 -0 +0 和 NaN 位模式1.这就是为什么像 binary64 (double)<这样的 IEEE 格式的原因之一/a> 使用有偏指数并按该顺序放置字段.(另一个是nextaftera> 在位模式上通过 ++-- .)

就 2 的补码整数比较而言,这可以有效地实现:

<小时>

使用 SSE4.2 pcmpgtq,您可以对正常 XMM 寄存器中的双 FP 值进行操作,或 SSE2(x86-64 保证)pcmpgtd 用于 32 位浮点数.(请注意,pcmpgtqpcmpgtd 相比相对较慢:端口更少,延迟更高.https://agner.org/optimize/.例如,在 Skylake 上,1 p5 uop 具有 3c 延迟,而 pcmpgtd 和 pcmpeqq 对于 p0/p1 是 1 uop,具有 1 个周期延迟.)

我们无法仅使用一个 pcmpgtq + 符号修正来处理按位相等的情况.
x1 bitwise_eq x0 无论输入是正还是负,都会给出 0 的 pcmpgtq 结果.根据 sign(x0&x1) 翻转它会产生不一致的行为,无论您希望 0 还是 1 表示 >>=<<= 根据总顺序.但不幸的是,FP 比较的 -0.0 == +0.0 行为意味着我们必须对 FP-equal 进行特殊处理,而不仅仅是 FP-unordered.

您不需要汇编,只需在 C 中键入 uint64_t 即可让编译器可能使用 movq rax, xmm0,或使用内在函数矢量规则.

但是如果你使用 asm,你可以考虑在 ZF=1 上做一个 FP 比较和分支,这将被设置为无序或相等,然后才做整数.如果您希望 NaN 和完全相等(包括 +-0.0 == -+0.0)很少见,这可能会很好地工作.请注意,对于 ucomisd 文档.所有 x86 FP 都以相同的方式比较设置标志,直接或通过 fcom/fnstsw ax/lahf.

例如,独立版本可能如下所示.(内联时简化,例如,如果调用者分支,则直接使用 jb 而不是 setb 分支):

totalOrder: ;EAX 中的 0/1 整数 = (xmm0 <= xmm1 totalOrder)异或 eax, eaxucomisd xmm0, xmm1 ;ZF=0 意味着 PF=0(有序)所以只需检查 ZFjz .compare_as_integer ;无序或 FP 相等;否则 CF 准确地反映了 <或 >(总)xmm0 与 xmm1 的顺序设置 ;或者用 jb 分支回复;;SSE4.2,使用 AVX 3 操作数版本.根据需要为非 AVX 使用 movaps### 未经测试;用于无序或 FP-equal,包括 -0.0 == +0.0;但也包括 -1.0 == -1.0 例如.compare_as_integer: ;一般适用于任何符号/大小整数vpcmpgtq xmm2, xmm1, xmm0 ;比较顺序相反:x1>x0 == x0x0回复

使用 AVX512VL,vpternlogq 可以替换所有 3 个 AND/XOR/OR 操作;它可以实现任意 3 个输入的布尔函数.<代码>(y_gt_x) ^ (x&y) |y_eq_x.

没有 SSE4.2,或者只是作为标量无分支策略,我想出了这个.(例如,如果值实际上在内存中,那么您可以只加载 mov 而不是来自 XMM regs 的 movq).

<代码>;;独立工作,或作为 ucomisd/jz 之后的后备compare_as_integer:movq rcx, xmm0movq rsi, xmm1异或 eax, eaxcmp rcx, rsi;按位相等的特殊情况将简化其余部分设置 ;2 的补码 x <是设置和 rcx, rsi ;也许与 TEST/CMOVS 有关系?shr rcx, 63xor al, cl ;如果两个输入都为负,则翻转 SETL 结果或 al, dl ;按位相等时始终为真回复

EAX 的异或归零 即使在 P6 系列上,在使用 setlcode> 和 8 位 xoror.(为什么 GCC 不使用部分寄存器?).在大多数其他 CPU 上,这里唯一的缺点是对 RDX 旧值的错误依赖,我在 sete dl 之前没有破坏它.如果我先将 EDX 置零,我们可以将 xoror 转换为 EAX.

分支策略可以这样工作:

<代码>;;除非数据是可预测的,否则可能会更慢,例如大多是非负的compare_as_integer_branchy:movq rcx, xmm0movq rsi, xmm1异或 eax, eax ;mov eax,1 with je to a ret 不会避免 setl al 的部分寄存器停顿cmp rcx, rsije .flip_result ;返回 1设置 ;2 的补码 x <是测试 rcx, rsijs .flip_result ;if (x&y 都是负数)回复.flip_result: ;不是按位均衡器,并且两个输入都是负的异或,1回复

如果需要,可以混合和匹配部分内容;AND/SHR/XOR 可以用于非等路径而不是 test+js.

<小时>

如果在对结果进行分支的情况下内联它,则可以将 common(?)-case(有限且不相等)分支放在特殊情况处理之前.但是,特殊情况包括有序 < 因此在 ZF=1(包括 PF=1 无序情况)上希望可预测的分支可能仍然是一个好主意.

 ucomisd xmm1, xmm0ja x1_gt_x0 ;CF==0 &&ZF==0;也许是无序的,也许是 -0 vs +0,也许只是 x1 <x0

<小时>

脚注 1:NaN 编码作为总顺序的一部分

FP 值(及其符号/幅度编码)围绕零对称.符号位始终是符号位,即使对于 NaN 也是如此,因此可以以相同的方式处理.

  • 最小的幅度当然是 +-0.0:所有指数和尾数位为零.
  • 次正规数具有零指数字段(最小值),这意味着尾数的前导零.显式部分非零.幅度与尾数成线性关系.(零实际上只是次正规的一个特例.)
  • 归一化的数字从 exponent = 1 到 exponent <;max,表示尾数前导 1.一个指数内的最高值(所有尾数位设置)刚好低于 ++exponent;尾数=0 值:即从尾数到指数的进位加 1 增加到下一个可表示的浮点数/双精度数
  • +- Infinity 的指数 = 全 1,尾数 = 0
  • +- NaN 的指数 = 全 1,尾数 = 非零
    • 在 x86 上,sNaN 清除了尾数的最高位.Rest 是有效载荷,在任何地方都至少有 1 个设置位(否则它是一个 Inf).
    • 在 x86 上,qNaN 具有尾数集的最高位.休息是有效载荷

https://cwiki.apache.org/confluence/display/stdcxx/FloatingPoint(链接自 NaN 的位模式真的硬件相关?) 在其他几个 ISA 上显示了一些 sNaN 和 qNaN 编码.有些与 x86 不同,但 POWER 和 Alpha 为 qNaN 设置了尾数的 MSB,因此它们的整数幅度大于任何 sNaN.

PA-RISC 选择了相反的方式,因此在该(过时的)ISA 上实现全序谓词需要为 FP-compare 无序情况做额外的工作;如果在进行整数比较之前,如果它们中的任何一个是任何类型的 NaN,那么在两个值中翻转那个位可能会起作用.

(我提到这一点是因为相同的算法可以用在可能不会专门用于 x86 的更高级别的语言中.但是您可能只想保留它并始终以相同的方式处理相同的二进制位模式,即使这意味着在某些平台上 qNaN <小时>

PS:我知道有效数"在技术上更正确,但尾数"的音节较少,我更喜欢它,并且在这种情况下很好理解.

The IEEE 754 specification defines a total order in §5.10, which I want to implement in assembly.

From the Wikipedia description, it sounds a lot like this can be implemented branch-free, or almost branch-free, but I haven't been able to come up with a decent approach; and I couldn't find any existing spec-compliant implementation in major programming languages

When comparing two floating-point numbers, it acts as the ≤ operation, except that totalOrder(−0, +0) ∧ ¬ totalOrder(+0, −0), and different representations of the same floating-point number are ordered by their exponent multiplied by the sign bit. The ordering is then extended to the NaNs by ordering −qNaN < −sNaN < numbers < +sNaN < +qNaN, with ordering between two NaNs in the same class being based on the integer payload, multiplied by the sign bit, of those data.

Does it make sense to first check for NaNs and then either jump to a floating point comparison or handle the NaN case, or does it make more sense to move the floating point value to integer registers and do all the operations there?

(At least from reading the description, it feels like the spec authors made an effort to allow a straightforward implementation with integer instructions.)

What's the "best" way to implement a total order for floating points on x86-64 processors?

解决方案

This all Just Works if you compare the FP bit-patterns as sign/magnitude integers, including -0 < +0 and the NaN bit-patterns1. This is one reason why IEEE formats like binary64 (double) use a biased exponent and put the fields in that order. (Another being convenient implementation of nextafter by ++ or -- on the bit pattern.)

That can be implemented efficiently in terms of 2's complement integer compare:

  • if the signs are both cleared: non-negative numbers Just Work
  • if only one has its sign bit set: any negative is less than any non-negative, including -0.0 < +0.0 as 0x80000000 < 0x00000000 so 2's complement x <= y Just Works.
  • if both have their sign bit set ((x&y)>>63): 2's complement x<y is sign/magnitude FP x>y. In x86 asm, you may avoid the shift and just look at SF, or use the high bit of a SIMD element.

    Handling this without messing up the == case is tricky: you can't just XOR x&y sign into a < result; that would flip it when they compared equal. It would give you <= when both inputs are negative but < for other cases. I'm not sure if that's usable for sorting.


With SSE4.2 pcmpgtq you can operate on double FP values in their normal XMM registers, or SSE2 (guaranteed for x86-64) pcmpgtd for 32-bit float. (Note that pcmpgtq is relatively slow compared to pcmpgtd: fewer ports and higher latency. https://agner.org/optimize/. e.g. on Skylake, 1 p5 uop with 3c latency, vs. pcmpgtd and pcmpeqq being 1 uop for p0/p1 with 1 cycle latency.)

We can't handle the bitwise-equal case using just one pcmpgtq + sign fixups.
x1 bitwise_eq x0 gives a pcmpgtq result of 0 whether or not the inputs are positive or negative. Flipping it based on sign(x0&x1) would give inconsistent behaviour whether you want 0 or 1 to mean >, >=, < or <= according to the total order. But unfortunately the -0.0 == +0.0 behaviour of FP comparisons means we have to special-case on FP-equal, not just FP-unordered.

You don't need assembly, just type-pun to uint64_t in C for example to get the compiler to probably use movq rax, xmm0, or use intrinsics for vector regs.

But if you are using asm, you could consider doing an FP compare and branching on ZF=1 which will be set for unordered or equal, and only then doing integer. If you expect NaN and exact equality (including +-0.0 == -+0.0) to be rare, this could work well. Notice that ZF,CF,PF = 1,1,1 for unordered in the ucomisd docs. All x86 FP compares set flags the same way, either directly or via fcom/fnstsw ax/lahf.

For example a stand-alone version could look like this. (Simplify when inlining, e.g. branch directly with jb instead of setb if the caller branches):

totalOrder:   ; 0/1 integer in EAX = (xmm0 <= xmm1 totalOrder)
    xor      eax, eax
    ucomisd  xmm0, xmm1           ; ZF=0 implies PF=0 (ordered) so just check ZF
    jz    .compare_as_integer     ; unordered or FP-equal
     ; else CF accurately reflects the < or > (total) order of xmm0 vs. xmm1
    setb     al                 ; or branch with jb
    ret

;; SSE4.2, using AVX 3-operand versions.  Use movaps as needed for non-AVX
### Untested
        ; Used for unordered or FP-equal, including -0.0 == +0.0
        ; but also including -1.0 == -1.0 for example
 .compare_as_integer:          ; should work in general for any sign/magnitude integer
    vpcmpgtq xmm2, xmm1, xmm0     ; reversed order of comparison: x1>x0 == x0<x1
    vpand    xmm3, xmm1, xmm0     ; we only care about the MSB of the 64-bit integer
    vpxor    xmm2, xmm3           ; flip if x0 & x1 are negative

    vpcmpeqq xmm1, xmm0
    vpor     xmm2, xmm1
       ; top bits of XMM2 hold the boolean result for each SIMD element
       ; suitable for use with blendvpd

    vmovmskpd eax, xmm2           ; low bit of EAX = valid, high bit might be garbage
    and      eax, 1          ; optional depending on use-case
     ; EAX=1 if x0 bitwise_eq x1 or sign/magnitude x1 > x0
    ret

With AVX512VL, vpternlogq can replace all 3 of the AND/XOR/OR operations; it can implement any arbitrary boolean function of 3 inputs. (y_gt_x) ^ (x&y) | y_eq_x.

Without SSE4.2, or just as a scalar branchless strategy, I came up with this. (e.g. if values were actually in memory so you could just do mov loads instead of movq from XMM regs).

;; works on its own, or as the fallback after ucomisd/jz
compare_as_integer:
    movq     rcx, xmm0
    movq     rsi, xmm1

    xor      eax, eax
    cmp      rcx, rsi
   ; je  bitwise equal special case would simplify the rest
    setl     al                 ; 2's complement x < y
    sete     dl
    and      rcx, rsi           ; maybe something with TEST / CMOVS?
    shr      rcx, 63
    xor      al, cl           ; flip the SETL result if both inputs were negative
    or       al, dl           ; always true on bitwise equal
    ret

The xor-zeroing of EAX should make it safe to read EAX without a partial-reg stall even on P6-family, after writing AL with setl and 8-bit xor and or. (Why doesn't GCC use partial registers?). On most other CPUs, the only downside here is a false dependency on the old value of RDX which I didn't break before sete dl. If I had xor-zeroed EDX first, we could xor and or into EAX.

A branchy strategy could work like this:

;; probably slower unless data is predictable, e.g. mostly non-negative
compare_as_integer_branchy:
    movq     rcx, xmm0
    movq     rsi, xmm1

    xor      eax, eax       ; mov eax,1 with je to a ret wouldn't avoid partial-register stalls for setl al
    cmp      rcx, rsi
    je      .flip_result        ; return 1
    setl     al                 ; 2's complement x < y

    test     rcx, rsi
    js     .flip_result         ; if (x&y both negative)
    ret

.flip_result:         ; not bitwise EQ, and both inputs negative
    xor      al, 1
    ret

Mix and match parts of this if you want; the AND/SHR/XOR could be used along the non-equal path instead of test+js.


If inlining this in a case where you branch on the result, you could put the common(?)-case (finite and not equal) branch ahead of the special case handling. But then the special case includes ordered < so a hopefully-predictable branch on ZF=1 (which includes the PF=1 unordered case) might be a good idea still.

    ucomisd  xmm1, xmm0
    ja       x1_gt_x0                ; CF==0 && ZF==0
    ; maybe unordered, maybe -0 vs +0, maybe just x1 < x0


Footnote 1: NaN encodings as part of the total order

FP values (and their sign/magnitude encodings) are symmetric around zero. The sign bit is always a sign bit, even for NaNs, and can thus be handled the same way.

  • The smallest magnitude are +-0.0 of course: all exponent and mantissa bits zero.
  • The subnormals have a zero exponent field (minimum value) implying a leading zero for the mantissa. The explicit part is non-zero. Magnitude is linear with mantissa. (Zero is actually just a special case of a subnormal.)
  • The normalized numbers span exponent = 1 to exponent < max, implying a leading 1 in the mantissa. The highest value within one exponent (all mantissa bits set) is just below the ++exponent; mantissa=0 value: i.e. increment by 1 with carry from mantissa to exponent increases to next representable float/double
  • +- Infinity has exponent = all-ones, mantissa = 0
  • +- NaN has exponent = all-ones, mantissa = non-zero
    • on x86, sNaN has the highest bit of the mantissa cleared. Rest is payload with at least 1 set bit anywhere (else it's an Inf).
    • on x86, qNaN has the highest bit of the mantissa set. Rest is payload

https://cwiki.apache.org/confluence/display/stdcxx/FloatingPoint (linked from Are the bit patterns of NaNs really hardware-dependent?) shows some sNaN and qNaN encodings on a couple other ISAs. Some differ from x86, but POWER and Alpha has the MSB of the mantissa set for qNaN so they have larger integer magnitude than any sNaN.

PA-RISC chose the other way around so implementing a total order predicate on that (obsolete) ISA would need to do extra work for the FP-compare unordered case; maybe flipping that bit in both values might work if either of them are any kind of NaN, before proceeding with integer compare.

(I mention this because the same algorithm could be used in higher level languages that might not be exclusively used on x86. But you might want to just leave it and always handle the same binary bit-patterns the same way, even if that means qNaN < sNaN on some platforms. You only even get sNaN in the first place by manually writing the bit-pattern.)


PS: I know "significand" is more technically correct, but "mantissa" has fewer syllables and I like it better, and is well understood in this context.

这篇关于如何实现浮点值的 totalOrder 谓词?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆