Why does clang emit a 32-bit float ps instruction for the absolute value of a 64-bit double?


Question


Why is clang turning fabs(double) into vandps instead of vandpd (like GCC does)?

Example from Compiler Explorer:

#include <math.h>

double float_abs(double x) {
    return fabs(x);
}

clang 12.0.1 -std=gnu++11 -Wall -O3 -march=znver3

.LCPI0_0:
        .quad   0x7fffffffffffffff              # double NaN
        .quad   0x7fffffffffffffff              # double NaN
float_abs(double):                          # @float_abs(double)
        vandps  xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        ret

gcc 11.2 -std=gnu++11 -Wall -O3 -march=znver3

float_abs(double):
        vandpd  xmm0, xmm0, XMMWORD PTR .LC0[rip]
        ret
.LC0:
        .long   -1
        .long   2147483647
        .long   0
        .long   0


(Ironically, GCC uses vandpd but defines the constant with 32-bit .long chunks (interestingly with the upper half zero), while clang uses vandps but defines the constant as two .quad halves.)

Answer


TL:DR: Probably because it's easier for the optimizer / code-generator to always do this, instead of only with legacy-SSE instructions to save code-size. There's no performance downside, and they're architecturally equivalent (i.e. no correctness difference.)


Probably clang always "normalizes" architecturally equivalent instructions to their ps version, because those have a shorter machine-code encoding for the legacy-SSE versions.


No existing x86 CPUs have any bypass delay latency for forwarding between ps and pd instructions1, so it's always safe to use [v]andps between [v]mulpd or [v]fmadd...pd instructions.


As "What is the point of SSE2 instructions such as orpd?" points out, instructions like movupd and andpd are completely useless wastes of space that only exist for decoder consistency: a 66 prefix in front of an SSE1 opcode always does the pd version of it. It might have been smarter to save some of that coding space for other future extensions, but Intel didn't do that.


Or perhaps the motivation was the future possibility of a CPU that did have separate SIMD-double vs. SIMD-float domains, since it was early days for Intel's FP SIMD in general when SSE2 was being designed on paper. These days we can say that's unlikely because FMA units take a lot of transistors, and can apparently be built to share some mantissa-multiplier hardware between one 53-bit mantissa per 64-bit element vs. two 23-bit mantissas per 2x 32-bit elements.


Having separate forwarding domains would probably only be useful if you also had separate execution units for float vs. double math, not sharing transistors, unless you had different input and output ports for different types but the same actual internals? IDK enough about that level of CPU design detail.


There's no advantage to ps for the AVX VEX-encoded versions, but also no disadvantage, so it's probably simpler for LLVM's optimizer / code generator to just always do that instead of ever caring about trying to respect the source intrinsics. (Clang / LLVM doesn't in general try to do that, e.g. it freely optimizes shuffle intrinsics into different shuffles. Often this is good, but sometimes it de-optimizes carefully crafted intrinsics when it doesn't know a trick that the author of the intrinsics did.)


e.g. LLVM probably thinks in terms of "FP-domain 128-bit bitwise AND", and knows the instruction for that is andps / vandps. There's no reason for clang to even know that vandpd exists, because there's no case where it would help to use it.

脚注 1:推土机隐藏元数据和数学指令之间的转发:
AMD 推土机系列对诸如 mulps 之类的无意义事物有惩罚 ->mulpd,用于实际的 FP 数学 指令,这些指令实际上关心 FP 值的符号/指数/尾数分量(不是布尔值或随机数).

Footnote 1: Bulldozer hidden metadata and forwarding between math instructions:
AMD Bulldozer-family has a penalty for nonsensical things like mulps -> mulpd, for actual FP math instructions that actually care about the sign/exponent/mantissa components of an FP value (not booleans or shuffles).


It basically never makes sense to treat the concatenation of two IEEE binary32 FP values as a binary64, so this isn't a problem that needs to be worked around. It's mostly just something that gives us insight into how the CPU internals might be designed.


In the Bulldozer-family section of Agner Fog's microarch guide, he explains that the bypass delay for forwarding between two math instructions that run on the FMA units is 1 cycle lower than if another instruction is in the way. e.g. addps / orps / addps has worse latency than addps / addps / orps, assuming those three instructions form a dependency chain.


But for a crazy thing like addps / addpd / orps, you get extra latency. But not for addps / orps / addpd. (orps vs orpd never makes a difference here. shufps would also be equivalent.)


The likely explanation is that BD kept extra stuff with vector elements to be reused in that special forwarding case, to maybe avoid some formatting / normalization work when forwarding FMA->FMA. If it's in the wrong format, that optimistic approach has to recover and do the architecturally required thing, but again, that only happens if you actually treat the result of a float FMA/add/mul as doubles, or vice versa.


addps could forward to a shuffle like unpcklpd without delay, so it's not evidence of 3 separate bypass networks, or any justification for the use (or existence) of andpd / orpd.
