Does any floating point-intensive code produce bit-exact results in any x86-based architecture?

Question

I would like to know if any code in C or C++ using floating point arithmetic would produce bit exact results in any x86 based architecture, regardless of the complexity of the code.

To my knowledge, any x86 architecture since the Intel 8087 uses a FPU unit prepared to handle IEEE-754 floating point numbers, and I cannot see any reason why the result would be different in different architectures. However, if they were different (namely due to different compiler or different optimization level), would there be some way to produce bit-exact results by just configuring the compiler?

Answer

Contents:

  • C/C++
  • asm
  • Real-world software that achieves this

No: fully ISO C11 and IEEE conforming C implementations are not guaranteed to produce results bit-identical to other C implementations, even other implementations on the same hardware.

(And first of all, I'm going to assume we're talking about normal C implementations where double is the IEEE-754 binary64 format, etc., even though it would be legal for a C implementation on x86 to use some other format for double and implement FP math with software emulation, and define the limits in float.h. That might have been plausible when not all x86 CPUs included an FPU, but in 2016 that's Deathstation 9000 territory.)

related: Bruce Dawson's Floating-Point Determinism blog post is an answer to this question. His opening paragraph is amusing (and is followed by a lot of interesting stuff):

Is IEEE floating-point math deterministic? Will you always get the same results from the same inputs? The answer is an unequivocal "yes". Unfortunately the answer is also an unequivocal "no". I’m afraid you will need to clarify your question.

If you're pondering this question, then you will definitely want to have a look at the index to Bruce's series of articles about floating point math, as implemented by C compilers on x86, and also asm, and IEEE FP in general.

First problem: Only "basic operations": + - * / and sqrt are required to return "correctly rounded" results, i.e. <= 0.5ulp of error, correctly-rounded out to the last bit of the mantissa, so the result is the closest representable value to the exact result.

Other math library functions like pow(), log(), and sin() allow implementers to make a tradeoff between speed and accuracy. For example, glibc generally favours accuracy, and is slower than Apple's OS X math libraries for some functions, IIRC. See also glibc's documentation of the error bounds for every libm function across different architectures.
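
As a concrete illustration (my sketch, not part of the original answer), the program below prints the exact bit patterns of one basic operation and one libm call. The division is required to be correctly rounded, so its bits match on every conforming implementation; the sin() bits are allowed to differ between glibc, musl, MSVC's CRT, Apple's libm, and so on:

```c
#include <inttypes.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    double q = 1.0 / 3.0;  /* basic op: correctly rounded, same bits everywhere */
    double s = sin(1e9);   /* libm call: the last bit is the library's choice;
                              the huge argument also stresses range reduction */
    uint64_t qb, sb;
    memcpy(&qb, &q, sizeof qb);  /* view the doubles' bit patterns safely */
    memcpy(&sb, &s, sizeof sb);
    printf("1.0/3.0  bits: 0x%016" PRIx64 "\n", qb);
    printf("sin(1e9) bits: 0x%016" PRIx64 "\n", sb);
    return 0;
}
```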

But wait, it gets worse. Even code that only uses the correctly-rounded basic operations doesn't guarantee the same results.

C rules also allow some flexibility in keeping higher precision temporaries. The implementation defines FLT_EVAL_METHOD so code can detect how it works, but you don't get a choice if you don't like what the implementation does. You do get a choice (with #pragma STDC FP_CONTRACT off) to forbid the compiler from e.g. turning a*b + c into an FMA with no rounding of the a*b temporary before the add.
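
A minimal sketch of what that pragma controls (the standard spells the switch OFF; compiler support varies — Clang honors the pragma, while GCC has historically only respected -ffp-contract=off):

```c
#pragma STDC FP_CONTRACT OFF  /* forbid contracting a*b + c into fma(a, b, c) */

double muladd(double a, double b, double c) {
    /* With contraction OFF, a*b must be rounded to double before the add,
       so machines with and without FMA hardware agree bit-for-bit.
       With contraction allowed (e.g. -ffp-contract=fast), the compiler may
       emit a single FMA instruction that skips that intermediate rounding. */
    return a * b + c;
}
```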

On x86, compilers targeting 32-bit non-SSE code (i.e. using obsolete x87 instructions) typically keep FP temporaries in x87 registers between operations. This produces the FLT_EVAL_METHOD = 2 behaviour of 80-bit precision. (The standard specifies that rounding still happens on every assignment, but real compilers like gcc don't actually do extra store/reloads for rounding unless you use -ffloat-store. See https://gcc.gnu.org/wiki/FloatingPointMath. That part of the standard seems to have been written assuming non-optimizing compilers, or hardware that efficiently provides rounding to the type width like non-x86, or like x87 with precision set to round to 64-bit double instead of 80-bit long double. Storing after every statement is exactly what gcc -O0 and most other compilers do, and the standard allows extra precision within evaluation of one expression.)
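
Here is a minimal sketch of the difference (my example): 2^24 + 0.5 + 0.5 comes out differently depending on whether the intermediate sum keeps extra precision or is rounded to float after each add:

```c
#include <stdio.h>

int main(void) {
    volatile float big = 16777216.0f;  /* 2^24: float can't represent larger odd integers */
    volatile float half = 0.5f;
    float sum = big + half + half;     /* one expression, so temporaries may keep extra precision */
    /* x87 80-bit temporaries (e.g. gcc -m32 -mfpmath=387):  16777217.0
       SSE math (e.g. x86-64 default, addss rounds each add): 16777216.0
       (2^24 + 0.5 is a round-to-even tie that drops back to 2^24) */
    printf("%.1f\n", sum);
    return 0;
}
```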

So when targeting x87, the compiler is allowed to evaluate the sum of three floats with two x87 FADD instructions, without rounding off the sum of the first two to a 32-bit float. In that case, the temporary has 80-bit precision... Or does it? Not always, because the C implementation's startup code (or a Direct3D library!!!) may have changed the precision setting in the x87 control word, so values in x87 registers are rounded to 53 or 24 bit mantissa. (This makes FDIV and FSQRT run a bit faster.) All of this is from Bruce Dawson's article about intermediate FP precision.
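
On glibc/x86 you can read or pin that precision-control field yourself. A hedged sketch using glibc's <fpu_control.h> (non-portable and x87-specific; the function name is mine):

```c
#include <fpu_control.h>  /* glibc-specific x86 header */

/* Force x87 temporaries to round to a 53-bit mantissa (double precision),
   i.e. the same setting a Direct3D-era library might set behind your back. */
void x87_set_double_precision(void) {
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;  /* clear PC bits, select 53-bit */
    _FPU_SETCW(cw);
}
```

Even then the exponent range stays extended, so underflow/overflow of temporaries can still behave differently from real SSE double math.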

With rounding mode and precision set the same, I think every x86 CPU should give bit-identical results for the same inputs, even for complex x87 instructions like FSIN.

Intel's manuals don't define exactly what those results are for every case, but I think Intel aims for bit-exact backwards compatibility. I doubt they'll ever add extended-precision range-reduction for FSIN, for example. It uses the 80-bit pi constant you get with fldpi (correctly-rounded 64-bit mantissa, actually 66-bit because the next 2 bits of the exact value are zero). Intel's documentation of the worst-case-error was off by a factor of 1.3 quintillion until they updated it after Bruce Dawson noticed how bad the worst-case actually was. But this can only be fixed with extended-precision range reduction, so it wouldn't be cheap in hardware.

I don't know if AMD implements their FSIN and other micro-coded instructions to always give bit-identical results to Intel, but I wouldn't be surprised. Some software does rely on it, I think.

Since SSE only provides instructions for add/sub/mul/div/sqrt, there's nothing too interesting to say. They implement the IEEE operation exactly, so there's no chance that any x86 implementation will ever give you anything different (unless the rounding mode is set differently, or denormals-are-zero and/or flush-to-zero are different and you have any denormals).
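
If you want to make sure the MXCSR really is in its deterministic default state, the standard intrinsics can assert it; a small sketch (the accessor macros are real, the function name is mine):

```c
#include <xmmintrin.h>  /* _MM_SET_ROUNDING_MODE, _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

void sse_reset_fp_env(void) {
    _MM_SET_ROUNDING_MODE(_MM_ROUND_NEAREST);            /* power-on default */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);         /* keep subnormal results */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_OFF); /* keep subnormal inputs */
}
```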

SSE rsqrt (fast approximate reciprocal square root) is not exactly specified, and I think it's possible you might get a different result even after a Newton iteration, but other than that SSE/SSE2 is always bit exact in asm, assuming the MXCSR isn't set weird. So the only question is getting the compiler to generate the same code, or just using the same binaries.
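
For reference, this is what the rsqrt-plus-Newton pattern looks like with intrinsics (my sketch): the refined value reaches roughly full float precision, but it is seeded by the implementation-defined rsqrtss result, so it is not guaranteed identical across CPU vendors or generations:

```c
#include <xmmintrin.h>

float approx_rsqrt(float x) {
    __m128 v = _mm_set_ss(x);
    __m128 r = _mm_rsqrt_ss(v);  /* ~12-bit approximation of 1/sqrt(x) */
    /* One Newton-Raphson step: r' = r * (3 - x*r*r) / 2 */
    __m128 xrr = _mm_mul_ss(v, _mm_mul_ss(r, r));
    r = _mm_mul_ss(_mm_mul_ss(_mm_set_ss(0.5f), r),
                   _mm_sub_ss(_mm_set_ss(3.0f), xrr));
    return _mm_cvtss_f32(r);
}
```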

So, if you statically link a libm that uses SSE/SSE2 and distribute those binaries, they will run the same everywhere. Unless that library uses run-time CPU detection to choose alternate implementations...

As @Yan Zhou points out, you pretty much need to control every bit of the implementation down to the asm to get bit-exact results.

However, some games really do depend on this for multi-player, but often with detection/correction for clients that get out of sync. Instead of sending the entire game state over the network every frame, every client computes what happens next. If the game engine is carefully implemented to be deterministic, they stay in sync.

In Spring RTS, clients checksum their gamestate to detect desync. I haven't played it for a while, but I do remember reading something at least 5 years ago about them trying to achieve sync by making sure all their x86 builds used SSE math, even the 32-bit builds.
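
A toy version of that idea (purely illustrative, not Spring's actual code): hash the bit patterns of the simulation state, so even a one-ulp divergence between clients changes the checksum:

```c
#include <stdint.h>
#include <string.h>

uint64_t gamestate_checksum(const float *state, size_t n) {
    uint64_t h = 1469598103934665603ULL;        /* FNV-1a offset basis */
    for (size_t i = 0; i < n; i++) {
        uint32_t bits;
        memcpy(&bits, &state[i], sizeof bits);  /* hash bits, not values: 1-ulp
                                                   drift or -0.0 vs +0.0 counts */
        for (int b = 0; b < 32; b += 8) {
            h ^= (bits >> b) & 0xFF;
            h *= 1099511628211ULL;              /* FNV-1a prime */
        }
    }
    return h;
}
```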

One possible reason for some games not allowing multi-player between PC and non-x86 console systems is that the engine gives the same results on all PCs, but different results on the different-architecture console with a different compiler.

Further reading: GAFFER ON GAMES: Floating Point Determinism. Some techniques that real game engines use to get deterministic results. e.g. wrap sin/cos/tan in non-optimized function calls to force the compiler to leave them at single-precision.
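
A sketch of that wrapping trick (hypothetical names; __attribute__((noinline)) is the GCC/Clang spelling, MSVC spells it __declspec(noinline)):

```c
#include <math.h>

/* A never-inlined wrapper: the call boundary stops the compiler from
   evaluating sinf at higher precision, vectorizing it, or substituting
   its own expansion at each call site. */
__attribute__((noinline))
float wrapped_sinf(float x) {
    volatile float r = sinf(x);  /* volatile forces a rounded 32-bit result */
    return r;
}
```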
