How would fabs(double) be implemented on x86? Is it an expensive operation?
Question
High-level programming languages often provide a function to determine the absolute value of a floating-point value. For example, the C standard library has the fabs(double) function.
How is this library function actually implemented for x86 targets? What would actually be happening "under the hood" when I call a high-level function like this?
Is it an expensive operation (a combination of multiplication and taking the square root)? Or is the result found just by removing a negative sign in memory?
Answer
In general, computing the absolute value of a floating-point number is an extremely cheap and fast operation.
In practically all cases, you can simply treat the fabs function from the standard library as a black box, sprinkling it in your algorithms where necessary, without any need to worry about how it will affect execution speed.
If you want to understand why this is such a cheap operation, you need to know a little about how floating-point values are represented. Although the C and C++ language standards do not actually mandate it, most implementations follow the IEEE-754 standard. In that standard, each floating-point value's representation contains a bit known as the sign bit, which marks whether the value is positive or negative. For example, consider a double, which is a 64-bit double-precision floating-point value:
(Figure: bit layout of an IEEE-754 double-precision value. Image courtesy of Codekaizen, via Wikipedia, licensed under CC BY-SA.)
You can see the sign bit on the far left, in light blue. The same holds for all precisions of floating-point values in IEEE-754. Therefore, taking the absolute value basically just amounts to clearing a single bit in the value's representation in memory. In particular, you just need to mask off the sign bit (bitwise AND), forcing it to 0, which makes the value non-negative.
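As an illustration of the masking idea, here is a minimal C sketch. It assumes that double is a 64-bit IEEE-754 value; the name abs_via_mask is made up for this example and is not how any particular C library spells its fabs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: double uses the 64-bit IEEE-754 binary64 layout. */
static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");

/* Hypothetical helper for illustration; not the library's fabs. */
double abs_via_mask(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy out the raw bit pattern */
    bits &= ~(UINT64_C(1) << 63);     /* bitwise AND clears bit 63, the sign bit */
    memcpy(&x, &bits, sizeof x);      /* copy the modified pattern back */
    return x;
}
```

With this sketch, abs_via_mask(-1.5) gives 1.5, and a non-negative input comes back unchanged. The memcpy calls are only there to reinterpret the bytes in a well-defined way; compilers normally optimize them away.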
Assuming that your target architecture has hardware support for floating-point operations, this is generally a single, one-cycle instruction, which is about as fast as an operation can be. An optimizing compiler will inline a call to the fabs library function, emitting that single hardware instruction in its place.
If your target architecture doesn't have hardware support for floating-point (which is pretty rare nowadays), then there will be a library that emulates these semantics in software, thus providing floating-point support. Typically, floating-point emulation is slow, but finding the absolute value is one of the fastest things you can do, since it is literally just manipulating a bit. You'll pay the overhead of a function call to fabs, but at worst, the implementation of that function will just involve reading the bytes from memory, masking off the sign bit, and storing the result back to memory.
Looking specifically at x86, which does implement IEEE-754 in hardware, there are two main ways that your C compiler will transform a call to fabs into machine code.
In 32-bit builds, where the legacy x87 FPU is being used for floating-point operations, it will emit an fabs instruction. (Yep, same name as the C function.) This strips the sign bit, if present, from the floating-point value at the top of the x87 register stack. On older AMD processors and the Intel Pentium 4, fabs is a 1-cycle instruction with a 2-cycle latency. On AMD Ryzen and all other Intel processors, it is a 1-cycle instruction with a 1-cycle latency.
In 32-bit builds that can assume SSE support, and in all 64-bit builds (where SSE is always supported), the compiler will emit an ANDPS instruction* that does exactly what I described above: it bitwise-ANDs the floating-point value with a constant mask, masking out the sign bit. Notice that SSE2 doesn't have a dedicated instruction for taking the absolute value like x87 does, but it doesn't even need one, because the multi-purpose bitwise-op instructions do the job just fine. The execution characteristics (cycles, latency, and so on) vary a bit more widely from one processor microarchitecture to another, but it generally has a throughput of 1–3 cycles, with a similar latency. If you like, you can look it up in Agner Fog's instruction tables for the processors of interest.
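To make the constant-mask approach concrete, here is a hedged C sketch using SSE2 intrinsics. The function name abs_sse2 is hypothetical, and the code falls back to plain integer masking when the compiler does not define __SSE2__ (for example on non-x86 targets). It mirrors what the compiler emits for fabs rather than reproducing any library's source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Assumption: double uses the 64-bit IEEE-754 binary64 layout. */
static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");

/* Hypothetical helper for illustration; not the library's fabs. */
double abs_sse2(double x)
{
#if defined(__SSE2__)
    /* Constant mask with every bit set except the sign bit of each lane. */
    const __m128d mask =
        _mm_castsi128_pd(_mm_set1_epi64x(INT64_C(0x7FFFFFFFFFFFFFFF)));
    __m128d v = _mm_set_sd(x);       /* place x in the low lane */
    v = _mm_and_pd(v, mask);         /* ANDPD: clear the sign bit */
    return _mm_cvtsd_f64(v);         /* extract the low lane */
#else
    /* Portable fallback: the same mask applied with integer operations. */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF);
    memcpy(&x, &bits, sizeof x);
    return x;
#endif
}
```

In ordinary code there is no reason to write this by hand: calling fabs and compiling with optimizations enabled typically yields the same single AND against a constant.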
If you're really interested in digging into this, you might see this answer (hat tip to Peter Cordes), which explores a variety of different ways to implement an absolute-value function using SSE instructions, comparing their performance and discussing how you could get a compiler to generate the appropriate code. As you can see, since you're just manipulating bits, there are a variety of possible solutions! In practice, though, the current crop of compilers does exactly what I've described for the C library function fabs, which makes sense, because this is the best general-purpose solution.
__
* Technically, this might also be ANDPD, where the D means "double" (and the S means "single"), but ANDPD requires SSE2 support. SSE supports single-precision floating-point operations and has been available all the way back to the Pentium III. SSE2 is required for double-precision floating-point operations and was introduced with the Pentium 4. SSE2 is always supported on x86-64 CPUs. Whether ANDPS or ANDPD is used is a decision made by the compiler's optimizer; sometimes you will see ANDPS being used on a double-precision floating-point value, since it just requires writing the mask the right way.
Also, on CPUs that support AVX instructions, you'll generally see a VEX prefix on the ANDPS/ANDPD instruction, so that it becomes VANDPS/VANDPD. Details on how this works and what its purpose is can be found elsewhere online; suffice it to say that mixing VEX and non-VEX instructions can result in a performance penalty, so compilers try to avoid it. Again, though, both of these versions have the same effect and virtually identical execution speeds.
Oh, and because SSE is a SIMD instruction set, it is possible to compute the absolute value of multiple floating-point values at once. This, as you might imagine, is especially efficient. Compilers with auto-vectorization capabilities will generate code like this where possible. Example (the mask can either be generated on the fly, as shown here, or loaded as a constant):
pcmpeqd xmm1, xmm1 ; generate the mask (all 1s) in a temporary register
psrld xmm1, 1 ; put 1s in all but the left-most bit of each packed dword
andps xmm0, xmm1 ; mask off sign bit in each packed floating-point value
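The same three-instruction sequence can be sketched in C with intrinsics. This is only an illustration under stated assumptions: abs_floats is a hypothetical helper name, float is assumed to be a 32-bit IEEE-754 value, and the SIMD path is compiled only when __SSE2__ is defined. The scalar loop handles leftover elements and acts as the fallback elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE and SSE2 intrinsics */
#endif

/* Assumption: float uses the 32-bit IEEE-754 binary32 layout. */
static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");

/* Hypothetical helper: replaces each of the n floats with its absolute value. */
void abs_floats(float *data, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_cmpeq_epi32(zero, zero);  /* pcmpeqd: all bits set */
    const __m128 mask =
        _mm_castsi128_ps(_mm_srli_epi32(ones, 1));     /* psrld: 0x7FFFFFFF per lane */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);             /* load four floats */
        _mm_storeu_ps(data + i, _mm_and_ps(v, mask));  /* andps: clear four sign bits */
    }
#endif
    for (; i < n; ++i) {                               /* scalar tail or fallback */
        uint32_t bits;
        memcpy(&bits, &data[i], sizeof bits);
        bits &= UINT32_C(0x7FFFFFFF);
        memcpy(&data[i], &bits, sizeof bits);
    }
}
```

Four single-precision values are processed per AND in the SIMD loop. An auto-vectorizing compiler will usually produce something comparable from a plain loop that calls fabsf, so hand-written intrinsics are rarely needed for this.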