How would fabs(double) be implemented on x86? Is it an expensive operation?

Question

High-level programming languages often provide a function to determine the absolute-value of a floating-point value. For example, in the C standard library, there is the fabs(double) function.

How is this library function actually implemented for x86 targets? What would actually be happening "under the hood" when I call a high-level function like this?

Is it an expensive operation (a combination of multiplication and taking the square root)? Or is the result found just by removing a negative sign in memory?

Recommended answer

In general, computing the absolute value of a floating-point number is an extremely cheap and fast operation.

In practically all cases, you can simply treat the fabs function from the standard library as a black box, sprinkling it in your algorithms where necessary, without any need to worry about how it will affect execution speed.

If you want to understand why this is such a cheap operation, then you need to know a little bit about how floating-point values are represented. Although the C and C++ language standards do not actually mandate it, most implementations follow the IEEE-754 standard. In that standard, each floating-point value's representation contains a bit known as the sign bit, and this marks whether the value is positive or negative. For example, consider a double, which is a 64-bit double-precision floating-point value:

[Figure: bit layout of an IEEE-754 double-precision value, with the sign bit at the far left.]
(Image courtesy of Codekaizen, via Wikipedia, licensed under CC-BY-SA.)

You can see the sign bit on the far left, in light blue. Every IEEE-754 floating-point format has one, whatever its precision. Therefore, taking the absolute value basically just amounts to clearing a single bit in the value's representation in memory. In particular, you just need to mask off the sign bit (bitwise-AND), forcing it to 0, which marks the value as non-negative.
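To make the masking step concrete, here is a minimal C sketch. The function name abs_by_mask is made up for illustration, and the code assumes that double is a 64-bit IEEE-754 value; it is not how any particular C library implements fabs.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not the library's fabs): clears the sign bit of a
   double, assuming the IEEE-754 binary64 layout described above. */
double abs_by_mask(double x)
{
    static_assert(sizeof(double) == sizeof(uint64_t), "expects a 64-bit double");

    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);        /* read the raw bit pattern */
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF);  /* bitwise-AND: force bit 63 (sign) to 0 */
    memcpy(&x, &bits, sizeof x);           /* write the pattern back */
    return x;
}
```

Calling abs_by_mask(-3.5) yields 3.5, and non-negative inputs come back unchanged, because the exponent and fraction bits are never touched.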

Assuming that your target architecture has hardware support for floating-point operations, this is generally a single, one-cycle instruction—basically, as fast as can possibly be. An optimizing compiler will inline a call to the fabs library function, emitting that single hardware instruction in its place.

If your target architecture doesn't have hardware support for floating-point (which is pretty rare nowadays), then there will be a library that emulates these semantics in software, thus providing floating-point support. Typically, floating-point emulation is slow, but finding the absolute value is one of the fastest things you can do, since it is literally just manipulating a bit. You'll pay the overhead of a function call to fabs, but at worst, the implementation of that function will just involve reading the bytes from memory, masking off the sign bit, and storing the result back to memory.

Looking specifically at x86, which does implement IEEE-754 in hardware, there are two main ways that your C compiler will transform a call to fabs into machine code.

In 32-bit builds, where the legacy x87 FPU is being used for floating-point operations, it will emit an fabs instruction. (Yep, same name as the C function.) This strips the sign bit, if present, from the floating-point value at the top of the x87 register stack. On AMD processors and Intel Pentium 4, fabs is a 1-cycle instruction with a 2-cycle latency. On AMD Ryzen and all other Intel processors, this is a 1-cycle instruction with a 1-cycle latency.

In 32-bit builds that can assume SSE support, and on all 64-bit builds (where SSE is always supported), the compiler will emit an ANDPS instruction* that does exactly what I described above: it bitwise-ANDs the floating-point value with a constant mask, masking out the sign bit. Notice that SSE2 doesn't have a dedicated instruction for taking the absolute value like x87 does, but it doesn't even need one, because the multi-purpose bitwise-op instructions do the job just fine. The execution characteristics (cycles, latency, etc.) vary a bit more widely from one processor microarchitecture to another, but it generally has a throughput of 1–3 cycles, with a similar latency. If you like, you can look it up in Agner Fog's instruction tables for the processors of interest.
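For readers who want to try the SSE2 approach from C, here is a rough sketch using compiler intrinsics. The name abs_sse2 is invented for this example. It uses ANDNPD with a sign-only mask (-0.0) rather than ANDPD with an all-but-sign mask; the two are equivalent, but this is one possible formulation and not necessarily what your compiler emits for fabs. The sketch assumes a GCC- or Clang-style compiler that defines __SSE2__, and falls back to the library fabs elsewhere.

```c
#include <assert.h>   /* static_assert (C11) */
#include <math.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: absolute value of a double via an SSE2 bitwise op. */
double abs_sse2(double x)
{
    static_assert(sizeof(double) == 8, "expects a 64-bit double");
#if defined(__SSE2__)
    __m128d v    = _mm_set_sd(x);     /* x in the low lane, 0.0 in the high lane */
    __m128d sign = _mm_set_sd(-0.0);  /* low lane: only the sign bit is set */
    v = _mm_andnot_pd(sign, v);       /* (~sign) & v clears the sign bit */
    return _mm_cvtsd_f64(v);          /* extract the low lane */
#else
    return fabs(x);                   /* no SSE2: defer to the library */
#endif
}
```

In ordinary code there is no reason to write this by hand: calling fabs and letting the optimizer pick the instruction gives the same result.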

If you're really interested in digging into this, you might see this answer (hat tip to Peter Cordes), which explores a variety of different ways to implement an absolute-value function using SSE instructions, comparing their performance and discussing how you could get a compiler to generate the appropriate code. As you can see, since you're just manipulating bits, there are a variety of possible solutions! In practice, though, the current crop of compilers do exactly as I've described for the C library function fabs, which makes sense, because this is the best general-purpose solution.

__
* Technically, this might also be ANDPD, where the D means "double" (and the S meant "single"), but ANDPD requires SSE2 support. SSE supports single-precision floating-point operations, and was available all the way back to the Pentium III. SSE2 is required for double-precision floating-point operations, and was introduced with the Pentium 4. SSE2 is always supported on x86-64 CPUs. Whether ANDPS or ANDPD is used is a decision made by the compiler's optimizer; sometimes you will see ANDPS being used on a double-precision floating-point value, since it just requires writing the mask the right way.
Also, on CPUs that support AVX instructions, you'll generally see a VEX-prefix on the ANDPS/ANDPD instruction, so that it becomes VANDPS/VANDPD. Details on how this works and what its purpose is can be found elsewhere online; suffice it to say that mixing VEX and non-VEX instructions can result in a performance penalty, so compilers try to avoid it. Again, though, both of these versions have the same effect and virtually identical execution speeds.

Oh, and because SSE is a SIMD instruction set, it is possible to compute the absolute value of multiple floating-point values at once. This, as you might imagine, is especially efficient. Compilers with auto-vectorization capabilities will generate code like this where possible. Example (mask can either be generated on-the-fly, as shown here, or loaded as a constant):

pcmpeqd xmm1, xmm1    ; generate the mask (all 1s) in a temporary register
psrld   xmm1, 1       ; put 1s in all but the left-most bit of each packed dword
andps   xmm0, xmm1    ; mask off sign bit in each packed single-precision value
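
The same three-step sequence can be sketched in C with intrinsics, processing four single-precision values per iteration. The function name abs_floats and the scalar tail loop are assumptions added for illustration; an auto-vectorizing compiler may generate something different, such as loading the mask as a constant. The SIMD path is compiled only when __SSE2__ is defined.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: replaces each of the n floats in data with its
   absolute value, in place. */
void abs_floats(float *data, size_t n)
{
    static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
    size_t i = 0;
#if defined(__SSE2__)
    __m128i zero = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(zero, zero);                /* pcmpeqd: all 1s */
    __m128  mask = _mm_castsi128_ps(_mm_srli_epi32(ones, 1));  /* psrld: 0x7FFFFFFF per dword */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);                     /* unaligned load of 4 floats */
        _mm_storeu_ps(data + i, _mm_and_ps(v, mask));          /* andps: clear 4 sign bits */
    }
#endif
    /* Scalar tail (or the whole array without SSE2): same mask, one value at a time. */
    for (; i < n; ++i) {
        uint32_t bits;
        memcpy(&bits, &data[i], sizeof bits);
        bits &= UINT32_C(0x7FFFFFFF);
        memcpy(&data[i], &bits, sizeof bits);
    }
}
```

For double-precision data the shift would have to operate on 64-bit lanes instead (psrlq, or _mm_srli_epi64 in intrinsics), since the sign bit is then bit 63 of each qword.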
