Does each floating point operation take the same time?


Question


I believe integer addition or subtraction always takes the same time no matter how big the operands are. The time needed for the ALU output to stabilize may vary with the input operands, but the CPU component that consumes the ALU output will wait long enough that any integer operation is processed in the same number of cycles. (The cycles needed for ADD, SUB, MUL, and DIV will differ from each other, but ADD will take the same number of cycles regardless of its input operands, I think.)

Is this true for floating point operation, too?

I'm trying to implement a program which includes extensive floating point operations. I wonder if it would help running time to scale the numbers I'm dealing with.

Answer

TL;DR: avoid denormal numbers and you're fine. If you don't need gradual underflow, set the Denormals Are Zero and Flush To Zero bits in the x86 MXCSR, or the equivalent for other architectures. In most CPUs, producing a denormal result traps to microcode, so it takes hundreds of cycles instead of 5.
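As a rough sketch of setting those two MXCSR bits from C, assuming an x86-64 target where float math is done with SSE instructions: the macros come from the compiler's `<xmmintrin.h>` and `<pmmintrin.h>` headers, while the helper names `enable_ftz_daz` and `half_of_smallest_normal` are made up for this illustration.

```c
#include <assert.h>
#include <float.h>      /* FLT_MIN: smallest normal float */
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

/* Hypothetical helper: set FTZ and DAZ in the calling thread's MXCSR.
 * FTZ: denormal results are replaced by zero.
 * DAZ: denormal inputs are treated as zero. */
static void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    /* Sanity check that both bits are now set. */
    assert(_MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON);
    assert(_MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON);
}

/* Hypothetical helper: halve the smallest normal float at run time.
 * The volatile operands keep the compiler from folding the multiply
 * at compile time, so the result depends on the current MXCSR. */
static float half_of_smallest_normal(void)
{
    volatile float tiny = FLT_MIN;
    volatile float half = 0.5f;
    return tiny * half;
}
```

With the default MXCSR, `half_of_smallest_normal()` returns a denormal; after `enable_ftz_daz()` it returns zero. MXCSR is per-thread state, so each thread that does heavy FP work would need to call the helper itself.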

See Agner Fog's insn tables for x86 CPU details, and also the x86 tag wiki.


It depends on your CPU, but typical modern FPUs are all similar in this respect.

Other than denormal operands, latency/throughput of add/sub/mul operations are not data-dependent on typical modern FPUs. They're usually fully pipelined but with multi-cycle latency (i.e. a new MUL can begin execution every cycle, if its inputs are ready), which makes variable-latency inconvenient for out-of-order scheduling.

Variable latency would mean that two outputs could become ready in the same cycle, defeating the purpose of fully pipelining the unit, and making it impossible for the scheduler to reliably avoid conflicts the way it normally does when dealing with known but mixed latency instructions / uops. (These lecture notes about in-order pipelines show how that's a structural hazard for write-back (WB), but the same idea applies to the ALU itself needing an extra buffer until it can hand off all the results it has ready.)
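One practical consequence of "fully pipelined, multi-cycle latency" is that a single chain of dependent operations is limited by latency, while several independent chains can overlap in the pipeline. The sketch below contrasts the two shapes; the function names are made up for illustration, and whether the second version is actually faster depends on your compiler flags and CPU.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add depends on the previous add,
 * so the loop can run no faster than one add per add-latency. */
static double sum_one_chain(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four independent dependency chains, so a
 * pipelined adder can have several adds in flight at once. */
static double sum_four_chains(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

The two functions add the elements in a different order, so their results can differ in the last bits for general inputs. That is also why a compiler will not make this transformation for you unless you allow it to reassociate FP math.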

As an example on the high-performance end of the spectrum: Intel Haswell:

  • mulpd (scalar, 128b or 256b vector of double-precision): 5c latency, two per 1c throughput (two separate ALUs).
  • FMA: 5c latency, two per 1c throughput
  • addpd/subpd: 3c latency, one per 1c throughput. (But the add unit is on the same port as one of the mul/FMA units)
  • divpd (scalar or 128b-vectors): 10-20c latency, one per 8-14c throughput. (Also on the same port as one of the mul/FMA units). Slower for 256b vectors (the div ALU isn't full-width). Somewhat faster for floats, unlike add/sub/mul.
  • sqrtpd: 16c latency, one per 8-14c throughput. Again not full width, and faster for float.
  • rsqrtps (fast very approximate, only available for float): 5c latency, one per 1c throughput.

div/sqrt are the exception: their throughput and latency are data-dependent.

There are no fast parallel algorithms for div or sqrt, even in hardware. Some kind of iterative calculation is required, so fully pipelining would require duplicating lots of very similar hardware for each pipeline stage. Still, modern Intel x86 CPUs have partially-pipelined div and sqrt, with reciprocal throughput less than latency.

Compared to mul, div/sqrt have much lower throughput (~1/10th or worse), and significantly higher latency (~2x to 4x). The not-fully-pipelined nature of the div/sqrt unit in modern FPUs means that it can be variable latency without causing too many collisions at the ALU output port.

SSE/AVX doesn't implement sin/cos/exp/log as single instructions; math libraries should code their own. Good math libraries didn't use x87 fsin even before SSE existed, because fsin has to be bug-compatible with 8087, and uses a 66-bit value of Pi for range reduction to +/- pi/2. (Bruce Dawson's series of articles about floating point are excellent, and you should definitely read them if you're about to write some floating point code. Index in this one.)

IDK about x87 exp or log instructions, like fyl2x. They're microcoded, so they're nothing special for speed, but might be ok for accuracy. Still, a modern math library wouldn't copy a value from an xmm register to x87 just for that instruction. The x87 instruction is probably slower than what you can do with normal SSE math instructions.


For more about fast reciprocal and fast reciprocal sqrt, see Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

rsqrtps with a Newton-Raphson iteration is slightly less accurate than normal sqrtps. On Intel Haswell/Skylake, it's about the same latency IIRC, but may have better throughput. Without a NR iteration, it's too inaccurate for most uses.
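A minimal sketch of rsqrtps plus one Newton-Raphson step, using the SSE intrinsics from `<xmmintrin.h>`. The helper names `rsqrt_nr_ps` and `rsqrt_nr` are made up here; the refinement step is the standard y' = y * (1.5 - 0.5 * x * y * y).

```c
#include <assert.h>
#include <xmmintrin.h>  /* _mm_rsqrt_ps and friends (SSE) */

/* Hypothetical helper: approximate 1/sqrt(x) for four floats.
 * _mm_rsqrt_ps gives a rough estimate (Intel documents a maximum
 * relative error of 1.5 * 2^-12); one Newton-Raphson step then
 * roughly doubles the number of correct bits. */
static __m128 rsqrt_nr_ps(__m128 x)
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);

    __m128 y = _mm_rsqrt_ps(x);                    /* initial estimate */
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);  /* x * y * y        */
    /* y * (1.5 - 0.5 * x * y * y) */
    return _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
}

/* Hypothetical scalar wrapper, mostly to make testing easy. */
static float rsqrt_nr(float x)
{
    assert(x > 0.0f);  /* x == 0 would give inf * 0 = NaN in the NR step */
    return _mm_cvtss_f32(rsqrt_nr_ps(_mm_set1_ps(x)));
}
```

To get an approximate sqrt(x) rather than 1/sqrt(x), multiply the result by x. The result is still not correctly rounded the way sqrtps is, so this only makes sense where a few ULPs of error are acceptable.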

Anyway, this has gotten quite x86-specific. The relative performance of mul vs. sqrt depends heavily on CPU microarchitecture, but even across x86 vs. ARM vs. most other modern CPUs with hardware FPUs, you should find that mul and add performance aren't data-dependent.
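As for the original question about rescaling: it could only help if the values are small enough to be denormal. One quick way to check, sketched here with the standard C `fpclassify` macro from `<math.h>` (the function name `count_subnormals` is made up for illustration):

```c
#include <assert.h>
#include <math.h>    /* fpclassify, FP_SUBNORMAL */
#include <stddef.h>

/* Hypothetical helper: count how many elements are denormal
 * (subnormal), i.e. nonzero but smaller in magnitude than the
 * smallest normal double. */
static size_t count_subnormals(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (fpclassify(x[i]) == FP_SUBNORMAL)
            count++;
    }
    return count;
}
```

If this returns zero for your inputs and intermediate results, scaling the numbers should make no difference to the speed of add, sub, or mul.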
