Does each Floating point operation take the same time?
I believe integer addition or subtraction always takes the same time no matter how big the operands are. The time needed for the ALU output to stabilize may vary with the input operands, but the CPU component that consumes the ALU output will wait long enough that any integer operation is processed in the SAME number of cycles. (The cycles needed for ADD, SUB, MUL, and DIV will differ, but ADD will take the same number of cycles regardless of input operands, I think.)
Is this true for floating point operations, too?
I'm trying to implement a program which includes extensive floating point operations. I wonder if it would help to scale the numbers I'm dealing with for fast running time.
TL;DR: avoid denormal numbers and you're fine. If you don't need gradual underflow, set the Denormals Are Zero and Flush To Zero bits in the x86 MXCSR, or the equivalent for other architectures. In most CPUs, producing a denormal result traps to microcode, so it takes hundreds of cycles instead of 5.
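On x86, the DAZ and FTZ bits can be set with the standard SSE intrinsics from xmmintrin.h / pmmintrin.h. A minimal sketch (the function name enable_ftz_daz is my own; compile for x86 with SSE enabled):

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE (SSE) */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE (SSE3) */

/* Set Flush To Zero (MXCSR bit 15) and Denormals Are Zero (bit 6).
 * Afterwards, denormal results are flushed to 0 and denormal inputs are
 * treated as 0, so the slow microcode path for gradual underflow is never
 * taken.  This sacrifices gradual underflow, so only do it if you don't
 * need tiny values between 0 and FLT_MIN / DBL_MIN to stay distinct. */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

gcc and clang will set these bits for you at program startup if you link with -ffast-math (via crtfastmath.o), but setting them explicitly keeps the rest of the program's FP semantics strict.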
See Agner Fog's insn tables for x86 CPU details, and also the x86 tag wiki.
It depends on your CPU, but typical modern FPUs are all similar in this respect.
Other than denormal operands, latency/throughput of add/sub/mul operations are not data-dependent on typical modern FPUs. They're usually fully pipelined but with multi-cycle latency (i.e. a new MUL can begin execution every cycle, if its inputs are ready), which makes variable-latency inconvenient for out-of-order scheduling.
Variable latency would mean that two outputs would be ready in the same cycle, defeating the purpose of fully pipelining it, and making it impossible for the scheduler to reliably avoid conflicts like it does normally when dealing with known but mixed latency instructions / uops. (These lecture notes about in-order pipelines show how that's a structural hazard for write-back (WB), but the same idea applies for the ALU itself needing an extra buffer until it can hand off all the results it has ready.)
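The practical consequence of "fully pipelined, multi-cycle latency" is that a serial dependency chain of multiplies runs at one result per latency (about 5 cycles per mul on Haswell), while independent work runs at the much higher throughput. A sketch of the two cases (hypothetical function names; on a Haswell-like core, timing these two calls typically shows the dependent chain running several times slower for the same number of multiplies):

```c
enum { ITERS = 20000000 };  /* divisible by 4 */

/* Latency-bound: a single serial dependency chain -- each multiply must
 * wait for the previous result before it can start. */
double mul_chain_dependent(double x) {
    double acc = 1.0;
    for (int i = 0; i < ITERS; i++)
        acc *= x;
    return acc;
}

/* Throughput-bound: four independent accumulators let the pipelined
 * multiplier start a new mul every cycle.  Same total number of muls. */
double mul_chain_independent(double x) {
    double a0 = 1.0, a1 = 1.0, a2 = 1.0, a3 = 1.0;
    for (int i = 0; i < ITERS; i += 4) {
        a0 *= x; a1 *= x; a2 *= x; a3 *= x;
    }
    return (a0 * a1) * (a2 * a3);
}
```

Pass x through a volatile variable (or from argv) when benchmarking so the compiler can't fold the loops away.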
As an example on the high-performance end of the spectrum: Intel Haswell:
- mulpd (scalar, 128b or 256b vector of double-precision): 5c latency, two per 1c throughput (two separate ALUs).
- FMA: 5c latency, two per 1c throughput.
- addpd/subpd: 3c latency, one per 1c throughput. (But the add unit is on the same port as one of the mul/FMA units.)
- divpd (scalar or 128b vectors): 10-20c latency, one per 8-14c throughput. (Also on the same port as one of the mul/FMA units.) Slower for 256b vectors (the div ALU isn't full-width). Somewhat faster for floats, unlike add/sub/mul.
- sqrtpd: 16c latency, one per 8-14c throughput. Again not full width, and faster for float.
- rsqrtps (fast, very approximate, only available for float): 5c latency, one per 1c throughput.
div/sqrt are the exception: their throughput and latency are data-dependent.
There are no fast parallel algorithms for div or sqrt, even in hardware. Some kind of iterative calculation is required, so fully pipelining would require duplicating lots of very similar hardware for each pipeline stage. Still, modern Intel x86 CPUs have partially-pipelined div and sqrt, with reciprocal throughput less than latency.
Compared to mul, div/sqrt have much lower throughput (~1/10th or worse), and significantly higher latency (~2x to 4x). The not-fully-pipelined nature of the div/sqrt unit in modern FPUs means that it can be variable latency without causing too many collisions at the ALU output port.
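Because division has far lower throughput than multiplication, a common optimization when dividing many values by the same divisor is to hoist the division out of the loop: compute the reciprocal once and multiply. A minimal sketch (hypothetical helper names):

```c
#include <stddef.h>

/* One divpd per element: throughput-limited to roughly one result
 * per 8-14 cycles on Haswell. */
void scale_div(double *a, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        a[i] /= s;
}

/* One division total, then fully-pipelined multiplies.  The result can
 * differ from true division by one rounding step, which is why compilers
 * only make this transformation for you under -freciprocal-math /
 * -ffast-math style options. */
void scale_recip(double *a, size_t n, double s) {
    double inv = 1.0 / s;
    for (size_t i = 0; i < n; i++)
        a[i] *= inv;
}
```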
SSE/AVX doesn't implement sin/cos/exp/log as single instructions; math libraries should code their own. Good math libraries didn't use x87 fsin even before SSE existed, because fsin has to be bug-compatible with the 8087, and uses a 66-bit value of Pi for range reduction to +/- pi/2. (Bruce Dawson's series of articles about floating point are excellent, and you should definitely read them if you're about to write some floating point code. Index in this one.)
IDK about x87 exp or log instructions, like fyl2x. They're microcoded, so they're nothing special for speed, but might be ok for accuracy. Still, a modern math library wouldn't copy a value from an xmm register to x87 just for that instruction. The x87 instruction is probably slower than what you can do with normal SSE math instructions.
For more about fast reciprocal and fast reciprocal sqrt, see Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?
rsqrtps with a Newton-Raphson iteration is slightly less accurate than normal sqrtps. On Intel Haswell/Skylake, it's about the same latency IIRC, but may have better throughput. Without a NR iteration, it's too inaccurate for most uses.
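As a sketch of that refinement, using the standard SSE intrinsics from xmmintrin.h: one Newton-Raphson step roughly doubles the number of correct bits, taking the ~12-bit rsqrtps estimate close to full float precision (the function name rsqrt_nr is my own):

```c
#include <xmmintrin.h>

/* Approximate 1/sqrt(x) for 4 floats: start from the ~12-bit-accurate
 * rsqrtps estimate y, then apply one Newton-Raphson step
 *     y' = y * (1.5 - 0.5 * x * y * y)
 * which roughly doubles the number of correct bits.  Often still cheaper
 * than sqrtps + divps in throughput-bound loops. */
__m128 rsqrt_nr(__m128 x) {
    __m128 y   = _mm_rsqrt_ps(x);                          /* coarse estimate */
    __m128 xy2 = _mm_mul_ps(_mm_mul_ps(x, y), y);          /* x * y * y */
    __m128 t   = _mm_sub_ps(_mm_set1_ps(1.5f),
                            _mm_mul_ps(_mm_set1_ps(0.5f), xy2));
    return _mm_mul_ps(y, t);                               /* refined estimate */
}
```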
Anyway, this has gotten quite x86-specific. The relative performance of mul vs. sqrt depends heavily on CPU microarchitecture, but even across x86 vs. ARM vs. most other modern CPUs with hardware FPUs, you should find that mul and add performance aren't data-dependent.