Why does the latency of the sqrtsd instruction change based on the input? Intel processors

Problem Description

Well, on the Intel intrinsics guide it is stated that the instruction called "sqrtsd" has a latency of 18 cycles.

I tested it with my own program and it is correct if, for example, we take 0.15 as input. But when we take 256 (or any 2^x number), the latency is only 13. Why is that?

One theory I had is that since 13 is the latency of "sqrtss", which is the same as "sqrtsd" but done on 32-bit floating point, maybe the processor was smart enough to understand that 256 can fit in 32 bits and hence use that version, while 0.15 needs the full 64 bits since it isn't representable in a finite way.

I am doing it using inline assembly; here is the relevant part, compiled with gcc -O3 and -fno-tree-vectorize.

static double sqrtsd (double x) {
    double r;
    __asm__ ("sqrtsd %1, %0" : "=x" (r) : "x" (x));
    return r;
}

Answer

SQRT* and DIV* are the only two "simple" ALU instructions (single uop, not microcoded branching / looping) that have data-dependent throughput or latency on modern Intel/AMD CPUs. (Not counting microcode assists for denormal aka subnormal FP values in add/multiply/fma.) Everything else is pretty much fixed, so the out-of-order uop scheduling machinery doesn't need to wait for confirmation that a result will be ready on some particular cycle; it just knows it will be.

As usual, Intel's intrinsics guide gives an over-simplified picture of performance. The actual latency isn't a fixed 18 cycles for double-precision on Skylake. (Based on the numbers you chose to quote, I assume you have a Skylake.)

div/sqrt are hard to implement; even in hardware, the best we can do is an iterative refinement process. Refining more bits at once (a radix-1024 divider since Broadwell) speeds it up (see this Q&A about the hardware). But it's still slow enough that an early-out is used to speed up simple cases. (Or maybe the speedup mechanism is just skipping a setup step for all-zero mantissas on modern CPUs with partially-pipelined div/sqrt units; older CPUs had throughput = latency for FP div/sqrt, and that execution unit is harder to pipeline.)
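
As a rough illustration of the "iterative refinement with an early-out" idea, here is a small software analogy (my sketch only; the hardware uses a radix-N divider/sqrt unit, not a Newton-Raphson loop like this): each step adds more correct bits, and the loop can stop as soon as the estimate stops changing, so simple inputs like 1.0 finish in one step.

#include <stdio.h>

/* Software analogy only: Newton-Raphson refinement for sqrt(a).
 * Not the hardware algorithm; it just shows how an iterative
 * refinement loop can exit early when no more bits change. */
static double nr_sqrt(double a, int *iters)
{
    double x = a > 1.0 ? a : 1.0;            /* crude initial guess */
    *iters = 0;
    for (int i = 0; i < 60; i++) {           /* cap guards against last-bit oscillation */
        double next = 0.5 * (x + a / x);     /* Newton step for f(x) = x*x - a */
        ++*iters;
        if (next == x)                       /* early-out: estimate stopped changing */
            break;
        x = next;
    }
    return x;
}

int main(void)
{
    double tests[] = { 1.0, 4.0, 2.0, 0.15 };
    for (int i = 0; i < 4; i++) {
        int n;
        double r = nr_sqrt(tests[i], &n);
        printf("sqrt(%g) ~= %.17g after %d iteration(s)\n", tests[i], r, n);
    }
    return 0;
}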

https://www.uops.info/html-instr/VSQRTSD_XMM_XMM_XMM.html shows Skylake SQRTSD can vary from 13 to 19 cycle latency. The SKL (client) numbers only show 13 cycle latency, but we can see from the detailed SKL vsqrtsd page that they only tested with input = 0. SKX (server) numbers show 13-19 cycle latency. (This page has the detailed breakdown of the test code they used, including the binary bit-patterns for the tests.) Similar testing (with only 0 for client cores) was done on the non-VEX sqrtsd xmm, xmm page. :/

InstLatx64 results show best / worst case latencies of 13 to 18 cycles on Skylake-X (which uses the same core as Skylake-client, but with AVX512 enabled).

Agner Fog's instruction tables show 15-16 cycle latency on Skylake. (Agner does normally test with a range of different input values.) His tests are less automated and sometimes don't exactly match other results.

Note that most ISAs (including x86) use binary floating point:
the bits represent values as a linear significand (aka mantissa) times 2^exp, and a sign bit.

It seems that there may only be 2 speeds on modern Intel (since Haswell at least; see discussion with @harold in comments). For example, even powers of 2 are all fast, like 0.25, 1, 4, and 16. These have a trivial mantissa = 0x0, representing 1.0. https://www.h-schmidt.net/FloatConverter/IEEE754.html has a nice interactive decimal <-> bit-pattern converter for single-precision, with checkboxes for the set bits and annotations of what the mantissa and exponent represent.

On Skylake the only fast cases I've found in a quick check are even powers of 2 like 4.0 but not 2.0. These numbers have an exact sqrt result with both input and output having a 1.0 mantissa (only the implicit 1 bit set). 9.0 is not fast, even though it's exactly representable and so is the 3.0 result. 3.0 has mantissa = 1.5 with just the most significant bit of the mantissa set in the binary representation. 9.0's mantissa is 1.125 (0b00100...). So the non-zero bits are very close to the top, but apparently that's enough to disqualify it.
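
To check which candidates actually have those all-zero mantissa fields, here is a small sketch (mine, not part of the original answer) that prints the 52-bit mantissa field of each input and of its sqrt result. The fast cases (0.25, 1.0, 4.0, 16.0) are the ones where both fields are zero; 2.0 has a zero input mantissa but an irrational result.

#include <math.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Extract the 52-bit mantissa field of an IEEE-754 double. */
static uint64_t mantissa_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);          /* safe type-pun */
    return bits & ((1ULL << 52) - 1);
}

int main(void)                               /* build: gcc demo.c -lm */
{
    double vals[] = { 0.25, 1.0, 4.0, 16.0, 2.0, 3.0, 9.0, 0.15 };
    for (int i = 0; i < (int)(sizeof vals / sizeof vals[0]); i++) {
        double x = vals[i];
        printf("x=%-5g mant=0x%013llx   sqrt(x)=%-9g mant=0x%013llx\n",
               x, (unsigned long long)mantissa_bits(x),
               sqrt(x), (unsigned long long)mantissa_bits(sqrt(x)));
    }
    return 0;
}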

(+-Inf and NaN are fast, too. So are ordinary negative numbers: result = -NaN. I measure 13 cycle latency for these on i7-6700k, same as for 4.0, vs. 18 cycle latency for the slow case.)

x = sqrt(x) is definitely fast with x = 1.0 (all-zero mantissa except for the implicit leading 1 bit). It has a simple input and simple output.

With 2.0 the input is also simple (all-zero mantissa and exponent 1 higher) but the output is not a round number. sqrt(2) is irrational and thus has infinitely many non-zero bits in any base. This apparently makes it slow on Skylake.

Agner Fog's instruction tables say that AMD K10's integer div instruction performance depends on the number of significant bits in the dividend (input), not the quotient, but searching Agner's microarch pdf and instruction tables didn't find any footnotes or info about how sqrt specifically is data-dependent.

On older CPUs with even slower FP sqrt, there might be more room for a range of speeds. I think number of significant bits in the mantissa of the input will probably be relevant. Fewer significant bits (more trailing zeros in the significand) makes it faster, if this is correct. But again, on Haswell/Skylake the only fast cases seem to be even powers of 2.

You can test this with something that couples the output back to the input without breaking the data dependency, e.g. andps xmm0, xmm1 / orps xmm0, xmm2 to set a fixed value in xmm0 that's dependent on the sqrtsd output.
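
As an illustration of that coupling trick, a GNU C inline-asm sketch might look like the following (my own example of the technique, not code from the answer; it uses AT&T syntax like the question's asm, and the measured chain includes roughly 2 extra cycles for andps/orps):

/* Latency chain: each sqrtsd input depends on the previous sqrtsd output,
 * yet always ends up holding the same bit pattern under test. */
static double sqrt_latency_chain(double x, long iters)
{
    double r = x;
    const double zero = 0.0;
    for (long i = 0; i < iters; i++) {
        __asm__ volatile(
            "sqrtsd %0, %0\n\t"   /* r = sqrt(r)                             */
            "andps  %1, %0\n\t"   /* clear the result, keeping the data dep  */
            "orps   %2, %0\n\t"   /* re-insert the input bit pattern         */
            : "+&x" (r)
            : "x" (zero), "x" (x));
    }
    return r;
}

Timing a few million iterations of this loop (e.g. under perf stat) and dividing by the iteration count gives approximately the sqrtsd latency, plus about 2 cycles for andps/orps, for whichever bit pattern x holds.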

Or a simpler way to test latency is to take "advantage" of the false output dependency of sqrtsd xmm0, xmm1 - it and sqrtss leave the upper 64 / 32 bits (respectively) of the destination unmodified, so the output register is also an input for that merging. I assume this is how your naive inline-asm attempt ended up bottlenecking on latency instead of throughput, with the compiler picking a different register for the output so it could just re-read the same input in a loop. The inline asm you added to your question is totally broken and won't even compile, but perhaps your real code used "x" (xmm register) input and output constraints instead of "i" (immediate)?

This NASM source for a static executable test loop (to run under perf stat) uses that false dependency with the non-VEX encoding of sqrtsd.
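
In GNU C, the same false-dependency chain can be sketched roughly like this (my illustration, not the answer's NASM loop; the literal "sqrtsd" mnemonic in the asm template keeps the legacy-SSE encoding even when compiling with AVX enabled):

/* The non-VEX sqrtsd writes only the low 64 bits of the destination and
 * merges in its old upper half, so declaring dst as read-write ("+x")
 * makes every sqrtsd wait for the previous one, even though the value it
 * computes depends only on x. */
static double sqrt_latency_false_dep(double x, long iters)
{
    double dst = 0.0;
    for (long i = 0; i < iters; i++)
        __asm__ volatile("sqrtsd %1, %0" : "+x" (dst) : "x" (x));
    return dst;
}

Cycles per iteration then approximates the sqrtsd latency for the bit pattern in x; breaking the chain (for example with the VEX vsqrtsd dst, cold_reg, src form) would instead let the loop run at throughput.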

This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency.) AVX finally lets us avoid this with vsqrtsd dst, src, src, at least for register sources, and vcvtsi2sd dst, cold_reg, eax for the similarly short-sightedly designed scalar int->fp conversion instructions. (GCC missed-optimization reports: 80586, 89071, 80571.)

On many earlier CPUs even throughput was variable, but Skylake beefed up the dividers enough that the scheduler always knows it can start a new div/sqrt uop 3 cycles after the last single-precision input.

Even Skylake double-precision throughput is variable, though: 4 to 6 cycles after the last double-precision input uop, if Agner Fog's instruction tables are right. https://uops.info/ shows a flat 6c reciprocal throughput. (Or twice that long for 256-bit vectors; 128-bit and scalar can use separate halves of the wide SIMD dividers for more throughput but the same latency.) See also Floating point division vs floating point multiplication for some throughput/latency numbers extracted from Agner Fog's instruction tables.
