What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?

Question

Say, I want to clear 4 zmm registers.

Will the following code provide the fastest speed?

vpxorq  zmm0, zmm0, zmm0
vpxorq  zmm1, zmm1, zmm1
vpxorq  zmm2, zmm2, zmm2
vpxorq  zmm3, zmm3, zmm3

On AVX2, if I wanted to clear ymm registers, vpxor was fastest, faster than vxorps, since vpxor could run on multiple units.
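
That is, the comparison was between the two flavors of 256-bit XOR-zeroing (shown here just as a sketch for context):

vpxor   ymm0, ymm0, ymm0    # integer-domain XOR-zeroing
vxorps  ymm1, ymm1, ymm1    # FP-domain XOR-zeroing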

On AVX512, we don't have vpxor for zmm registers, only vpxorq and vpxord. Is that an efficient way to clear a register? Is the CPU smart enough to not make false dependencies on previous values of the zmm registers when I clear them with vpxorq?

I don't have a physical AVX512 CPU yet to test this - maybe somebody has tested it on Knights Landing? Are there any published latency numbers?

Answer

The most efficient way is to take advantage of AVX implicit zeroing out to VLMAX (the maximum vector register width, determined by the current value of XCR0):

vpxor  xmm6, xmm6, xmm6
vpxor  xmm7, xmm7, xmm7
vpxor  xmm8, xmm0, xmm0   # still a 2-byte VEX prefix as long as the source regs are in the low 8
vpxor  xmm9, xmm0, xmm0

These are only 4-byte instructions (2-byte VEX prefix), instead of 6 bytes (4-byte EVEX prefix). Notice the use of source registers in the low 8 to allow a 2-byte VEX even when the destination is xmm8-xmm15. (A 3-byte VEX prefix is required when the second source reg is x/ymm8-15). And yes, this is still recognized as a zeroing idiom as long as both source operands are the same register (I tested that it doesn't use an execution unit on Skylake).
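
For a rough picture of the encoding sizes involved (the byte counts here follow from the VEX/EVEX prefix rules described above; an assembler listing will confirm them):

vpxor   xmm9, xmm9, xmm9    # source uses xmm8-15 -> 3-byte VEX prefix, 5 bytes total
vpxor   xmm9, xmm0, xmm0    # sources in xmm0-7   -> 2-byte VEX prefix, 4 bytes total
vpxorq  zmm9, zmm9, zmm9    # 4-byte EVEX prefix                     -> 6 bytes total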

Other than code-size effects, the performance is identical to vpxord/q zmm and vxorps zmm on Skylake-AVX512 and KNL. (And smaller code is almost always better.) But note that KNL has a very weak front-end, where max decode throughput can only barely saturate the vector execution units and is usually the bottleneck according to Agner Fog's microarch guide. (It has no uop cache or loop buffer, and max throughput of 2 instructions per clock. Also, average fetch throughput is limited to 16B per cycle.)

Also, on hypothetical future AMD (or maybe Intel) CPUs that decode AVX512 instructions as two 256b uops (or four 128b uops), this is much more efficient. Current AMD CPUs (including Ryzen) don't detect zeroing idioms until after decoding vpxor ymm0, ymm0, ymm0 to 2 uops, so this is a real thing. Unfortunately compilers get it wrong: gcc bug 80636, clang bug 32862.
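
Concretely, the missed optimization is roughly this (a sketch; what a given compiler actually emits depends on its version and flags):

vpxord  zmm0, zmm0, zmm0    # EVEX encoding, 6 bytes: the form the buggy compilers emitted
vpxor   xmm0, xmm0, xmm0    # 4 bytes, and still zeroes all of zmm0 via implicit zeroing to VLMAX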

Zeroing zmm16-31 does need an EVEX-encoded instruction; vpxord or vpxorq are equally good choices. EVEX vxorps requires AVX512DQ for some reason (unavailable on KNL), but EVEX vpxord/q is baseline AVX512F.

vpxor   xmm14, xmm0, xmm0
vpxor   xmm15, xmm0, xmm0
vpxord  zmm16, zmm16, zmm16     # or XMM if you already use AVX512VL for anything
vpxord  zmm17, zmm17, zmm17

EVEX prefixes are fixed-width, so there's nothing to be gained from using zmm0.

If the target supports AVX512VL (Skylake-AVX512 but not KNL) then you can still use vpxord xmm31, ... for better performance on future CPUs that decode 512b instructions into multiple uops.
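
For instance (a minimal sketch, assuming AVX512VL):

vpxord  xmm31, xmm31, xmm31   # EVEX, but only 128-bit wide; still zeroes the full zmm31,
                              # and wouldn't need splitting on a CPU that cracks 512-bit ops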

If your target has AVX512DQ (Skylake-AVX512 but not KNL), it's probably a good idea to use vxorps when creating an input for an FP math instruction, or vpxord in any other case. No effect on Skylake, but some future CPU might care. Don't worry about this if it's easier to always just use vpxord.
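
A sketch of that convention (register choices are arbitrary; the EVEX vxorps form assumes AVX512DQ, as noted above):

vxorps  zmm16, zmm16, zmm16   # zero an accumulator that feeds FP math (vaddps / vfmadd...)
vpxord  zmm17, zmm17, zmm17   # zero a register used for integer or other work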

Related: the optimal way to generate all-ones in a zmm register appears to be vpternlogd zmm0,zmm0,zmm0, 0xff. (With a lookup-table of all-ones, every entry in the logic table is 1). vpcmpeqd same,same doesn't work, because the AVX512 version compares into a mask register, not a vector.

This special-case of vpternlogd/q is not special-cased as independent on KNL or on Skylake-AVX512, so try to pick a cold register. It is pretty fast, though, on SKL-avx512: 2 per clock throughput according to my testing. (If you need multiple regs of all-ones, use vpternlogd on one and copy the result, especially if your code will run on Skylake and not just KNL.)
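
A sketch of that copy pattern (register numbers are arbitrary):

vpternlogd zmm18, zmm18, zmm18, 0xff   # all-ones; not treated as independent, so pick a cold register
vmovdqa64  zmm19, zmm18                # copy the result instead of running vpternlogd again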

I picked 32-bit element size (vpxord instead of vpxorq) because 32-bit element size is widely used, and if one element size is going to be slower, it's usually not 32-bit that's slow. e.g. pcmpeqq xmm0,xmm0 is a lot slower than pcmpeqd xmm0,xmm0 on Silvermont. pcmpeqw is another way of generating a vector of all-ones (pre AVX512), but gcc picks pcmpeqd. I'm pretty sure it will never make a difference for xor-zeroing, especially with no mask-register, but if you're looking for a reason to pick one of vpxord or vpxorq, this is as good a reason as any unless someone finds a real perf difference on any AVX512 hardware.

Interesting that gcc picks vpxord, but vmovdqa64 instead of vmovdqa32.

XOR-zeroing doesn't use an execution port at all on Intel SnB-family CPUs, including Skylake-AVX512. (TODO: incorporate some of this into that answer, and make some other updates to it...)

But on KNL, I'm pretty sure xor-zeroing needs an execution port. The two vector execution units can usually keep up with the front-end, so handling xor-zeroing in the issue/rename stage would make no perf difference in most situations. vmovdqa64 / vmovaps need a port (and more importantly have non-zero latency) according to Agner Fog's testing, so we know it doesn't handle those in the issue/rename stage. (It could be like Sandybridge and eliminate xor-zeroing but not moves. But I doubt it because there'd be little benefit.)

As Cody points out, Agner Fog's tables indicate that KNL runs both vxorps/d and vpxord/q on FP0/1 with the same throughput and latency, assuming they do need a port. I assume that's only for xmm/ymm vxorps/d, unless Intel's documentation is in error and EVEX vxorps zmm can run on KNL.

Also, on Skylake and later, non-zeroing vpxor and vxorps run on the same ports. The run-on-more-ports advantage for vector-integer booleans is only a thing on Intel Nehalem to Broadwell, i.e. CPUs that don't support AVX512. (It even matters for zeroing on Nehalem, where it actually needs an ALU port even though it is recognized as independent of the old value).

The bypass-delay latency on Skylake depends on what port it happens to pick, rather than on what instruction you used. i.e. vaddps reading the result of a vandps has an extra cycle of latency if the vandps was scheduled to p0 or p1 instead of p5. See Intel's optimization manual for a table. Even worse, this extra latency applies forever, even if the result sits in a register for hundreds of cycles before being read. It affects the dep chain from the other input to the output, so it still matters in this case. (TODO: write up the results of my experiments on this and post them somewhere.)
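
For example, in a dependency chain like the following (a sketch; the actual penalty depends on which port the vandps happens to get):

vandps  ymm1, ymm2, ymm3    # may be scheduled to p0/p1 or to p5
vaddps  ymm0, ymm1, ymm4    # sees one extra cycle of latency on ymm1 if the vandps ran on p0 or p1,
                            # and that penalty stays attached to this dep chain from then on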
