Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?


Problem Description


I understand it's important to use VZEROUPPER when mixing SSE and AVX code but what if I only use AVX (and ordinary x86-64 code) without using any legacy SSE instructions?

If I never use a single SSE instruction in my code, is there any performance reason why I would ever need to use VZEROUPPER?

This is assuming I'm not calling into any external libraries (that might be using SSE).

Solution

You're correct that if your whole program doesn't use any non-VEX instructions that write xmm registers, you don't need vzeroupper to avoid state-transition penalties.

Beware that non-VEX instructions can lurk in CRT startup code and/or the dynamic linker, or other highly non-obvious places.

That said, a non-VEX instruction can only cause a one-time penalty when it runs. The reverse isn't true: one VEX-256 instruction can make non-VEX instructions in general (or just with that register) slow for the rest of the program.
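To make the cleanup pattern concrete, here is a minimal sketch (not part of the original answer; the function name `sum8` is ours) of AVX code that executes `vzeroupper` via the `_mm256_zeroupper()` intrinsic before returning, so that any legacy non-VEX SSE code that runs later won't hit a transition penalty. It uses the GCC/Clang per-function `target("avx")` attribute so the file can be built without a global `-mavx` flag:

```c
#include <immintrin.h>

/* Illustrative sketch: sum 8 floats with AVX, then clean the dirty
 * YMM upper halves before returning to possibly-non-VEX code.
 * Compiled per-function with the "avx" target attribute. */
__attribute__((target("avx")))
float sum8(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);           /* dirties a YMM upper half */
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);           /* VEX-encoded 128-bit ops */
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float r = _mm_cvtss_f32(s);
    _mm256_zeroupper();   /* uppers now clean: no SSE/AVX transition
                             penalty if a caller later runs non-VEX SSE */
    return r;
}
```

Callers should guard with a runtime check such as `__builtin_cpu_supports("avx")` before invoking it. Note that compilers targeting AVX already insert `vzeroupper` automatically at function boundaries in most cases; the explicit intrinsic only matters when you manage that yourself.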


There's no penalty when mixing VEX and EVEX, so no need to use vzeroupper there.


On Skylake-AVX512: vzeroupper or vzeroall are the only ways to restore max-turbo after dirtying a ZMM register, assuming your program still uses any SSE*, AVX1, or AVX2 instructions on xmm/ymm0..15.

See also Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask? - merely reading a zmm doesn't cause this.

Posted by @BeeOnRope in chat:

There is a new, pretty bad effect with AVX-512 instructions on surrounding code: once a 512-bit instruction is executed (except perhaps for instructions that don't write to a zmm register) the core enters an "upper 256 dirty state". In this state, any later scalar FP/SSE/AVX instruction (anything using xmm or ymm regs) will internally be extended to 512 bits. This means the processor will be locked to no higher than the AVX turbo (the so-called "L1 license") until vzeroupper or vzeroall are issued.

Unlike the earlier "dirty upper 128" issue with AVX and legacy non-VEX SSE (which still exists on Skylake Xeon), this will slow down all code due to the lower frequency, but there are no "merging uops" or false dependencies or anything like that: it's just that the smaller operations are effectively treated as 512-bit wide in order to implement the zero-extending behavior.

about "writing the low halves ..." - no, it is a global state, and only vzero gets you out of it*. It occurs even if you dirty a zmm register but use different ones for ymm and xmm. It occurs even if the only dirtying instruction is a zeroing idiom like vpxord zmm0, zmm0, zmm0. It doesn't occur for writes to zmm16-31 though.

His description of actually extending all vector ops to 512 bits isn't quite right, because he later confirmed that it doesn't reduce throughput for 128 and 256-bit instructions. But we know that when 512-bit uops are in flight, the vector ALUs on port 1 are shut down. (So the 256-bit FMA units normally accessible via ports 0 and 1 can combine into a 512-bit unit for all FP math, integer multiply, and possibly some other stuff. Some SKX Xeons have a 2nd 512-bit FMA unit on port 5, some don't.)


For max-turbo after using only AVX1 / AVX2 (including on earlier CPUs like Haswell): Opportunistically powering down the upper halves of execution units if they haven't been used for a while (and sometimes allowing higher Turbo clock speeds) depends on whether YMM instructions have been used recently, not on whether the upper halves are dirty or not. So AFAIK, vzeroupper does not help the CPU un-throttle the clock speed sooner after using AVX1 / AVX2, for CPUs where max turbo is lower for 256-bit.

This is different from Intel's Skylake-AVX512 (SKX / Skylake-SP), where AVX512 is somewhat "bolted on".


VZEROUPPER might make context switches slightly cheaper

because the CPU still knows whether the ymm-upper state is clean or dirty.

If it's clean, I think xsaveopt or xsavec can write out the FPU state more compactly, without storing the all-zero upper halves at all (just setting a bit that says they're clean). Notice in the state-transition diagram for SSE/AVX that xsave / xrstor is part of the picture.

An extra vzeroupper just for this is only worth considering if your code won't use any 256b instructions for a long time after this, because ideally you won't have any context switches / CPU migrations before the next use of 256-bit vectors.
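As an illustration of that placement advice (a sketch of ours, not from the original answer; the names `avx_burst` and `work_then_wait` are invented), the idea is to issue `vzeroupper` after a burst of 256-bit work and just before a likely blocking point, so a context switch during the wait can take the compact xsave path:

```c
#define _POSIX_C_SOURCE 199309L
#include <immintrin.h>
#include <time.h>

/* Illustrative sketch: square-and-sum 4 doubles with AVX, then mark
 * the YMM uppers clean before the thread is likely to block. */
__attribute__((target("avx")))
double avx_burst(const double *p)
{
    __m256d v = _mm256_loadu_pd(p);   /* dirties YMM state */
    v = _mm256_mul_pd(v, v);
    double out[4];
    _mm256_storeu_pd(out, v);
    _mm256_zeroupper();  /* uppers clean: a context switch during the
                            upcoming sleep can save state compactly */
    return out[0] + out[1] + out[2] + out[3];
}

double work_then_wait(const double *p)
{
    double r = avx_burst(p);
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };
    nanosleep(&ts, NULL);  /* likely context-switch point */
    return r;
}
```

As the answer notes, this is only worth it if no 256-bit work follows soon; an extra vzeroupper right before more AVX code buys nothing.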

This may not apply as much on AVX512 CPUs: vzeroupper / vzeroall don't touch ZMM16..31, only ZMM0..15. So you can still have lots of dirty state after vzeroall.


(Plausible in theory): Dirty upper halves may be taking up physical registers (although IDK of any evidence for this being true on any real CPUs). If so, it would limit out-of-order window size for the CPU to find instruction-level parallelism. (ROB size is the other major limiting factor, but PRF size can be the bottleneck.)

This may be true on AMD CPUs before Zen2, where 256b ops are split into two 128b ops. YMM registers are handled internally as two 128-bit registers, and e.g. vmovaps ymm0, ymm1 renames the low 128 with zero latency, but needs a uop for the upper half. (See Agner Fog's microarch pdf). It's unknown whether vzeroupper can actually drop the renaming for the upper halves, though. Zeroing idioms on AMD Zen (unlike SnB-family) still need a back-end uop to write the register value, even for the 128b low half; only mov-elimination avoids a back-end uop. So there may not be a physical zero register that uppers can be renamed onto.

Experiments in that ROB size / PRF size blog post show that FP physical register file entries are 256-bit in Sandybridge, though. vzeroupper shouldn't free up more registers on mainstream Intel CPUs with AVX/AVX2. Haswell-style transition penalties are slow enough that it probably drains the ROB to save or restore uppers to separate storage that isn't renamed, not using up valuable PRF entries.

Silvermont doesn't support AVX. And it uses a separate retirement register file for the architectural state, so the out-of-order PRF only holds speculative execution results. So even if it did support AVX with 128-bit halves, a stale YMM register with a dirty upper half probably wouldn't be using up extra space in the rename register file.

KNL (Knight's Landing / Xeon Phi) is specifically designed to run AVX512, so presumably its FP register file has 512-bit entries. It's based on Silvermont, but the SIMD parts of the core are different (e.g. it can reorder FP/vector instructions, while Silvermont can only execute them speculatively but not reorder them within the FP/vector pipeline, according to Agner Fog). Still, KNL may also use a separate retirement register file, so dirty ZMM uppers wouldn't consume extra space even if it was able to split a 512-bit entry to store two 256-bit vectors. Which is unlikely, because a larger out-of-order window for only AVX1/AVX2 on KNL wouldn't be worth spending transistors on.

vzeroupper is much slower on KNL than on mainstream Intel CPUs (one per 36 cycles in 64-bit mode), so you probably wouldn't want to use it there, especially just for the tiny context-switch advantage.


On Skylake-AVX512, the evidence supports the conclusion that the vector physical register file is 512 bits wide.

Some future CPU might pair up entries in a physical register file to store wide vectors, even if they don't normally decode to separate uops the way AMD does for 256-bit vectors.

@Mysticial reports unexpected slowdowns in code with long FP dependency chains using YMM vs. ZMM registers in otherwise identical code, but later experiments disagree with the conclusion that SKX uses 2x 256-bit register file entries for ZMM registers when the upper 256 bits are dirty.

