Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?

Question

I understand it's important to use VZEROUPPER when mixing SSE and AVX code but what if I only use AVX (and ordinary x86-64 code) without using any legacy SSE instructions?

If I never use a single SSE instruction in my code, is there any performance reason why I would ever need to use VZEROUPPER?

This is assuming I'm not calling into any external libraries (that might be using SSE).

Answer

You're correct that if your whole program doesn't use any non-VEX instructions that write xmm registers, you don't need vzeroupper to avoid state-transition penalties.

Beware that non-VEX instructions can lurk in CRT startup code and/or the dynamic linker, or other highly non-obvious places.

That said, a non-VEX instruction can only cause a one-time penalty when it runs. The reverse isn't true: one VEX-256 instruction can make non-VEX instructions in general (or just with that register) slow for the rest of the program.
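
To make that asymmetry concrete, here is a hedged sketch (legacy_sse_filter is a hypothetical external routine, perhaps built without -mavx): one VEX-256 instruction leaves the ymm uppers dirty, so later legacy-SSE code pays penalties unless vzeroupper runs first. Note that with default settings compilers already insert vzeroupper automatically before calls and returns in functions that use 256-bit vectors; the explicit intrinsic is shown for clarity.

    #include <immintrin.h>

    /* Hypothetical routine that may contain legacy-SSE instructions. */
    extern void legacy_sse_filter(float *dst, const float *src);

    void process(float *dst, const float *src) {
        __m256 v = _mm256_loadu_ps(src);
        _mm256_storeu_ps(dst, _mm256_add_ps(v, v)); /* VEX-256: dirties ymm uppers */
        _mm256_zeroupper();   /* vzeroupper: mark the uppers clean before
                                 possibly non-VEX code runs */
        legacy_sse_filter(dst, src);
    }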

There's no penalty when mixing VEX and EVEX, so no need to use vzeroupper there.

On Skylake-AVX512: vzeroupper or vzeroall are the only way to restore max-turbo after dirtying a ZMM register, assuming your program still uses any SSE*, AVX1, or AVX2 instructions on xmm/ymm0..15.

See also Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask? - merely reading a zmm doesn't cause this.
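
For example (a sketch assuming -mavx512f and that n is a multiple of 16), a 512-bit loop followed by vzeroupper:

    #include <immintrin.h>
    #include <stddef.h>

    /* The 512-bit ops dirty the zmm state; the trailing vzeroupper is
       what lets later SSE/AVX1/AVX2 code on xmm/ymm0..15 run at max
       turbo again on Skylake-AVX512. */
    void add_arrays(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
        }
        _mm256_zeroupper();   /* vzeroupper: back to the clean state */
    }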

@BeeOnRope posted in chat:

There is a new, pretty bad effect with AVX-512 instructions on surrounding code: once a 512-bit instruction is executed (except perhaps for instructions that don't write to a zmm register) the core enters an "upper 256 dirty state". In this state, any later scalar FP/SSE/AVX instruction (anything using xmm or ymm regs) will internally be extended to 512 bits. This means the processor will be locked to no higher than the AVX turbo (the so-called "L1 license") until vzeroupper or vzeroall are issued.

Unlike the earlier "dirty upper 128" issue with AVX and legacy non-VEX SSE (which still exists on Skylake Xeon), this will slow down all code due to the lower frequency, but there are no "merging uops" or false dependencies or anything like that: it's just that the smaller operations are effectively treated as 512-bit wide in order to implement the zero-extending behavior.

about "writing the low halves ..." - no, it is a global state, and only vzero gets you out of it*. It occurs even if you dirty a zmm register but use different ones for ymm and xmm. It occurs even if the only dirtying instruction is a zeroing idiom like vpxord zmm0, zmm0, zmm0. It doesn't occur for writes to zmm16-31 though.

His description of actually extending all vector ops to 512 bits isn't quite right, because he later confirmed that it doesn't reduce throughput for 128 and 256-bit instructions. But we know that when 512-bit uops are in flight, the vector ALUs on port 1 are shut down. (So the 256-bit FMA units normally accessible via ports 0 and 1 can combine into a 512-bit unit for all FP math, integer shifts and multiply, and some other stuff.)

For max-turbo after using only AVX1 / AVX2 (including on earlier CPUs like Haswell): Opportunistically powering down the upper halves of execution units if they haven't been used for a while (and sometimes allowing higher Turbo clock speeds) depends on whether YMM instructions have been used recently, not on whether the upper halves are dirty or not. So AFAIK, vzeroupper does not help the CPU un-throttle the clock speed sooner after using AVX1 / AVX2, for CPUs where max turbo is lower for 256-bit.

This is different from Intel's Skylake-AVX512 (SKX / Skylake-SP), where AVX512 is somewhat "bolted on".

Separately, vzeroupper can make context switches slightly cheaper, because the CPU still knows whether the ymm-upper state is clean or dirty.

If it's clean, I think xsaveopt or xsavec can write out the FPU state more compactly, without storing the all-zero upper halves at all (just setting a bit that says they're clean). Notice in the state-transition diagram for SSE/AVX that xsave / xrstor is part of the picture.
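
As an aside, you can observe that clean/dirty tracking yourself, in principle (a hedged sketch: _xgetbv with ECX=1 reads the XINUSE bitmap and requires XGETBV-with-ECX=1 support, CPUID.(EAX=0Dh,ECX=1):EAX[2]; compile with something like gcc -O2 -mavx -mxsave; the SDM allows the bit to read as set even for a clean state on some implementations):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* XINUSE bitmap: bit 2 = AVX state (ymm uppers) currently in use.
           When clear, xsaveopt/xsavec can skip saving that component. */
        unsigned long long xinuse = _xgetbv(1);
        printf("ymm uppers dirty: %s\n", (xinuse & (1ULL << 2)) ? "yes" : "no");

        volatile __m256 v = _mm256_set1_ps(1.0f);  /* dirty the uppers */
        (void)v;
        xinuse = _xgetbv(1);
        printf("after a 256-bit op: %s\n", (xinuse & (1ULL << 2)) ? "yes" : "no");
        return 0;
    }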

An extra vzeroupper just for this is only worth considering if your code won't use any 256b instructions for a long time after this, because ideally you won't have any context switches / CPU migrations before the next use of 256-bit vectors.

This may not apply as much on AVX512 CPUs: vzeroupper / vzeroall don't touch ZMM16..31, only ZMM0..15. So you can still have lots of dirty state after vzeroall.

Dirty upper halves may be taking up physical registers, limiting out-of-order window size for the CPU to find instruction-level parallelism. (ROB size is the other major limiting factor, but PRF size can be the bottleneck.)

This is definitely true on AMD CPUs before Zen2, where 256b ops are split into two 128b ops. YMM registers are handled internally as two 128-bit registers, and e.g. vmovaps ymm0, ymm1 renames the low 128 with zero latency, but needs a uop for the upper half. (See Agner Fog's microarch pdf)

Experiments in that ROB size / PRF size blog post show that FP physical register file entries are 256-bit in Sandybridge, though. vzeroupper shouldn't free up more registers on mainstream Intel CPUs with AVX/AVX2.

Silvermont doesn't support AVX. And it uses a separate retirement register file for the architectural state, so the out-of-order PRF only holds speculative execution results. So even if it did support AVX with 128-bit halves, a stale YMM register with a dirty upper half probably wouldn't be using up extra space in the rename register file.

KNL (Knight's Landing / Xeon Phi) is specifically designed to run AVX512, so presumably its FP register file has 512-bit entries. It's based on Silvermont, but the SIMD parts of the core are different (e.g. it can reorder FP/vector instructions, while Silvermont can only execute them speculatively but not reorder them within the FP/vector pipeline, according to Agner Fog). Still, KNL may also use a separate retirement register file, so dirty ZMM uppers wouldn't consume extra space even if it was able to split a 512-bit entry to store two 256-bit vectors. Which is unlikely, because a larger out-of-order window for only AVX1/AVX2 on KNL wouldn't be worth spending transistors on.

vzeroupper is much slower on KNL than on mainstream Intel CPUs (one per 36 cycles in 64-bit mode), so you probably wouldn't want to use it there, especially not just for the tiny context-switch advantage.

On Skylake-AVX512, the evidence supports the conclusion that the vector physical register file is 512 bits wide.

Some future CPU might pair up entries in a physical register file to store wide vectors, even if they don't normally decode to separate uops the way AMD does for 256-bit vectors.

@Mysticial reports unexpected slowdowns in code with long FP dependency chains when using YMM vs. ZMM in otherwise-identical code, but later experiments disagree with the conclusion that SKX uses 2x 256-bit register-file entries for ZMM registers when the upper 256 bits are dirty.
