Set all bits in CPU register to 1 efficiently


Question


To clear all bits you often see an exclusive or as in XOR eax, eax. Is there such a trick for the opposite too?

All I can think of is to invert the zeroes with an extra instruction.

Solution

For most architectures with fixed-width instructions, the answer will probably be a boring one instruction mov of a sign-extended or inverted immediate, or a mov lo/high pair. e.g. on ARM, mvn r0, #0 (move-not). See gcc asm output for x86, ARM, ARM64, and MIPS, on the Godbolt compiler explorer. IDK anything about zseries asm or machine code.
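
For illustration, roughly what the single-instruction version looks like on a few fixed-width ISAs (one line per ISA, not a single runnable file; register choices are illustrative):

    mvn   r0, #0          @ ARM32: r0 = ~0 = all-ones
    mov   x0, #-1         // AArch64: assembles to MOVN x0, #0
    li    $v0, -1         # MIPS: pseudo-op, expands to addiu $v0, $zero, -1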

In ARM, eor r0,r0,r0 is significantly worse than a mov-immediate. It depends on the old value, with no special-case handling. Memory dependency-ordering rules prevent an ARM uarch from special-casing it even if they wanted to. Same goes for most other RISC ISAs with weakly-ordered memory but that don't require barriers for memory_order_consume (in C++11 terminology).


x86 xor-zeroing is special because of its variable-length instruction set. Historically, 8086 xor ax,ax was fast directly because it was small. Since the idiom became widely used (and zeroing is much more common than all-ones), CPU designers gave it special support, and now xor eax,eax is faster than mov eax,0 on Intel Sandybridge-family and some other CPUs, even without considering direct and indirect code-size effects. See What is the best way to set a register to zero in x86 assembly: xor, mov or and? for as many micro-architectural benefits as I've been able to dig up.

If x86 had a fixed-width instruction-set, I wonder if mov reg, 0 would have gotten as much special treatment as xor-zeroing has? Perhaps, because dependency-breaking before writing the low8 or low16 is important.


The standard options for best performance (a NASM sketch of both encodings follows the list):

  • mov eax, -1: 5 bytes, using the mov r32, imm32 encoding. (There is no sign-extending mov r32, imm8, unfortunately). Excellent performance on all CPUs. 6 bytes for r8-r15 (REX prefix).
  • mov rax, -1: 7 bytes, using the mov r/m64, sign-extended-imm32 encoding. (Not the REX.W=1 version of the eax version. That would be 10-byte mov r64, imm64). Excellent performance on all CPUs.
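
As a concrete sketch (NASM, byte counts in comments; most assemblers pick the 7-byte sign-extended form for the 64-bit case automatically):

    mov  eax, -1          ; B8 FF FF FF FF        (5 bytes: mov r32, imm32)
    mov  rax, -1          ; 48 C7 C0 FF FF FF FF  (7 bytes: mov r/m64, simm32)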

The weird options that save some code-size, usually at the expense of performance (a combined NASM sketch follows the list):

  • xor eax,eax/dec rax (or not rax): 5 bytes (4 for 32-bit eax). Downside: two uops for the front-end. Still only one unfused-domain uop for the scheduler/execution units on recent Intel where xor-zeroing is handled in the front-end. mov-immediate always needs an execution unit. (But integer ALU throughput is rarely a bottleneck for instructions that can use any port; the extra front-end pressure is the problem)
  • xor ecx,ecx / lea eax, [rcx-1]: 5 bytes total for 2 constants (6 bytes for rax); leaves a separate zeroed register. If you already want a zeroed register, there is almost no downside to this. lea can run on fewer ports than mov r,i on most CPUs, but since this is the start of a new dependency chain, the CPU can run it in any spare execution-port cycle after it issues.

    The same trick works for any two nearby constants, if you do the first one with mov reg, imm32 and the second with lea r32, [base + disp8]. disp8 has a range of -128 to +127, otherwise you need a disp32.

  • or eax, -1: 3 bytes (4 for rax), using the or r/m32, sign-extended-imm8 encoding. Downside: false dependency on the old value of the register.

  • push -1 / pop rax: 3 bytes. Slow but small. Recommended only for exploits / code-golf. Works for any sign-extended-imm8, unlike most of the others.

    Downsides:

    • uses store and load execution units, not ALU. (Possibly a throughput advantage in rare cases on AMD Bulldozer-family where there are only two integer execution pipes, but decode/issue/retire throughput is higher than that. But don't try it without testing.)
    • store/reload latency means rax won't be ready for ~5 cycles after this executes on Skylake, for example.
    • (Intel): puts the stack-engine into rsp-modified mode, so the next time you read rsp directly it will take a stack-sync uop. (e.g. for add rsp, 28, or for mov eax, [rsp+8]).
    • The store could miss in cache, triggering extra memory traffic. (Possible if you haven't touched the stack inside a long loop).
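
The combined sketch mentioned above (NASM, 64-bit; byte counts in comments, and the 1000/1023 pair is a hypothetical example of the "two nearby constants" trick):

    xor  eax, eax         ; 2 bytes
    dec  rax              ; 3 bytes: rax = -1, but 2 front-end uops total

    xor  ecx, ecx         ; 2 bytes: ecx = 0, kept around
    lea  eax, [rcx-1]     ; 3 bytes: eax = -1

    mov  edx, 1000        ; 5 bytes: first constant, mov r32, imm32
    lea  ecx, [rdx+23]    ; 3 bytes: 1023 = 1000 + 23, within disp8 range

    or   eax, -1          ; 3 bytes: false dependency on old eax
    push -1               ; 2 bytes
    pop  rax              ; 1 byte: slow store/reload, code-golf only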

Vector regs are different

Setting vector registers to all-ones with pcmpeqd xmm0,xmm0 is special-cased on most CPUs as dependency-breaking (not Silvermont/KNL), but still needs an execution unit to actually write the ones. pcmpeqb/w/d/q all work, but q is slower on some CPUs.

For AVX2, the ymm equivalent vpcmpeqd ymm0, ymm0, ymm0 is also the best choice.
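
In NASM syntax, the two idioms from the preceding paragraphs:

    pcmpeqd  xmm0, xmm0            ; SSE2: xmm0 = all-ones (dep-breaking on most CPUs)
    vpcmpeqd ymm0, ymm0, ymm0      ; AVX2: ymm0 = all-ones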

For AVX without AVX2 the choice is less clear: there is no one obvious best approach. Compilers use various strategies: gcc prefers to load a 32-byte constant with vmovdqa, while older clang uses 128-bit vpcmpeqd followed by a cross-lane vinsertf128 to fill the high half. Newer clang uses vxorps to zero a register then vcmptrueps to fill it with ones. This is the moral equivalent of the vpcmpeqd approach, but the vxorps is needed to break the dependency on the prior version of the register and vcmptrueps has a latency of 3. It makes a reasonable default choice.

Doing a vbroadcastss from a 32-bit value is probably strictly better than the load approach, but it is hard to get compilers to generate this.

The best approach probably depends on the surrounding code.

Fastest way to set __m256 value to all ONE bits


AVX512 compares are only available with a mask register (like k0) as the destination, so compilers are currently using vpternlogd zmm0,zmm0,zmm0, 0xff as the 512b all-ones idiom. (0xff makes every element of the 3-input truth-table a 1). This is not special-cased as dependency-breaking on KNL or SKL, but it has 2-per-clock throughput on Skylake-AVX512. This beats using a narrower dependency-breaking AVX all-ones and broadcasting or shuffling it.
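
Concretely, the idiom is a single instruction:

    vpternlogd zmm0, zmm0, zmm0, 0xff   ; truth table 0xff: every result bit = 1
                                        ; note: still reads the old zmm0 (not dep-breaking)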

If you need to re-generate all-ones inside a loop, obviously the most efficient way is to use a vmov* to copy an all-ones register. This doesn't even use an execution unit on modern CPUs (but still takes front-end issue bandwidth). But if you're out of vector registers, loading a constant or [v]pcmpeq[b/w/d] are good choices.

For AVX512, it's worth trying VPMOVM2D zmm0, k0 or maybe VPBROADCASTD zmm0, eax. Each has only 1c throughput, but they should break dependencies on the old value of zmm0 (unlike vpternlogd). They require a mask or integer register which you initialized outside the loop with kxnorw k1,k0,k0 or mov eax, -1.
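
A hedged loop skeleton for this (register choices and the loop counter are illustrative; vpmovm2d needs AVX512DQ, vpbroadcastd from a GPR is AVX512F):

    kxnorw  k1, k0, k0           ; outside the loop: k1 = all-ones mask
    mov     eax, -1              ; integer source for the broadcast variant
    regen_loop:
        vpmovm2d     zmm0, k1    ; zmm0 = all-ones, no dep on old zmm0
        vpbroadcastd zmm1, eax   ; alternative: broadcast -1 from a GPR
        ; ... loop body that clobbers zmm0/zmm1 ...
        dec     rcx
        jnz     regen_loop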


For AVX512 mask registers, kxnorw k1,k0,k0 works, but it's not dependency-breaking on current CPUs. Intel's optimization manual suggests using it for generating an all-ones before a gather instruction, but recommends avoiding using the same input register as the output. This avoids making an otherwise-independent gather dependent on a previous one in a loop. Since k0 is often unused, it's usually a good choice to read from.
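
A sketch of that gather pattern (operands illustrative; note the output k1 differs from the input k0):

    kxnorw      k1, k0, k0                  ; k1 = all-ones mask, reads (often-unused) k0
    vpgatherdd  zmm1{k1}, [rdi + zmm2*4]    ; gather clears k1 elements as they complete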

I think vpcmpeqd k1, zmm0,zmm0 would work, but it's probably not special-cased as a k1=all-ones idiom with no dependency on zmm0. (To set all 64 bits instead of just the low 16, use AVX512BW vpcmpeqb)

On Skylake-AVX512, k instructions that operate on mask registers only run on a single port, even simple ones like kandw. (Also note that Skylake-AVX512 won't run vector uops on port1 when there are any 512b operations in the pipe, so execution unit throughput can be a real bottleneck.)

There is no kmov k0, imm, only moves from integer or memory. Probably there are no k instructions where same,same is detected as special, so the hardware in the issue/rename stage doesn't look for it for k registers.
