Why are there no NAND, NOR and XNOR instructions in X86?


Question

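  • They're among the simplest "instructions" you could perform on a computer (they're the first ones I'd personally implement).
  • Performing NOT(AND(x, y)) doubles execution time, dependency chain length, and code size.
  • BMI1 introduced "andnot", which is a meaningful addition and a unique operation. Why not the ones in the title of this question?
  • You usually read answers along the lines of "they take up valuable opcode space". But then I look at all of the kmask operations introduced with AVX512, which, by the way, include NAND and XNOR...
  • Optimizing compilers could generate better code.
  • It gets worse with SIMD: there is no NOT instruction, which means tripling execution time, dependency chain length (<= not true; thanks @Peter Cordes) and code size, instead of doubling: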
vpcmpeqd  xmm15, xmm15, xmm15
vpor      xmm0,  xmm0,  xmm1
vpandn    xmm0,  xmm0,  xmm15
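
As a rough illustration of what the three-instruction sequence above computes, here is a minimal C sketch that models a single 32-bit lane. The function names `andn` and `nor_via_andn` are made up for this example; the only assumption taken from the instruction set is that vpandn inverts its first source operand (dst = ~src1 & src2).

```c
#include <stdint.h>

/* Models one 32-bit lane of vpandn: only the FIRST source is inverted. */
uint32_t andn(uint32_t src1, uint32_t src2) {
    return (uint32_t)~src1 & src2;
}

/* NOR built the same way as the three-instruction sequence. */
uint32_t nor_via_andn(uint32_t x, uint32_t y) {
    const uint32_t all_ones = UINT32_MAX; /* vpcmpeqd reg, reg, reg */
    uint32_t t = x | y;                   /* vpor */
    return andn(t, all_ones);             /* vpandn: ~t & all_ones == ~t */
}
```

The all-ones constant does not depend on x or y, which is why generating it sits off the dependency chain that runs from the inputs to the result.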

Recommended answer

Those instructions wouldn't be as valuable as you imagine, and once a base ISA has been created, architects typically don't add new instructions unless there's a big win for some important use-case. (e.g. MMX isn't a big win overall for most code, but was a huge speedup for video/audio codecs as one of the early use-cases.)

Remember, most code isn't doing branchless bithacks. That only became much more common with SIMD, decades after 8086. I doubt most programmers would rather have nor than or. (8086 had no space left for more standard ALU instruction encodings that follow its normal patterns; see footnote 1.) A lot of code spends a lot of its time comparing-and-branching, looping over data structures (and stalling for memory), or doing "normal" math. Certainly bit-manipulation code exists, but a lot of code doesn't involve much of that.

Saving an instruction or two all over the place will help, but only if you can compile your whole application with these new instructions. (Although most of BMI1 and BMI2 are actually like that, e.g. SHLX/SHRX for 1-uop copy-and-shift-by-variable, but Intel still added them to patch over the really crappy 3-uop shift-by-cl.) That's fine if you're targeting a specific server (so you can build with -march=native), but a lot of x86 code is ahead-of-time compiled for use on random consumer machines. Extensions like SSE can greatly speed up single loops, so it's usually viable to dispatch to different versions of one single function to take advantage, while keeping the baseline requirement low.
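
The per-function dispatch described above could look roughly like the following sketch. Everything here is illustrative: `or_reduce_baseline`, `or_reduce_wide`, `cpu_has_avx2` and `select_or_reduce` are hypothetical names, the "wide" version is a plain-C stand-in for a function that would really be compiled with wider SIMD enabled, and the feature check assumes GCC or Clang on x86 (it falls back to the baseline elsewhere).

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline version: must run on every CPU the binary targets. */
uint32_t or_reduce_baseline(const uint32_t *v, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc |= v[i];
    return acc;
}

/* Stand-in for a version built with a newer extension enabled.
   It is plain C here so the sketch stays portable; same result as baseline. */
uint32_t or_reduce_wide(const uint32_t *v, size_t n) {
    uint32_t acc0 = 0, acc1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) { acc0 |= v[i]; acc1 |= v[i + 1]; }
    if (i < n) acc0 |= v[i];
    return acc0 | acc1;
}

typedef uint32_t (*or_reduce_fn)(const uint32_t *, size_t);

/* Runtime feature check; assumes the GCC/Clang builtin on x86 only. */
int cpu_has_avx2(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0; /* unknown platform: stay on the baseline */
#endif
}

/* Pick an implementation once; callers keep the returned pointer. */
or_reduce_fn select_or_reduce(void) {
    return cpu_has_avx2() ? or_reduce_wide : or_reduce_baseline;
}
```

This works well when one hot loop carries most of the benefit. It does not help with instructions whose gain is an instruction or two scattered across the whole program, which is the situation the paragraph above describes.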

But it wouldn't work that way for newly-added versions of the instructions you're suggesting, so the benefit of adding them is significantly lower. And they weren't already present because 8086 is super cramped.

But most ISAs don't have these: not ARM, not even PowerPC, which chooses to use the coding space in its 32-bit instruction words to have a lot of opcodes. (Including neat stuff like rlwinm rotate and mask with a bit-range, and other bitfield insert/extract to arbitrary position stuff.) So it's not just a matter of 8086 legacy screwing x86-64 yet again, it's that most CPU architects haven't considered it worth adding opcodes for these, even in a RISC with lots of space.

Although MIPS does have a nor, instead of a not. (MIPS xori zero-extends the immediate so it couldn't be used to NOT a full register.)
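
A small C model may make the MIPS point concrete. The helper names `mips_nor`, `mips_xori` and `not_via_nor` are hypothetical; the sketch only assumes what the paragraph states, namely that nor computes ~(rs | rt) and that xori zero-extends its 16-bit immediate.

```c
#include <stdint.h>

/* nor rd, rs, rt */
uint32_t mips_nor(uint32_t rs, uint32_t rt) {
    return (uint32_t)~(rs | rt);
}

/* xori rt, rs, imm16: the immediate is zero-extended, so the upper
   16 bits of rs always pass through unchanged. */
uint32_t mips_xori(uint32_t rs, uint16_t imm16) {
    return rs ^ (uint32_t)imm16;
}

/* NOT of a full register: NOR with a zero operand (the zero register). */
uint32_t not_via_nor(uint32_t x) {
    return mips_nor(x, 0u);
}
```

With the largest possible immediate, `mips_xori(x, 0xFFFF)` flips only the low half of the register, so it cannot stand in for a full-width NOT.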

Note that once you've created an all-ones vector once, you can reuse it in a loop. Most SIMD code is in loops, although careful use of SIMD for a single struct can be good.

A SIMD NOT adds only 1 cycle to the critical path, for a total of 2 cycles of latency for your NOR implementation. In your example, pcmpeqd is off the critical path and has no dependency on the old value of the reg on almost all CPUs. (It still needs a SIMD execution unit to write the ones, though.) It costs throughput but not latency. Execution time might depend on either throughput or latency, for a given block of code. (See "How many CPU cycles are needed for each assembly instruction?" (it's not that simple) and "What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?")

BTW, compilers often use vpxor with all-ones instead of vpandn; the only advantage is with a memory source operand, where you can NOT-and-load with xor, unlike vpandn where the optionally-memory operand (src2) is the one that's not inverted: dst = ~src1 & src2.

You can often arrange your code to not need an inversion, e.g. checking the opposite FLAG condition after an OR. Not always; of course when you're doing a chain of bitwise things it can come up, probably more so with SIMD.
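
As a hedged illustration of avoiding the inversion, the two hypothetical functions below answer the same question ("are both inputs zero?"). The first computes a NOR and compares it with all-ones; the second just tests the OR result against zero, which on x86 corresponds to branching on the zero flag that or already sets.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inverting first: OR, NOT, then compare against all-ones. */
bool both_zero_inverted(uint32_t a, uint32_t b) {
    return (uint32_t)~(a | b) == UINT32_MAX;
}

/* Checking the opposite condition: OR, then test for zero. No NOT needed. */
bool both_zero_direct(uint32_t a, uint32_t b) {
    return (a | b) == 0u;
}
```

The rewrite is only available when the inverted value feeds a condition. If the NOR result itself is needed as data for further bitwise operations, the inversion has to happen somewhere.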

The real speedup from adding more such instructions to BMI1 or a future extension would probably be (or would have been) quite small for most general workloads like SPECint.

More valuable than integer xnor etc. probably would be non-destructive VEX versions of common integer instructions like sub that can't be done with LEA. So lots of mov/sub sequences could be vsub. Also maybe imul, or, maybe and, and perhaps shl/shr/sar-immediate. But sure if you're adding stuff, might as well have nand, nor, and xnor. And maybe scalar abs, and setcc r/m32 to avoid the stupid xor-zeroing or movzx you need to booleanize into a 32-bit integer. (While you're at it, mov r/m32, sign_extended_imm8 would also be good for code-density if you could find a one-byte opcode for it, e.g. one of the ones that 64-bit mode freed up.)

There's a whole laundry list of bad or short-sighted design decisions it would be nice to reverse (or that it would have been nice if AVX fixed), e.g. that cvtsi2sd xmm0, eax merges into XMM0 so it has a false dependency, leading GCC to spend an extra insn on xor-zeroing the destination. AVX was a chance to change that behaviour for the VEX version, and maybe could have been handled internally by giving the existing execution unit the physical zero-reg as the merge target. (Which exists in the physical register file on SnB-family, that's why xor-zeroing can be fully eliminated in rename, like mov-elimination.) But nope, Intel kept everything as much like the legacy-SSE versions as they possibly could, preserving that short-sighted Pentium III design decision. :( (PIII split xmm regs into two 64-bit halves: only writing the low half was good for it for SSE1 cvtsi2ss. Intel continued with the merging for SSE2 cvtsi2sd in P4 for consistency I guess.)

It might have made sense to add negated-boolean instructions in some SIMD version before AVX-512, like SSE4.1 (which added a bunch of miscellaneous integer stuff and made things more orthogonal, and was only added in 45nm Core2, so transistor budgets were a lot higher than in the MMX or SSE1/2 days), or AVX (which opened up a lot of coding space with VEX).

But since they didn't, there's little point adding them now that vpternlogd exists. Unless Intel is going to create new legacy-SSE or 256-bit-only VEX extensions that AMD might want to implement...
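
To show why vpternlogd makes dedicated NAND/NOR/XNOR vector instructions redundant, here is a scalar C model of one 32-bit lane. It is a sketch, not Intel's definition: the function name `ternlog32` and the constant names are invented, and it assumes the usual description of the instruction in which the three input bits at each position form an index (first operand as the most significant bit) into the 8-bit immediate truth table.

```c
#include <stdint.h>

/* One 32-bit lane of a three-input bitwise function selected by imm8.
   For each bit position, (a_bit, b_bit, c_bit) picks one bit of imm8. */
uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8) {
    uint32_t result = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        result |= (uint32_t)((imm8 >> idx) & 1u) << bit;
    }
    return result;
}

/* Truth tables for two-input functions of a and b (c is ignored).
   Derived from A = 0xF0, B = 0xCC: e.g. NOR = ~(0xF0 | 0xCC) = 0x03. */
enum {
    TERNLOG_NAND_AB = 0x3F,
    TERNLOG_NOR_AB  = 0x03,
    TERNLOG_XNOR_AB = 0xC3
};
```

Since any of the 256 possible three-input functions is one immediate away, each negated boolean costs a single instruction wherever AVX-512 is available, with no all-ones constant needed.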

(Legacy-SSE would make it usable even in their Silvermont-family CPUs, and in Pentium/Celeron CPUs, none of which decode VEX prefixes. That's why unfortunately even Skylake Pentiums disable BMI1/2 support along with AVX1/2/FMA. This is really dumb and means we're no closer to being able to use BMI1/2 as a baseline for ahead-of-time compiled stuff that should run on "modern desktops".)

VEX has lots of coding space, and mask instructions use that. Also, AVX-512 is only implemented by high-end CPUs; it will be a long time if ever before Intel's low-power Silvermont family CPUs implement it. So needing to decode all those different VEX-coded mask instructions is something AVX-512 CPUs just have to deal with.

AVX-512 (or a predecessor) was originally designed for Larrabee, a GPU project which turned into Xeon Phi compute cards. So AVX-512 ISA-design choices don't fully reflect what you might design with general-purpose usage in mind. Although having lots of relatively small cores would mean you'd want to avoid anything that inflated decoder die-area or power too much, so it's not unreasonable.

But without VEX, x86 opcode space is very crowded (literally no 1-byte opcodes left in 32-bit mode, and few 0f xx left. http://ref.x86asm.net/coder32.html). Intel (unlike AMD) still for some reason likes to make some CPUs that can't decode VEX prefixes. Of course they could change that and add VEX decoding into Silvermont so they could have VEX-coded integer instructions without supporting AVX (or all of BMI2). (BMI2 includes pext/pdep which are expensive to implement fast in a dedicated execution unit. AMD chooses to micro-code them so they're very slow, but that lets code use other BMI2 instructions usefully.)

(Unfortunately there's no way for a CPU to advertise (via CPUID) that it supports only 128-bit vector size AVX instructions, which would have allowed narrower CPUs to still get non-destructive instructions. OTOH, without some forward-compatible way for code to use wider instructions on CPUs that do support them, making 128-bit AVX code to optimize for current CPUs might end up being called "good enough" and not have anyone bother to make 256-bit versions for CPUs that can support them.)

Footnote 1: opcodes for original 8086 instructions

Just getting every different opcode decoded was a challenge for 8086, and each ALU instruction has about 8 different opcodes: memory dest, memory source, immediate source, and special-case no modrm AL/AX forms. And times two for 8 and 16-bit versions of each of those. Plus xnor r/m16, sign_extended_imm8. Of course the immediate forms can use the /r field in ModRM as extra opcode bits, but xnor r/m8, r and xnor r, r/m8 and the 16-bit forms would need 4 separate opcode bytes, and so would xnor al, imm8 and xnor ax, imm16, so that's 6 whole opcode bytes per instruction, plus some overloaded opcode /constant

(semi-related: https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code/160739#160739 re: short-form AL,imm8 encodings.)

Part of the pattern you can see in the original-8086 opcodes is that one bit selects between r/m destination vs. r/m source, and another bit between 8 and 16-bit operand-size (see "Is there a pattern to x86 op codes? (other than direction and size bits)" and "Are x86 opcodes arbitrary?"). So doing it differently for a few rarer instructions (by leaving out memory-dst or 8-bit forms, for example) might have broken the pattern, and if so would have needed more extra transistors than the standard patterns for feeding the ALU after a load or register fetch, or for load/alu/store.
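
The direction/size pattern can be sketched as a tiny decoder. This is an illustration under stated assumptions, not a full 8086 decoder: it assumes the commonly documented layout of opcode bytes 0x00 to 0x3F, where bits 5:3 select one of eight ALU operations, bit 1 is the direction bit, bit 0 is the size bit, and low-three-bit values 4 and 5 are the AL/AX immediate short forms. The struct and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    int is_alu;           /* 1 if the byte is one of the six regular ALU forms */
    const char *mnemonic; /* which of the eight operations */
    int accumulator_imm;  /* 1 for the no-ModRM AL/AX, imm short forms */
    int dest_is_reg;      /* direction bit (ModRM forms only): 1 = reg destination */
    int is_16bit;         /* size bit: 0 = 8-bit, 1 = 16-bit */
} AluDecode;

AluDecode decode_alu_opcode(uint8_t opcode) {
    static const char *const names[8] = {
        "add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"
    };
    AluDecode d = {0, "", 0, 0, 0};
    unsigned form = opcode & 7u;

    /* Only 0x00..0x3F hold these forms, and the last two slots of each
       group of eight bytes are used by other instructions. */
    if (opcode >= 0x40 || form >= 6) return d;

    d.is_alu = 1;
    d.mnemonic = names[(opcode >> 3) & 7u]; /* 3-bit operation field */
    d.is_16bit = (int)(opcode & 1u);        /* size bit */
    if (form >= 4) {
        d.accumulator_imm = 1;              /* op al, imm8 / op ax, imm16 */
    } else {
        d.dest_is_reg = (int)((opcode >> 1) & 1u); /* direction bit */
    }
    return d;
}
```

For example, `decode_alu_opcode(0x32)` reports xor with a register destination and 8-bit operands. All eight values of the 3-bit operation field are already taken by existing instructions, so a ninth operation following the same pattern has no slot to go into.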

In fact, I don't think 8086 left enough room for even one more ALU instruction that supported all the standard forms like add or or. And 8086 didn't decode any 0f xx opcodes; that came later for extensions.
