What methods can be used to efficiently extend instruction length on modern x86?

Views: 113  Published: 2020/5/21 20:22:50  performance assembly optimization x86 micro-optimization


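Problem description

Imagine you want to align a series of x86 assembly instructions to certain boundaries. For example, you may want to align loops to a 16 or 32-byte boundary, or pack instructions so they are efficiently placed in the uop cache or whatever.

The simplest way to achieve this is single-byte NOP instructions, followed closely by multi-byte NOPs. Although the latter is generally more efficient, neither method is free: NOPs use front-end execution resources, and also count against your 4-wide¹ rename limit on modern x86.

Another option is to somehow lengthen some instructions to get the alignment you want. If this is done without introducing new stalls, it seems better than the NOP approach. How can instructions be efficiently made longer on recent x86 CPUs?

In the ideal world, lengthening techniques would simultaneously be:

Applicable to most instructions
Capable of lengthening the instruction by a variable amount
Free of decoder stalls or other decode slowdowns
Efficiently represented in the uop cache

It isn't likely that there is a single method that satisfies all of the above points simultaneously, so good answers will probably address various tradeoffs.

¹ The limit is 5 or 6 on AMD Ryzen.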

Solution

Consider mild code-golfing to shrink your code instead of expanding it, especially before a loop. e.g. xor eax,eax / cdq if you need two zeroed registers, or mov eax, 1 / lea ecx, [rax+1] to set registers to 1 and 2 in only 8 total bytes instead of 10. See Set all bits in CPU register to 1 efficiently for more about that, and Tips for golfing in x86/x64 machine code for more general ideas. Probably you still want to avoid false dependencies, though.

Or fill extra space by creating a vector constant on the fly instead of loading it from memory. (Adding more uop-cache pressure could be worse, though, for the larger loop that contains your setup + inner loop. But it avoids d-cache misses for constants, so it has an upside to compensate for running more uops.)

If you weren't already using them to load "compressed" constants, pmovsxbd, movddup, or vpbroadcastd are longer than movaps. dword / qword broadcast loads are free (no ALU uop, just a load).

If you're worried about code alignment at all, you're probably worried about how it sits in the L1I cache or where the uop-cache boundaries are, so just counting total uops is no longer sufficient, and a few extra uops in the block before the one you care about may not be a problem at all.

But in some situations, you might really want to optimize decode throughput / uop-cache usage / total uops for the instructions before the block you want aligned.

Padding instructions, like the question asked for:

Agner Fog has a whole section on this: "10.6 Making instructions longer for the sake of alignment" in his "Optimizing subroutines in assembly language" guide. (The lea, push r/m64, and SIB ideas are from there, and I copied a sentence / phrase or two, otherwise this answer is my own work, either different ideas or written before checking Agner's guide.)

It hasn't been updated for current CPUs, though: lea eax, [rbx + dword 0] has more downsides than it used to vs mov eax, ebx, because you miss out on zero-latency / no execution unit mov. If it's not on the critical path, go for it though. Simple lea has fairly good throughput, and an LEA with a large addressing mode (and maybe even some segment prefixes) can be better for decode / execute throughput than mov + nop.

Use the general form instead of the short form (no ModR/M) of instructions like push reg or mov reg,imm. e.g. use 2-byte push r/m64 for push rbx. Or use an equivalent instruction that is longer, like add dst, 1 instead of inc dst, in cases where there are no perf downsides to inc so you were already using inc.

Use a SIB byte. You can get NASM to do that by using a single register as an index, like mov eax, [nosplit rbx*1], but that hurts the load-use latency vs. simply encoding mov eax, [rbx] with a SIB byte. Indexed addressing modes have other downsides on SnB-family, like un-lamination and not using port7 for stores.

So it's best to just encode base=rbx + disp0/8/32=0 using ModR/M + SIB with no index reg. (The SIB encoding for "no index" is the encoding that would otherwise mean idx=RSP). [rsp + x] addressing modes require a SIB already (base=RSP is the escape code that means there's a SIB), and that appears all the time in compiler-generated code. So there's very good reason to expect this to be fully efficient to decode and execute (even for base registers other than RSP) now and in the future. NASM syntax can't express this, so you'd have to encode manually. GNU gas Intel syntax from objdump -d says 8b 04 23 mov eax,DWORD PTR [rbx+riz*1] for Agner Fog's example 10.20. (riz is a fictional index-zero notation that means there's a SIB with no index). I haven't tested if GAS accepts that as input.
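Since NASM can't express the no-index SIB form, the bytes have to come from somewhere. The following is a minimal Python sketch, not a real assembler: the function name encode_mov_load_sib and the register tables are hypothetical helpers made up for this illustration. It only handles the eight legacy registers (no REX prefix) and a zero displacement of 0, 1 or 4 bytes, and it rejects rbp as a base with no displacement because that ModR/M + SIB combination means "disp32, no base".

```python
# Register numbers for the low eight registers (no REX prefix needed).
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}
BASE64 = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
          "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7}


def encode_mov_load_sib(dst, base, disp_bytes=0):
    """Encode `mov dst32, [base]` (opcode 8B /r) with a SIB byte and no index.

    disp_bytes selects a zero displacement of 0, 1 or 4 bytes, giving a
    3-, 4- or 7-byte encoding of the same load.
    """
    mod = {0: 0b00, 1: 0b01, 4: 0b10}[disp_bytes]
    if base == "rbp" and disp_bytes == 0:
        raise ValueError("mod=00 with SIB base=101 means disp32 and no base")
    modrm = (mod << 6) | (REG32[dst] << 3) | 0b100   # rm=100: a SIB byte follows
    sib = (0b00 << 6) | (0b100 << 3) | BASE64[base]  # index=100: no index register
    return bytes([0x8B, modrm, sib]) + bytes(disp_bytes)
```

With these assumptions, encode_mov_load_sib("eax", "rbx") produces 8b 04 23, the same bytes objdump shows for [rbx+riz*1], and the result can be pasted into a db line. As always with hand encoding, disassemble afterwards to confirm.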

Use an imm32 and/or disp32 form of an instruction that only needed imm8 or disp0/disp8. Agner Fog's testing of Sandybridge's uop cache (microarch guide table 9.1) indicates that the actual value of an immediate / displacement is what matters, not the number of bytes used in the instruction encoding. I don't have any info on Ryzen's uop cache.

So NASM imul eax, [dword 4 + rdi], strict dword 13 (10 bytes: opcode + modrm + disp32 + imm32) would use the 32small, 32small category and take 1 entry in the uop cache, unlike if either the immediate or disp32 actually had more than 16 significant bits. (Then it would take 2 entries, and loading it from the uop cache would take an extra cycle.)

According to Agner's table, 8/16/32small are always equivalent for SnB. And addressing modes with a register are the same whether there's no displacement at all, or whether it's 32small, so mov dword [dword 0 + rdi], 123456 takes 2 entries, just like mov dword [rdi], 123456789. I hadn't realized [rdi] + full imm32 took 2 entries, but apparently that is the case on SnB.
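To make the imm8 vs. imm32 choice concrete, here is a small Python sketch that emits both ModR/M encodings of add reg32, imm (83 /0 ib and 81 /0 id). The helper name encode_add_imm and its force_imm32 parameter are hypothetical, chosen for this example; it covers only the eight legacy 32-bit registers and signed 32-bit immediates.

```python
import struct

REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def encode_add_imm(reg, value, force_imm32=False):
    """Encode `add reg32, value` with the ModR/M forms 83 /0 ib or 81 /0 id.

    value must fit in a signed 32-bit integer. force_imm32 plays the role
    of NASM's `strict dword`: it picks the 6-byte form even for small values.
    """
    modrm = 0xC0 | REG32[reg]              # mod=11 (register direct), reg field = /0
    if -128 <= value <= 127 and not force_imm32:
        return bytes([0x83, modrm]) + struct.pack("<b", value)   # 3 bytes
    return bytes([0x81, modrm]) + struct.pack("<i", value)       # 6 bytes
```

Under these assumptions, encode_add_imm("eax", 1, force_imm32=True) gives 81 c0 01 00 00 00, which is the hand-encoded db 0x81, 0xC0, 1,0,0,0 line from the NASM listing in the assembler syntax section. The sketch deliberately ignores the shorter EAX-only 05 id form.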

Use jmp / jcc rel32 instead of rel8. Ideally try to expand instructions in places that don't require longer jump encodings outside the region you're expanding. Pad after jump targets for earlier forward jumps, pad before jump targets for later backward jumps, if they're close to needing a rel32 somewhere else. i.e. try to avoid padding between a branch and its target, unless you want that branch to use a rel32 anyway.
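As a rough illustration of the rel8 vs. rel32 trade-off, the Python sketch below encodes an unconditional jmp either way. The function encode_jmp and its parameters are hypothetical; addresses are plain integers, and the displacement is measured from the end of the jump instruction. It does not model the knock-on effect that growing one jump can push other branches out of rel8 range.

```python
import struct


def encode_jmp(insn_addr, target, force_rel32=False):
    """Encode `jmp target` for a jump instruction located at insn_addr.

    Uses the 2-byte EB cb form when the target is in rel8 range, unless
    force_rel32 asks for the 5-byte E9 cd form (3 bytes of padding).
    """
    rel8 = target - (insn_addr + 2)        # EB cb is 2 bytes long
    if -128 <= rel8 <= 127 and not force_rel32:
        return bytes([0xEB]) + struct.pack("<b", rel8)
    rel32 = target - (insn_addr + 5)       # E9 cd is 5 bytes long
    return bytes([0xE9]) + struct.pack("<i", rel32)
```

Note that the two forms carry different displacements for the same target, because the longer instruction ends 3 bytes later. That is one reason to let the assembler do this where possible (for example jmp near label in NASM) rather than patching bytes by hand.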

You might be tempted to encode mov eax, [symbol] as 6-byte a32 mov eax, [abs symbol] in 64-bit code, using an address-size prefix to use a 32-bit absolute address. But this does cause a Length-Changing-Prefix stall when it decodes on Intel CPUs. Fortunately, none of NASM/YASM / gas / clang do this code-size optimization by default if you don't explicitly specify a 32-bit address-size, instead using 7-byte mov r32, r/m32 with a ModR/M+SIB+disp32 absolute addressing mode for mov eax, [abs symbol].

In 64-bit position-dependent code, absolute addressing is a cheap way to use 1 extra byte vs. RIP-relative. But note that 32-bit absolute + immediate takes 2 cycles to fetch from uop cache, unlike RIP-relative + imm8/16/32 which takes only 1 cycle even though it still uses 2 entries for the instruction. (e.g. for a mov-store or a cmp). So cmp [abs symbol], 123 is slower to fetch from the uop cache than cmp [rel symbol], 123, even though both take 2 entries each. Without an immediate, there's no extra cost for

Note that PIE executables allow ASLR even for the executable, and are the default in many Linux distros, so if you can keep your code PIC without any perf downsides, then that's preferable.

Use a REX prefix when you don't need one, e.g. db 0x40 / add eax, ecx.

It's not in general safe to add prefixes like rep that current CPUs ignore, because they might mean something else in future ISA extensions.

Repeating the same prefix is sometimes possible (not with REX, though). For example, db 0x66, 0x66 / add ax, bx gives the instruction 3 operand-size prefixes, which I think is always strictly equivalent to one copy of the prefix. Up to 3 prefixes is the limit for efficient decoding on some CPUs. But this only works if you have a prefix you can use in the first place; you usually aren't using 16-bit operand-size, and generally don't want 32-bit address-size (although it's safe for accessing static data in position-dependent code).

A ds or ss prefix on an instruction that accesses memory is a no-op, and probably doesn't cause any slowdown on any current CPUs. (@prl suggested this in comments).
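The prefix ideas above can be sketched as a tiny byte-level helper. This is an illustrative Python sketch only: count_legacy_prefixes and pad_with_ds are hypothetical names, the budget of 3 prefixes is an assumption taken from the limit mentioned above, and the counter treats mandatory SSE prefixes (66/F2/F3) as ordinary prefixes while ignoring REX, VEX and the 0F escape byte. It does not check that the instruction actually accesses memory, so that remains the caller's job.

```python
# Legacy prefix bytes: segment overrides, operand/address size, lock/rep.
LEGACY_PREFIXES = {0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65,
                   0x66, 0x67, 0xF0, 0xF2, 0xF3}


def count_legacy_prefixes(insn):
    """Count the leading legacy prefix bytes of an encoded instruction."""
    n = 0
    while n < len(insn) and insn[n] in LEGACY_PREFIXES:
        n += 1
    return n


def pad_with_ds(insn, extra, max_prefixes=3):
    """Prepend `extra` DS (0x3E) prefixes, refusing to exceed max_prefixes."""
    if count_legacy_prefixes(insn) + extra > max_prefixes:
        raise ValueError("too many prefixes for efficient decode on some CPUs")
    # Legacy prefixes go first, so any REX byte stays right before the opcode.
    return bytes([0x3E]) * extra + bytes(insn)
```

For example, padding 8b 47 00 (mov eax, [byte 0 + rdi]) by one byte gives 3e 8b 47 00, the same bytes NASM emits for the explicit ds: override in the listing in the assembler syntax section.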

In fact, Agner Fog's microarch guide uses a ds prefix on a movq [esi+ecx],mm0 in "Example 7.1. Arranging IFETCH blocks" to tune a loop for PII/PIII (no loop buffer or uop cache), speeding it up from 3 clocks per iteration to 2.

Some CPUs (like AMD) decode slowly when instructions have more than 3 prefixes. On some CPUs, this includes the mandatory prefixes in SSE2 and especially SSSE3 / SSE4.1 instructions. In Silvermont, even the 0F escape byte counts.

AVX instructions can use a 2 or 3-byte VEX prefix. Some instructions require a 3-byte VEX prefix (2nd source is x/ymm8-15, or mandatory prefixes for SSSE3 or later). But an instruction that could have used a 2-byte prefix can always be encoded with a 3-byte VEX. NASM or GAS {vex3} vxorps xmm0,xmm0. If AVX512 is available, you can use 4-byte EVEX as well.
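When the assembler's {vex3} override isn't available, the re-encoding can be done mechanically. The Python sketch below shows the bit shuffling under the assumption that the input is a valid 64-bit-mode instruction starting with a 2-byte VEX prefix (C5); the function name vex2_to_vex3 is hypothetical.

```python
def vex2_to_vex3(insn):
    """Re-encode an instruction that starts with a 2-byte VEX prefix (C5 xx)
    using the equivalent 3-byte VEX prefix (C4 xx xx), adding one byte.
    """
    if insn[0] != 0xC5:
        raise ValueError("expected a 2-byte VEX prefix")
    p = insn[1]                    # bit 7: ~R, bits 6-3: ~vvvv, bit 2: L, bits 1-0: pp
    byte1 = (p & 0x80) | 0x61      # keep ~R; ~X=1, ~B=1, opcode map 00001 (0F)
    byte2 = p & 0x7F               # W=0, then ~vvvv, L and pp unchanged
    return bytes([0xC4, byte1, byte2]) + bytes(insn[2:])
```

Applied to c5 f0 58 da (vaddps xmm3, xmm1, xmm2), this yields c4 e1 70 58 da, which matches what NASM produces for the {vex3} form in the listing in the assembler syntax section.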

Use 64-bit operand-size for mov even when you don't need it, for example mov rax, strict dword 1 forces the 7-byte sign-extended-imm32 encoding in NASM, which would normally optimize it to 5-byte mov eax, 1.
    mov    eax, 1                ; 5 bytes to encode (B8 imm32)
    mov    rax, strict dword 1   ; 7 bytes: REX mov r/m64, sign-extended-imm32.
    mov    rax, strict qword 1   ; 10 bytes to encode (REX B8 imm64).  movabs mnemonic for AT&T.
You could even use mov reg, 0 instead of xor reg,reg.

mov r64, imm64 fits efficiently in the uop cache when the constant is actually small (fits in 32-bit sign extended.) 1 uop-cache entry, and load-time = 1, the same as for mov r32, imm32. Decoding a giant instruction means there's probably not room in a 16-byte decode block for 3 other instructions to decode in the same cycle, unless they're all 2-byte. Possibly lengthening multiple other instructions slightly can be better than having one long instruction.
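The three mov encodings can be generated side by side to see exactly where the extra bytes go. This is a minimal Python sketch with a hypothetical helper name, encode_mov_imm; it is restricted to the eight legacy registers and to constants in the range 0..0x7FFFFFFF, where zero-extension and sign-extension give the same 64-bit result, so all three encodings load the same value.

```python
import struct

REG64 = {"rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
         "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7}


def encode_mov_imm(reg, value, length):
    """Load a small non-negative constant into reg using 5, 7 or 10 bytes."""
    r = REG64[reg]
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("sketch only handles constants where all forms agree")
    if length == 5:     # mov r32, imm32: B8+r id (zero-extends into r64)
        return bytes([0xB8 + r]) + struct.pack("<I", value)
    if length == 7:     # REX.W mov r/m64, sign-extended imm32: 48 C7 /0 id
        return bytes([0x48, 0xC7, 0xC0 | r]) + struct.pack("<i", value)
    if length == 10:    # REX.W mov r64, imm64: 48 B8+r io
        return bytes([0x48, 0xB8 + r]) + struct.pack("<Q", value)
    raise ValueError("length must be 5, 7 or 10")
```

Given the decode-block concern above, the 7-byte form is often the gentler choice when only 2 bytes of padding are needed; the 10-byte form is best kept for cases where a single long instruction is acceptable.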

Decode penalties for extra prefixes:

P5: prefixes prevent pairing, except for address/operand-size on PMMX only.

PPro to PIII: There is always a penalty if an instruction has more than one prefix. This penalty is usually one clock per extra prefix. (Agner's microarch guide, end of section 6.3)

Silvermont: it's probably the tightest constraint on which prefixes you can use, if you care about it. Decode stalls on more than 3 prefixes, counting mandatory prefixes + 0F escape byte. SSSE3 and SSE4 instructions already have 3 prefixes so even a REX makes them slow to decode.

Some AMD: maybe a 3-prefix limit, not including escape bytes, and maybe not including mandatory prefixes for SSE instructions.

... TODO: finish this section. Until then, consult Agner Fog's microarch guide.

After hand-encoding stuff, always disassemble your binary to make sure you got it right. It's unfortunate that NASM and other assemblers don't have better support for choosing cheap padding over a region of instructions to reach a given alignment boundary.
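In the absence of assembler support, the bookkeeping can be scripted. The Python sketch below is one possible way to plan padding, not an established tool: plan_padding, insn_lengths and max_extra are hypothetical names, and max_extra (how many bytes each instruction can cheaply grow by) has to be supplied by the caller from the techniques above. It spreads the padding one byte at a time across instructions, following the earlier point that lengthening several instructions slightly can beat one very long instruction, and it respects the 15-byte x86 instruction length limit. It does not account for jump displacements changing size.

```python
def plan_padding(start_addr, insn_lengths, max_extra, align=16):
    """Return extra bytes per instruction so the block ends on an `align` boundary.

    insn_lengths[i] is the current length of instruction i and max_extra[i]
    is how many bytes it may grow by. Returns None if the block cannot
    absorb all the padding that is needed.
    """
    end = start_addr + sum(insn_lengths)
    needed = -end % align                  # bytes missing to reach the boundary
    extra = [0] * len(insn_lengths)
    while needed > 0:
        progressed = False
        # Round-robin: give each instruction at most one more byte per pass.
        for i in range(len(extra)):
            can_grow = extra[i] < max_extra[i] and insn_lengths[i] + extra[i] < 15
            if needed > 0 and can_grow:
                extra[i] += 1
                needed -= 1
                progressed = True
        if not progressed:
            return None                    # fall back to NOPs for the remainder
    return extra
```

For a block of three instructions of 3, 2 and 5 bytes starting at address 0, with up to 2 spare bytes each, the plan is 2 extra bytes on every instruction, bringing the block from 10 to 16 bytes.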

Assembler syntax

NASM has some encoding override syntax: {vex3} and {evex} prefixes, NOSPLIT, and strict byte / dword, and forcing disp8/disp32 inside addressing modes. Note that [rdi + byte 0] isn't allowed, the byte keyword has to come first. [byte rdi + 0] is allowed, but I think that looks weird.

Listing from nasm -l/dev/stdout -felf64 padding.asm
    line  addr    machine-code bytes      source line
    num

     4 00000000 0F57C0                  xorps  xmm0,xmm0      ; SSE1 *ps instructions are 1-byte shorter
     5 00000003 660FEFC0                pxor   xmm0,xmm0
     6
     7 00000007 C5F058DA                vaddps xmm3, xmm1,xmm2
     8 0000000B C4E17058DA              {vex3} vaddps xmm3, xmm1,xmm2
     9 00000010 62F1740858DA            {evex} vaddps xmm3, xmm1,xmm2
    10
    11
    12 00000016 FFC0                    inc  eax
    13 00000018 83C001                  add  eax, 1
    14 0000001B 4883C001                add  rax, 1
    15 0000001F 678D4001                lea  eax, [eax+1]     ; runs on fewer ports and doesn't set flags
    16 00000023 67488D4001              lea  rax, [eax+1]     ; address-size and REX.W
    17 00000028 0501000000              add  eax, strict dword 1   ; using the EAX-only encoding with no ModR/M
    18 0000002D 81C001000000            db 0x81, 0xC0, 1,0,0,0     ; add eax,0x1 using the ModR/M imm32 encoding
    19 00000033 81C101000000            add  ecx, strict dword 1   ; non-eax must use the ModR/M encoding
    20 00000039 4881C101000000          add  rcx, strict qword 1   ; YASM requires strict dword for the immediate, because it's still 32b
    21 00000040 67488D8001000000        lea  rax, [dword eax+1]
    22
    23
    24 00000048 8B07                    mov  eax, [rdi]
    25 0000004A 8B4700                  mov  eax, [byte 0 + rdi]
    26 0000004D 3E8B4700                mov  eax, [ds: byte 0 + rdi]
    26          ******************       warning: ds segment base generated, but will be ignored in 64-bit mode
    27 00000051 8B8700000000            mov  eax, [dword 0 + rdi]
    28 00000057 8B043D00000000          mov  eax, [NOSPLIT dword 0 + rdi*1]  ; 1c extra latency on SnB-family for non-simple addressing mode

GAS has encoding-override pseudo-prefixes {vex3}, {evex}, {disp8}, and {disp32}. These replace the now-deprecated .s, .d8 and .d32 suffixes.

GAS doesn't have an override to immediate size, only displacements.

GAS does let you add an explicit ds prefix, with ds mov src,dst.

gcc -g -c padding.S && objdump -drwC padding.o -S, with hand-editing:
      # no CPUs have separate ps vs. pd domains, so there's no penalty for mixing ps and pd loads/shuffles
      0:   0f 28 07                movaps (%rdi),%xmm0
      3:   66 0f 28 07             movapd (%rdi),%xmm0

      7:   0f 58 c8                addps  %xmm0,%xmm1        # not equivalent for SSE/AVX transitions, but sometimes safe to mix with AVX-128

      a:   c5 e8 58 d9             vaddps %xmm1,%xmm2, %xmm3  # default {vex2}
      e:   c4 e1 68 58 d9          {vex3} vaddps %xmm1,%xmm2, %xmm3
     13:   62 f1 6c 08 58 d9       {evex} vaddps %xmm1,%xmm2, %xmm3

     19:   ff c0                   inc    %eax
     1b:   83 c0 01                add    $0x1,%eax
     1e:   48 83 c0 01             add    $0x1,%rax
     22:   67 8d 40 01             lea  1(%eax), %eax     # runs on fewer ports and doesn't set flags
     26:   67 48 8d 40 01          lea  1(%eax), %rax     # address-size and REX
             # no equivalent for  add  eax, strict dword 1   # no-ModR/M

             .byte 0x81, 0xC0; .long 1    # add    eax,0x1  using the ModR/M imm32 encoding
     2b:   81 c0 01 00 00 00       add    $0x1,%eax     # manually encoded
     31:   81 c1 d2 04 00 00       add    $0x4d2,%ecx   # large immediate, can't get GAS to encode this way with $1 other than doing it manually

     37:   67 8d 80 01 00 00 00      {disp32} lea  1(%eax), %eax
     3e:   67 48 8d 80 01 00 00 00   {disp32} lea  1(%eax), %rax

            mov  0(%rdi), %eax      # the 0 optimizes away
     46:   8b 07                   mov    (%rdi),%eax
    {disp8} mov  (%rdi), %eax       # adds a disp8 even if you omit the 0
     48:   8b 47 00                mov    0x0(%rdi),%eax
    {disp8} ds mov  (%rdi), %eax    # with a DS prefix
     4b:   3e 8b 47 00             mov    %ds:0x0(%rdi),%eax
    {disp32} mov  (%rdi), %eax
     4f:   8b 87 00 00 00 00       mov    0x0(%rdi),%eax
    {disp32} mov  0(,%rdi,1), %eax       # 1c extra latency on SnB-family for non-simple addressing mode
     55:   8b 04 3d 00 00 00 00    mov    0x0(,%rdi,1),%eax

GAS is strictly less powerful than NASM for expressing longer-than-needed encodings.
