What methods can be used to efficiently extend instruction length on modern x86?


Question


Imagine you want to align a series of x86 assembly instructions to certain boundaries. For example, you may want to align loops to a 16 or 32-byte boundary, or pack instructions so they are efficiently placed in the uop cache or whatever.

The simplest way to achieve this is single-byte NOP instructions, followed closely by multi-byte NOPs. Although the latter is generally more efficient, neither method is free: NOPs use front-end execution resources, and also count against your 4-wide1 rename limit on modern x86.

Another option is to somehow lengthen some instructions to get the alignment you want. If this is done without introducing new stalls, it seems better than the NOP approach. How can instructions be efficiently made longer on recent x86 CPUs?

In the ideal world lengthening techniques would simultaneously be:

  • Applicable to most instructions
  • Capable of lengthening the instruction by a variable amount
  • Not stall or otherwise slow down the decoders
  • Be efficiently represented in the uop cache

It isn't likely that there is a single method that satisfies all of the above points simultaneously, so good answers will probably address various tradeoffs.


1The limit is 5 or 6 on AMD Ryzen.

解决方案

Consider mild code-golfing to shrink your code instead of expanding it, especially before a loop. e.g. xor eax,eax / cdq if you need two zeroed registers, or mov eax, 1 / lea ecx, [rax+1] to set registers to 1 and 2 in only 8 total bytes instead of 10. See Set all bits in CPU register to 1 efficiently for more about that, and Tips for golfing in x86/x64 machine code for more general ideas. Probably you still want to avoid false dependencies, though.
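As a concrete byte tally for the golfing example above (a sketch: the hex strings are the standard x86-64 encodings of these instructions, checkable with any assembler; the Python here is just counting bytes, not assembling):

```python
# Machine-code bytes for the two ways of setting registers to 1 and 2.

# Straightforward: mov eax, 1 / mov ecx, 2 -> two 5-byte B8+rd imm32 encodings
straightforward = len(bytes.fromhex("b801000000")) + len(bytes.fromhex("b902000000"))

# Golfed: mov eax, 1 / lea ecx, [rax+1] -> 5 bytes + 3 bytes (8D /r with disp8)
golfed = len(bytes.fromhex("b801000000")) + len(bytes.fromhex("8d4801"))

print(straightforward, golfed)  # 10 8, the totals quoted above
```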

Or fill extra space by creating a vector constant on the fly instead of loading it from memory. (Adding more uop-cache pressure could be worse, though, for the larger loop that contains your setup + inner loop. But it avoids d-cache misses for constants, so it has an upside to compensate for running more uops.)

If you weren't already using them to load "compressed" constants, pmovsxbd, movddup, or vpbroadcastd are longer than movaps. dword / qword broadcast loads are free (no ALU uop, just a load).

If you're worried about code alignment at all, you're probably worried about how it sits in the L1I cache or where the uop-cache boundaries are, so just counting total uops is no longer sufficient, and a few extra uops in the block before the one you care about may not be a problem at all.

But in some situations, you might really want to optimize decode throughput / uop-cache usage / total uops for the instructions before the block you want aligned.


Padding instructions, like the question asked for:

Agner Fog has a whole section on this: "10.6 Making instructions longer for the sake of alignment" in his "Optimizing subroutines in assembly language" guide. (The lea, push r/m64, and SIB ideas are from there, and I copied a sentence / phrase or two, otherwise this answer is my own work, either different ideas or written before checking Agner's guide.)

It hasn't been updated for current CPUs, though: lea eax, [rbx + dword 0] has more downsides than it used to vs mov eax, ebx, because you miss out on zero-latency / no execution unit mov. If it's not on the critical path, go for it though. Simple lea has fairly good throughput, and an LEA with a large addressing mode (and maybe even some segment prefixes) can be better for decode / execute throughput than mov + nop.

Use the general form instead of the short form (no ModR/M) of instructions like push reg or mov reg,imm. e.g. use 2-byte push r/m64 for push rbx. Or use an equivalent instruction that is longer, like add dst, 1 instead of inc dst, in cases where inc has no perf downside and you were already using it.
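In bytes, each of these substitutions buys exactly one byte (the inc/add encodings also appear in the NASM listing further down; the push encodings are the standard ones):

```python
# One-byte extensions from choosing the general form over the short form.

push_rbx_short   = bytes.fromhex("53")    # push rbx, short form (50+rd)
push_rbx_general = bytes.fromhex("fff3")  # push r/m64 (FF /6), same effect

inc_eax   = bytes.fromhex("ffc0")         # inc eax     (FF /0)
add_eax_1 = bytes.fromhex("83c001")       # add eax, 1  (83 /0 imm8)

print(len(push_rbx_general) - len(push_rbx_short))  # 1
print(len(add_eax_1) - len(inc_eax))                # 1
```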

Use SIB byte. You can get NASM to do that by using a single register as an index, like mov eax, [nosplit rbx*1] (see also), but that hurts the load-use latency vs. simply encoding mov eax, [rbx] with a SIB byte. Indexed addressing modes have other downsides on SnB-family, like un-lamination and not using port7 for stores.

So it's best to just encode base=rbx + disp0/8/32=0 using ModR/M + SIB with no index reg. (The SIB encoding for "no index" is the encoding that would otherwise mean idx=RSP). [rsp + x] addressing modes require a SIB already (base=RSP is the escape code that means there's a SIB), and that appears all the time in compiler-generated code. So there's very good reason to expect this to be fully efficient to decode and execute (even for base registers other than RSP) now and in the future. NASM syntax can't express this, so you'd have to encode manually. GNU gas Intel syntax from objdump -d says 8b 04 23 mov eax,DWORD PTR [rbx+riz*1] for Agner Fog's example 10.20. (riz is a fictional index-zero notation that means there's a SIB with no index). I haven't tested if GAS accepts that as input.
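The no-index SIB trick can be spelled out as a small encoder sketch (register and field numbers are from the standard x86 encoding tables; the output matches the `8b 04 23` bytes from Agner's example 10.20 quoted above):

```python
# ModR/M(+SIB) encodings for "mov eax, [rbx]" (opcode 8B /r).
# Register numbers: eax=0, rbx=3. rm=4 in ModR/M means "SIB follows";
# index=4 in a SIB means "no index".

def modrm(mod, reg, rm):
    return bytes([(mod << 6) | (reg << 3) | rm])

def sib(scale, index, base):
    return bytes([(scale << 6) | (index << 3) | base])

OPC = bytes([0x8B])
EAX, RBX, NOIDX, USE_SIB = 0, 3, 4, 4

plain    = OPC + modrm(0, EAX, RBX)                           # mov eax, [rbx]
with_sib = OPC + modrm(0, EAX, USE_SIB) + sib(0, NOIDX, RBX)  # same, one byte longer

print(plain.hex())     # 8b03
print(with_sib.hex())  # 8b0423
```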

Use an imm32 and/or disp32 form of an instruction that only needed imm8 or disp0/disp32. Agner Fog's testing of Sandybridge's uop cache (microarch guide table 9.1) indicates that the actual value of an immediate / displacement is what matters, not the number of bytes used in the instruction encoding. I don't have any info on Ryzen's uop cache.

So NASM imul eax, [dword 4 + rdi], strict dword 13 (10 bytes: opcode + modrm + disp32 + imm32) would use the 32small, 32small category and take 1 entry in the uop cache, unlike if either the immediate or disp32 actually had more than 16 significant bits. (Then it would take 2 entries, and loading it from the uop cache would take an extra cycle.)

According to Agner's table, 8/16/32small are always equivalent for SnB. And addressing modes with a register are the same whether there's no displacement at all, or whether it's 32small, so mov dword [dword 0 + rdi], 123456 takes 2 entries, just like mov dword [rdi], 123456789. I hadn't realized [rdi] + full imm32 took 2 entries, but apparently that is the case on SnB.
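One way to read the "small" criterion described above (this is an assumption based on the "more than 16 significant bits" wording, not a transcription of Agner's table 9.1):

```python
# Check whether an imm32/disp32 value would still land in the "32small"
# category, i.e. fits in a sign-extended 16-bit field, so the actual
# value matters rather than the number of encoded bytes.

def is_32small(value):
    return -2**15 <= value < 2**15

print(is_32small(13))         # True:  imm32 encoding, still 1 uop-cache entry
print(is_32small(123456789))  # False: would push the instruction to 2 entries
```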

Use jmp / jcc rel32 instead of rel8. Ideally try to expand instructions in places that don't require longer jump encodings outside the region you're expanding. Pad after jump targets for earlier forward jumps, pad before jump targets for later backward jumps, if they're close to needing a rel32 somewhere else. i.e. try to avoid padding between a branch and its target, unless you want that branch to use a rel32 anyway.
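The rel8/rel32 tradeoff can be sketched as a chooser (the two jmp encodings, EB cb at 2 bytes and E9 cd at 5 bytes, are standard; displacements are relative to the end of the instruction). Padding between a branch and its target is exactly what can push the displacement out of rel8 range:

```python
# Pick jmp rel8 vs. jmp rel32 for a jump at `addr` to `target`.

def encode_jmp(addr, target):
    rel8 = target - (addr + 2)            # relative to end of 2-byte form
    if -128 <= rel8 <= 127:
        return bytes([0xEB, rel8 & 0xFF])
    rel32 = target - (addr + 5)           # relative to end of 5-byte form
    return bytes([0xE9]) + (rel32 & 0xFFFFFFFF).to_bytes(4, "little")

print(encode_jmp(0x100, 0x110).hex())  # eb0e       (short form, 2 bytes)
print(encode_jmp(0x100, 0x300).hex())  # e9fb010000 (near form, 5 bytes)
```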


You might be tempted to encode mov eax, [symbol] as 6-byte a32 mov eax, [abs symbol] in 64-bit code, using an address-size prefix to use a 32-bit absolute address. But this does cause a Length-Changing-Prefix stall when it decodes on Intel CPUs. Fortunately, none of NASM/YASM / gas / clang do this code-size optimization by default if you don't explicitly specify a 32-bit address-size, instead using 7-byte mov r32, r/m32 with a ModR/M+SIB+disp32 absolute addressing mode for mov eax, [abs symbol].

In 64-bit position-dependent code, absolute addressing is a cheap way to use 1 extra byte vs. RIP-relative. But note that 32-bit absolute + immediate takes 2 cycles to fetch from uop cache, unlike RIP-relative + imm8/16/32 which takes only 1 cycle even though it still uses 2 entries for the instruction. (e.g. for a mov-store or a cmp). So cmp [abs symbol], 123 is slower to fetch from the uop cache than cmp [rel symbol], 123, even though both take 2 entries each. Without an immediate, there's no extra cost for 32-bit absolute addressing.
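The "1 extra byte" claim falls out of the two encodings of mov eax, [symbol] (the zero displacements below are placeholders for the linker-filled offset/address):

```python
# mov eax, [symbol]: RIP-relative vs. 32-bit absolute (no address-size prefix).

rip_relative = bytes.fromhex("8b05" + "00000000")    # ModR/M mod=00 rm=101: [RIP+disp32], 6 bytes
absolute     = bytes.fromhex("8b0425" + "00000000")  # ModR/M rm=100 + SIB no-index base=101: [disp32], 7 bytes

print(len(rip_relative), len(absolute))  # 6 7 -> exactly 1 extra byte
```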

Note that PIE executables allow ASLR even for the executable, and are the default in many Linux distros, so if you can keep your code PIC without any perf downsides, then that's preferable.


Use a REX prefix when you don't need one, e.g. db 0x40 / add eax, ecx.

It's not in general safe to add prefixes like rep that current CPUs ignore, because they might mean something else in future ISA extensions.

Repeating the same prefix is sometimes possible (not with REX, though). For example, db 0x66, 0x66 / add ax, bx gives the instruction 3 operand-size prefixes, which I think is always strictly equivalent to one copy of the prefix. Up to 3 prefixes is the limit for efficient decoding on some CPUs. But this only works if you have a prefix you can use in the first place; you usually aren't using 16-bit operand-size, and generally don't want 32-bit address-size (although it's safe for accessing static data in position-dependent code).
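The triple-prefix example above in bytes (add ax, bx already carries one 0x66 operand-size prefix; the db 0x66, 0x66 in front makes three, lengthening the instruction by 2 bytes without changing what it does, on CPUs that tolerate the prefix count):

```python
# Build the 3x operand-size-prefix encoding of add ax, bx.

add_ax_bx = bytes.fromhex("6601d8")        # 66 (opsize) + 01 /r, modrm=D8 (ax, bx)
padded    = bytes([0x66, 0x66]) + add_ax_bx

print(padded.hex(), len(padded))  # 66666601d8 5
```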

A ds or ss prefix on an instruction that accesses memory is a no-op, and probably doesn't cause any slowdown on any current CPUs. (@prl suggested this in comments).

In fact, Agner Fog's microarch guide uses a ds prefix on a movq [esi+ecx],mm0 in Example 7.1. Arranging IFETCH blocks to tune a loop for PII/PIII (no loop buffer or uop cache), speeding it up from 3 iterations per clock to 2.

Some CPUs (like AMD) decode slowly when instructions have more than 3 prefixes. On some CPUs, this includes the mandatory prefixes in SSE2 and especially SSSE3 / SSE4.1 instructions. In Silvermont, even the 0F escape byte counts.

AVX instructions can use a 2 or 3-byte VEX prefix. Some instructions require a 3-byte VEX prefix (2nd source is x/ymm8-15, or mandatory prefixes for SSSE3 or later). But an instruction that could have used a 2-byte prefix can always be encoded with a 3-byte VEX. NASM or GAS {vex3} vxorps xmm0,xmm0. If AVX512 is available, you can use 4-byte EVEX as well.
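Why the 3-byte form always exists for a 2-byte-encodable instruction is visible in the prefix layout itself; this sketch mechanically rewrites a 2-byte VEX into the equivalent 3-byte VEX (it assumes the 0F opcode map with W=0, which is exactly the set of instructions the 2-byte form can express; the result matches the {vex3} vaddps bytes in the NASM listing further down):

```python
# 2-byte VEX: C5 [R.vvvv.L.pp]
# 3-byte VEX: C4 [R.X.B.mmmmm] [W.vvvv.L.pp]

def vex2_to_vex3(insn):
    assert insn[0] == 0xC5          # must start with the 2-byte VEX prefix
    p = insn[1]                     # R.vvvv.L.pp
    byte1 = (p & 0x80) | 0x61       # keep R; X=1, B=1 (unused), mmmmm=00001 (0F map)
    byte2 = p & 0x7F                # W=0, vvvv.L.pp unchanged
    return bytes([0xC4, byte1, byte2]) + insn[2:]

vaddps = bytes.fromhex("c5f058da")  # vaddps xmm3, xmm1, xmm2
print(vex2_to_vex3(vaddps).hex())   # c4e17058da
```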


Use 64-bit operand-size for mov even when you don't need it, for example mov rax, strict dword 1 forces the 7-byte sign-extended-imm32 encoding in NASM, which would normally optimize it to 5-byte mov eax, 1.

mov    eax, 1                ; 5 bytes to encode (B8 imm32)
mov    rax, strict dword 1   ; 7 bytes: REX mov r/m64, sign-extended-imm32.
mov    rax, strict qword 1   ; 10 bytes to encode (REX B8 imm64).  movabs mnemonic for AT&T.

You could even use mov reg, 0 instead of xor reg,reg.

mov r64, imm64 fits efficiently in the uop cache when the constant is actually small (fits in a 32-bit sign-extended value): 1 uop-cache entry, and load-time = 1, the same as for mov r32, imm32. Decoding a giant instruction means there's probably not room in a 16-byte decode block for 3 other instructions to decode in the same cycle, unless they're all 2-byte. Possibly lengthening multiple other instructions slightly can be better than having one long instruction.


Decode penalties for extra prefixes:

  • P5: prefixes prevent pairing, except for address/operand-size on PMMX only.
  • PPro to PIII: There is always a penalty if an instruction has more than one prefix. This penalty is usually one clock per extra prefix. (Agner's microarch guide, end of section 6.3)
  • Silvermont: it's probably the tightest constraint on which prefixes you can use, if you care about it. Decode stalls on more than 3 prefixes, counting mandatory prefixes + 0F escape byte. SSSE3 and SSE4 instructions already have 3 prefixes so even a REX makes them slow to decode.
  • some AMD: maybe a 3-prefix limit, not including escape bytes, and maybe not including mandatory prefixes for SSE instructions.

... TODO: finish this section. Until then, consult Agner Fog's microarch guide.


After hand-encoding stuff, always disassemble your binary to make sure you got it right. It's unfortunate that NASM and other assemblers don't have better support for choosing cheap padding over a region of instructions to reach a given alignment boundary.


Assembler syntax

NASM has some encoding override syntax: {vex3} and {evex} prefixes, NOSPLIT, and strict byte / dword, and forcing disp8/disp32 inside addressing modes. Note that [rdi + byte 0] isn't allowed, the byte keyword has to come first. [byte rdi + 0] is allowed, but I think that looks weird.

Listing from nasm -l/dev/stdout -felf64 padding.asm

 line  addr    machine-code bytes      source line
 num

 4 00000000 0F57C0                         xorps  xmm0,xmm0    ; SSE1 *ps instructions are 1-byte shorter
 5 00000003 660FEFC0                       pxor   xmm0,xmm0
 6                                  
 7 00000007 C5F058DA                       vaddps xmm3, xmm1,xmm2
 8 0000000B C4E17058DA              {vex3} vaddps xmm3, xmm1,xmm2
 9 00000010 62F1740858DA            {evex} vaddps xmm3, xmm1,xmm2
10                                  
11                                  
12 00000016 FFC0                        inc  eax
13 00000018 83C001                      add  eax, 1
14 0000001B 4883C001                    add  rax, 1
15 0000001F 678D4001                    lea  eax, [eax+1]     ; runs on fewer ports and doesn't set flags
16 00000023 67488D4001                  lea  rax, [eax+1]     ; address-size and REX.W
17 00000028 0501000000                  add  eax, strict dword 1   ; using the EAX-only encoding with no ModR/M 
18 0000002D 81C001000000                db 0x81, 0xC0, 1,0,0,0     ; add    eax,0x1  using the ModR/M imm32 encoding
19 00000033 81C101000000                add  ecx, strict dword 1   ; non-eax must use the ModR/M encoding
20 00000039 4881C101000000              add  rcx, strict qword 1   ; YASM requires strict dword for the immediate, because it's still 32b
21 00000040 67488D8001000000            lea  rax, [dword eax+1]
22                                  
23                                  
24 00000048 8B07                        mov  eax, [rdi]
25 0000004A 8B4700                      mov  eax, [byte 0 + rdi]
26 0000004D 3E8B4700                    mov  eax, [ds: byte 0 + rdi]
26          ******************       warning: ds segment base generated, but will be ignored in 64-bit mode
27 00000051 8B8700000000                mov  eax, [dword 0 + rdi]
28 00000057 8B043D00000000              mov  eax, [NOSPLIT dword 0 + rdi*1]  ; 1c extra latency on SnB-family for non-simple addressing mode


GAS has encoding-override pseudo-prefixes {vex3}, {evex}, {disp8}, and {disp32}. These replace the now-deprecated .s, .d8 and .d32 suffixes.

GAS doesn't have an override for immediate size, only displacements.

GAS does let you add an explicit ds prefix, with ds mov src,dst.

gcc -g -c padding.S && objdump -drwC padding.o -S, with hand-editing:

  # no CPUs have separate ps vs. pd domains, so there's no penalty for mixing ps and pd loads/shuffles
  0:   0f 28 07                movaps (%rdi),%xmm0
  3:   66 0f 28 07             movapd (%rdi),%xmm0

  7:   0f 58 c8                addps  %xmm0,%xmm1        # not equivalent for SSE/AVX transitions, but sometimes safe to mix with AVX-128

  a:   c5 e8 58 d9             vaddps %xmm1,%xmm2, %xmm3  # default {vex2}
  e:   c4 e1 68 58 d9          {vex3} vaddps %xmm1,%xmm2, %xmm3
 13:   62 f1 6c 08 58 d9       {evex} vaddps %xmm1,%xmm2, %xmm3

 19:   ff c0                   inc    %eax
 1b:   83 c0 01                add    $0x1,%eax
 1e:   48 83 c0 01             add    $0x1,%rax
 22:   67 8d 40 01             lea  1(%eax), %eax     # runs on fewer ports and doesn't set flags
 26:   67 48 8d 40 01          lea  1(%eax), %rax     # address-size and REX
         # no equivalent for  add  eax, strict dword 1   # no-ModR/M

         .byte 0x81, 0xC0; .long 1    # add    eax,0x1  using the ModR/M imm32 encoding
 2b:   81 c0 01 00 00 00       add    $0x1,%eax     # manually encoded
 31:   81 c1 d2 04 00 00       add    $0x4d2,%ecx   # large immediate, can't get GAS to encode this way with $1 other than doing it manually

 37:   67 8d 80 01 00 00 00      {disp32} lea  1(%eax), %eax
 3e:   67 48 8d 80 01 00 00 00   {disp32} lea  1(%eax), %rax


        mov  0(%rdi), %eax      # the 0 optimizes away
  46:   8b 07                   mov    (%rdi),%eax
{disp8}  mov  (%rdi), %eax      # adds a disp8 even if you omit the 0
  48:   8b 47 00                mov    0x0(%rdi),%eax
{disp8}  ds mov  (%rdi), %eax   # with a DS prefix
  4b:   3e 8b 47 00             mov    %ds:0x0(%rdi),%eax
{disp32} mov  (%rdi), %eax
  4f:   8b 87 00 00 00 00       mov    0x0(%rdi),%eax
{disp32} mov  0(,%rdi,1), %eax    # 1c extra latency on SnB-family for non-simple addressing mode
  55:   8b 04 3d 00 00 00 00    mov    0x0(,%rdi,1),%eax

GAS is strictly less powerful than NASM for expressing longer-than-needed encodings.
