Is a sign or zero extension required when adding a 32-bit offset to a pointer for the x86-64 ABI?

Question

Summary: I was looking at assembly code to guide my optimizations and I see lots of sign or zero extensions when adding an int32 to a pointer.

void Test(int *out, int offset)
{
    out[offset] = 1;
}
-------------------------------------
movslq  %esi, %rsi
movl    $1, (%rdi,%rsi,4)
ret

At first, I thought my compiler was challenged at adding 32-bit to 64-bit integers, but I've confirmed this behavior with Intel ICC 11, ICC 14, and GCC 5.3.

This thread confirms my findings, but it's not clear whether the sign or zero extension is necessary. It would only be necessary if the upper 32 bits aren't already set. But wouldn't the x86-64 ABI be smart enough to require that?

I'm kind of reluctant to change all my pointer offsets to ssize_t because register spills will increase the cache footprint of the code.

Recommended answer

Yes, you have to assume that the high 32 bits of an arg or return-value register contain garbage. On the flip side, you are allowed to leave garbage in the high 32 when calling or returning yourself. That is, the burden is on the receiving side to ignore the high bits, not on the passing side to clean them.

You need to sign or zero extend to 64 bits to use the value in a 64-bit effective address. In the x32 ABI, gcc frequently uses 32-bit effective addresses instead of using 64-bit operand-size for every instruction modifying a potentially-negative integer used as an array index.
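As a rough illustration of why the callee has to do the extension itself, here is a minimal C sketch. It models the arg register as a uint64_t whose high half holds garbage; the function names (index_from_reg, store_one, check_store_one) are made up for this example, and the narrowing cast relies on the two's-complement wraparound that gcc, clang and ICC all provide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a 64-bit arg register whose low 32 bits hold an int
   and whose high 32 bits hold whatever the caller happened to leave there. */
int64_t index_from_reg(uint64_t reg)
{
    /* Roughly what movslq %esi, %rsi does: keep the low 32 bits, then
       sign-extend. The uint32_t -> int32_t conversion is
       implementation-defined before C23; mainstream compilers wrap. */
    return (int64_t)(int32_t)(uint32_t)reg;
}

/* Store through out + 4*index, trusting only the low half of the register. */
void store_one(int *out, uint64_t offset_reg)
{
    out[index_from_reg(offset_reg)] = 1;
}

/* Usage: offset -1 in the low half, garbage in the high half. */
void check_store_one(void)
{
    int buf[4] = {0, 0, 0, 0};
    uint64_t reg = 0xDEADBEEF00000000ULL | 0xFFFFFFFFULL;
    store_one(buf + 2, reg);
    assert(buf[1] == 1);   /* (buf + 2)[-1] is buf[1] */
}
```

Without the extension step, the full 64-bit register value would be used as the index and the address would be wildly out of range.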

The x86-64 SysV ABI only says anything about which parts of a register are zeroed for _Bool (aka bool). Page 20:

When a value of type _Bool is returned or passed in a register or on the stack, bit 0 contains the truth value and bits 1 to 7 shall be zero (footnote 14: Other bits are left unspecified, hence the consumer side of those values can rely on it being 0 or 1 when truncated to 8 bit)
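The quoted rule can be sketched in C as follows. This is only a model of what a consumer may assume, with the register represented as a uint64_t; bool_from_reg and check_bool_from_reg are hypothetical names, not anything from the ABI document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a callee reading a _Bool arg from a 64-bit register. */
bool bool_from_reg(uint64_t reg)
{
    /* Bits 8..63 are unspecified, so truncate to 8 bits first.
       Bits 1..7 are guaranteed zero, so the byte is exactly 0 or 1. */
    uint8_t low = (uint8_t)reg;
    return low;   /* same as (low & 1) under the ABI guarantee */
}

/* Usage: garbage above bit 7 must not change the truth value. */
void check_bool_from_reg(void)
{
    assert(bool_from_reg(0xABCDEF0000000001ULL));
    assert(!bool_from_reg(0xFFFFFFFFFFFFFF00ULL));
}
```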

Also, the stuff about varargs functions says that %al, not the whole %rax, holds the number of FP register args.

There's an open GitHub issue about this exact question on the GitHub page for the x32 and x86-64 ABI documents (https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI).

The ABI doesn't place any further requirements or guarantees on the contents of the high parts of integer or vector registers holding args or return values, so there aren't any. I have confirmation of this fact via email from Michael Matz (one of the ABI maintainers): "Generally, if the ABI doesn't say something is specified, you cannot rely on it."

He also confirmed that e.g. clang >= 3.6's use of an addps that could slow down or raise extra FP exceptions with garbage in high elements is a bug (which reminds me I should report that). He adds that this was an issue once with an AMD implementation of a glibc math function. Normal C code can leave garbage in high elements of vector regs when passing scalar double or float args.

Narrow function arguments, even _Bool/bool, are sign or zero-extended to 32 bits. clang even makes code that depends on this behaviour (since 2007, apparently). ICC17 doesn't do it, so ICC and clang are not ABI-compatible, even for C. Don't call clang-compiled functions from ICC-compiled code for the x86-64 SysV ABI, if any of the first 6 integer args are narrower than 32-bit.

This doesn't apply to return values, only args: gcc and clang both assume that return-values they receive only have valid data up to the width of the type. gcc will make functions returning char that leave garbage in the high 24 bits of %eax, for example.
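To make the incompatibility concrete, here is a small C model of two callees taking an unsigned short in a register. It is a sketch under assumptions: the register is a uint64_t, and the function names are invented for this example. One callee re-extends the arg itself, the other trusts bits 16..31 to be zero already, which is the assumption attributed to clang above.

```c
#include <assert.h>
#include <stdint.h>

/* Callee that zero-extends its unsigned short arg itself (the safe assumption). */
uint32_t add_1234_defensive(uint64_t arg_reg)
{
    return (uint16_t)arg_reg + 1234u;
}

/* Callee that trusts bits 16..31 to be zero already.
   It still ignores bits 32..63. */
uint32_t add_1234_trusting(uint64_t arg_reg)
{
    return (uint32_t)arg_reg + 1234u;
}

/* Usage: the two agree only when the caller extended the arg to 32 bits. */
void check_narrow_arg(void)
{
    /* Garbage only above bit 31: both callees agree. */
    assert(add_1234_defensive(0xDEADBEEF00000007ULL) == 1241u);
    assert(add_1234_trusting(0xDEADBEEF00000007ULL) == 1241u);

    /* Garbage in bits 16..31 (a caller that did not extend): they differ. */
    assert(add_1234_defensive(0x00010007ULL) == 1241u);
    assert(add_1234_trusting(0x00010007ULL) == 66777u);
}
```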

A recent thread on the ABI discussion group was a proposal to clarify the rules for extending 8 and 16-bit args to 32 bits, and maybe actually modify the ABI to require this. The major compilers (except ICC) already do it, but it would be a change to the contract between callers and callees.

Here's an example (check it out with other compilers or tweak the code on the Godbolt Compiler Explorer, where I've included many simple examples that each demonstrate only one piece of the puzzle, as well as this one, which demonstrates a lot):

extern short fshort(short a);
extern unsigned fuint(unsigned int a);

extern unsigned short array_us[];
unsigned short lookupu(unsigned short a) {
  unsigned int a_int = a + 1234;
  a_int += fshort(a);                 // NOTE: not the same calls as the signed lookup
  return array_us[a + fuint(a_int)];
}

# clang-3.8 -O3  for x86-64.    arg in %rdi.  (Actually in %di, zero-extended to %edi by our caller)
lookupu(unsigned short):
    pushq   %rbx                      # save a call-preserved reg for our own use.  (Also aligns the stack for another call)
    movl    %edi, %ebx                # If we didn't assume our arg was already zero-extended, this would be a movzwl (aka movzx)
    movswl  %bx, %edi                 # sign-extend to call a function that takes signed short instead of unsigned short.
    callq   fshort(short)
    cwtl                              # Don't trust the upper bits of the return value.  (This is cwde in Intel syntax.  eax = sign_extend(ax))
    leal    1234(%rbx,%rax), %edi     # this is the point where we'd get a wrong answer if our arg wasn't zero-extended.  gcc doesn't assume this, but clang does.
    callq   fuint(unsigned int)
    addl    %ebx, %eax                # zero-extends eax to 64bits
    movzwl  array_us(%rax,%rax), %eax # This zero-extension (instead of just writing ax) is *not* for correctness, just for performance: avoid partial-register slowdowns if the caller reads eax
    popq    %rbx
    retq

Note: movzwl array_us(,%rax,2) would be equivalent, but no smaller. If we could depend on the high bits of %rax being zeroed in fuint()'s return value, the compiler could have used array_us(%rbx, %rax, 2) instead of using the add insn.

Leaving the high32 undefined is intentional, and I think it's a good design decision.

Ignoring the high 32 is free when doing 32-bit ops. A 32-bit operation zero-extends its result to 64-bit for free, so you only need an extra mov edx, edi or something if you could have used the reg directly in a 64-bit addressing mode or 64-bit operation.
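A minimal C model of that free zero-extension, assuming registers are represented as uint64_t values (add32_then_index and check_add32 are hypothetical names for this sketch):

```c
#include <assert.h>
#include <stdint.h>

/* Models a 32-bit add such as addl %ebx, %eax: the inputs' high halves are
   ignored, and writing the 32-bit result zeroes the upper 32 bits. */
uint64_t add32_then_index(uint64_t a_reg, uint64_t b_reg)
{
    uint32_t sum = (uint32_t)a_reg + (uint32_t)b_reg;  /* wraps mod 2^32 */
    return sum;                                        /* zero-extended */
}

/* Usage: garbage in the high halves never reaches the result. */
void check_add32(void)
{
    assert(add32_then_index(0xAAAAAAAA00000005ULL, 0xBBBBBBBB00000007ULL) == 12);
    assert(add32_then_index(0xFFFFFFFFULL, 2) == 1);
}
```

The result is already usable as a 64-bit index, which is why an unsigned offset that goes through any 32-bit arithmetic needs no separate extension instruction.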

Some functions won't save any insns from having their args already extended to 64-bit, so it's a potential waste for callers to always have to do it. Some functions use their args in a way that requires the opposite extension from the signedness of the arg, so leaving it up to the callee to decide what to do works well.

Zero-extending to 64-bit regardless of signedness would be free for most callers, though, and might have been a good ABI design choice. Since arg regs are clobbered anyway, the caller already needs to do something extra if it wants to keep a full 64-bit value across a call where it only passes the low 32. Thus it usually only costs extra when you need a 64-bit result for something before the call, and then pass a truncated version to a function. In x86-64 SysV, you can generate your result in RDI and use it, and then call foo, which will only look at EDI.

16-bit and 8-bit operand-sizes often lead to false dependencies (AMD, P4, or Silvermont, and later SnB-family), or partial-register stalls (pre SnB) or minor slowdowns (Sandybridge), so the undocumented behaviour of requiring 8 and 16b types to be extended to 32b for arg-passing makes some sense. See Why doesn't GCC use partial registers? for more details on those microarchitectures.

This is probably not a big deal for code size in real code, since tiny functions are / should be static inline, and arg-handling insns are a small part of bigger functions. Inter-procedural optimization can remove overhead between calls when the compiler can see both definitions, even without inlining. (IDK how well compilers do at this in practice.)

I'm not sure whether changing function signatures to use uintptr_t will help or hurt overall performance with 64-bit pointers. I wouldn't worry about stack space for scalars. In most functions, the compiler pushes/pops enough call-preserved registers (like %rbx and %rbp) to keep its own variables live in registers. A tiny bit extra space for 8B spills instead of 4B is negligible.

As far as code-size, working with 64-bit values requires a REX prefix on some insns that wouldn't have otherwise needed one. Zero-extending to 64-bit happens for free if any operations are required on a 32-bit value before it gets used as an array index. Sign-extension always takes an extra instruction if it's required. But compilers can sign-extend and work with it as a 64-bit signed value from the start to save instructions, at the cost of needing more REX prefixes. (Signed overflow is UB, not defined to wrap around, so compilers can often avoid redoing sign-extension inside a loop with an int i that uses arr[i].)
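For comparing the two index types on a compiler explorer, a pair of loops like the following may be handy. This is only a sketch: the function names are invented here, and what code a given compiler emits for each is something to check rather than assume.

```c
#include <assert.h>
#include <stddef.h>

/* int index: the compiler may sign-extend i once and keep a 64-bit
   induction variable, because signed overflow of i would be UB. */
long sum_int_index(const int *arr, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += arr[i];
    return total;
}

/* pointer-width index: no extension is needed at all, at the cost of
   possibly more REX prefixes on the instructions that touch i and n. */
long sum_wide_index(const int *arr, ptrdiff_t n)
{
    long total = 0;
    for (ptrdiff_t i = 0; i < n; i++)
        total += arr[i];
    return total;
}

/* Usage: both variants compute the same result. */
void check_sums(void)
{
    int a[5] = {1, 2, 3, 4, 5};
    assert(sum_int_index(a, 5) == 15);
    assert(sum_wide_index(a, 5) == 15);
}
```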

Modern CPUs usually care more about insn count than insn size, within reason. Hot code will often be running from the uop cache in CPUs that have them. Still, smaller code can improve density in the uop cache. If you can save code size without using more or slower insns, then it's a win, but not usually worth sacrificing anything else for unless it's a lot of code size.

For example, one extra LEA instruction might allow [reg + disp8] addressing for a dozen later instructions, instead of disp32. Or use xor eax,eax before multiple mov [rdi+n], 0 instructions to replace the imm32=0 with a register source. (Especially if that allows micro-fusion where it wouldn't be possible with a RIP-relative + immediate, because what really matters is front-end uop count, not instruction count.)
