解决 Windows 调用约定保留 xmm 寄存器? [英] Work around windows calling convention preserving xmm registers?

查看:61
本文介绍了解决 Windows 调用约定保留 xmm 寄存器?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 Windows 上有什么方法可以解决在函数调用中保留 XMM 寄存器的要求吗?(除了全部用汇编编写)

Is there any way on Windows to work around the requirement that the XMM registers are preserved within a function call?(Aside from writing it all in assembly)

我有许多 AVX2 内在函数,不幸的是,它们因此而臃肿.

I have many AVX2 intrinsic functions that are unfortunately bloated by this.

举个例子,这将被编译器(MSVC)放置在函数的顶部:

As an example this will be placed at the top of the function by the compiler(MSVC):

00007FF9D0EBC602 vmovaps xmmword ptr [rsp+1490h],xmm6
00007FF9D0EBC60B vmovaps xmmword ptr [rsp+1480h],xmm7
00007FF9D0EBC614 vmovaps xmmword ptr [rsp+1470h],xmm8
00007FF9D0EBC61D vmovaps xmmword ptr [rsp+1460h],xmm9
00007FF9D0EBC626 vmovaps xmmword ptr [rsp+1450h],xmm10
00007FF9D0EBC62F vmovaps xmmword ptr [rsp+1440h],xmm11
00007FF9D0EBC638 vmovaps xmmword ptr [rsp+1430h],xmm12
00007FF9D0EBC641 vmovaps xmmword ptr [rsp+1420h],xmm13
00007FF9D0EBC64A vmovaps xmmword ptr [rsp+1410h],xmm14
00007FF9D0EBC653 vmovaps xmmword ptr [rsp+1400h],xmm15

00007FF9D0EBC602 vmovaps xmmword ptr [rsp+1490h],xmm6
00007FF9D0EBC60B vmovaps xmmword ptr [rsp+1480h],xmm7
00007FF9D0EBC614 vmovaps xmmword ptr [rsp+1470h],xmm8
00007FF9D0EBC61D vmovaps xmmword ptr [rsp+1460h],xmm9
00007FF9D0EBC626 vmovaps xmmword ptr [rsp+1450h],xmm10
00007FF9D0EBC62F vmovaps xmmword ptr [rsp+1440h],xmm11
00007FF9D0EBC638 vmovaps xmmword ptr [rsp+1430h],xmm12
00007FF9D0EBC641 vmovaps xmmword ptr [rsp+1420h],xmm13
00007FF9D0EBC64A vmovaps xmmword ptr [rsp+1410h],xmm14
00007FF9D0EBC653 vmovaps xmmword ptr [rsp+1400h],xmm15

然后在函数的末尾..

00007FF9D0EBD6E6 vmovaps xmm6,xmmword ptr [r11-10h]
00007FF9D0EBD6EC vmovaps xmm7,xmmword ptr [r11-20h]
00007FF9D0EBD6F2 vmovaps xmm8,xmmword ptr [r11-30h]
00007FF9D0EBD6F8 vmovaps xmm9,xmmword ptr [r11-40h]
00007FF9D0EBD6FE vmovaps xmm10,xmmword ptr [r11-50h]
00007FF9D0EBD704 vmovaps xmm11,xmmword ptr [r11-60h]
00007FF9D0EBD70A vmovaps xmm12,xmmword ptr [r11-70h]
00007FF9D0EBD710 vmovaps xmm13,xmmword ptr [r11-80h]
00007FF9D0EBD716 vmovaps xmm14,xmmword ptr [r11-90h]
00007FF9D0EBD71F vmovaps xmm15,xmmword ptr [r11-0A0h]

00007FF9D0EBD6E6 vmovaps xmm6,xmmword ptr [r11-10h]
00007FF9D0EBD6EC vmovaps xmm7,xmmword ptr [r11-20h]
00007FF9D0EBD6F2 vmovaps xmm8,xmmword ptr [r11-30h]
00007FF9D0EBD6F8 vmovaps xmm9,xmmword ptr [r11-40h]
00007FF9D0EBD6FE vmovaps xmm10,xmmword ptr [r11-50h]
00007FF9D0EBD704 vmovaps xmm11,xmmword ptr [r11-60h]
00007FF9D0EBD70A vmovaps xmm12,xmmword ptr [r11-70h]
00007FF9D0EBD710 vmovaps xmm13,xmmword ptr [r11-80h]
00007FF9D0EBD716 vmovaps xmm14,xmmword ptr [r11-90h]
00007FF9D0EBD71F vmovaps xmm15,xmmword ptr [r11-0A0h]

那是 20 条什么都不做的指令,因为我不需要保留 XMM 的状态.我有 100 个这样的函数,编译器像这样膨胀了.它们都是通过函数指针从同一个调用点调用的.

That is 20 instructions that do nothing since I have no need to preserve the state of XMM. I have 100's of these functions that the compiler is bloating up like this. They are all invoked from the same call site via function pointers.

我尝试更改调用约定(__vectorcall/cdecl/fastcall),但这似乎没有任何作用.

I tried changing the calling convention(__vectorcall/cdecl/fastcall) but that doesn't appear to do anything.

推荐答案

将 x86-64 System V 调用约定用于您希望通过函数指针拼凑起来的辅助函数.在该调用约定中,所有 xmm/ymm0..15 和 zmm0..31 都被调用破坏,因此即使是需要 5 个以上向量寄存器的辅助函数也不必保存/恢复任何.

Use the x86-64 System V calling convention for your helper functions that you want to piece together via function pointers. In that calling convention, all of xmm/ymm0..15 and zmm0..31 are call-clobbered so even helper functions that need more than 5 vector registers don't have to save/restore any.

调用它们的外部解释器函数仍应使用 Windows x64 fastcall 或 vectorcall,因此从外部看,它完全遵守该调用约定.

The outer interpreter function that calls them should still use Windows x64 fastcall or vectorcall, so from the outside it fully respects that calling convention.

这会将 XMM6..15 的所有保存/恢复提升到该调用者中,而不是每个辅助函数.这减少了静态代码大小,并通过函数指针分摊了多次调用的运行时成本.

This will hoist all the save/restore of XMM6..15 into that caller, instead of each helper function. This reduces static code size and amortizes the runtime cost over multiple calls through your function pointers.

AFAIK,MSVC 不支持将函数标记为使用 x86-64 System V 调用约定,仅支持 fastcall 与 vectorcall,因此您必须使用 clang.

AFAIK, MSVC doesn't support marking functions as using the x86-64 System V calling convention, only fastcall vs. vectorcall, so you'll have to use clang.

(ICC 有问题,无法在调用 System V ABI 函数时保存/恢复 XMM6..15).

(ICC is buggy and fails to save/restore XMM6..15 around a call to a System V ABI function).

Windows GCC 有 32 个错误-byte 堆栈对齐 用于溢出 __m256,因此将 GCC 与 -march= 结合使用通常不安全,任何包含 AVX 的内容.

Windows GCC is buggy with 32-byte stack alignment for spilling __m256, so it's not in general safe to use GCC with -march= with anything that includes AVX.

在函数和函数指针声明中使用 __attribute__((sysv_abi))__attribute__((ms_abi)).

Use __attribute__((sysv_abi)) or __attribute__((ms_abi)) on function and function-pointer declarations.

我认为 ms_abi__fastcall,而不是 __vectorcall.Clang 可能也支持 __attribute__((vectorcall)) ,但我还没有尝试过.Google 结果主要是功能请求/讨论.

I think ms_abi is __fastcall, not __vectorcall. Clang may support __attribute__((vectorcall)) as well, but I haven't tried it. Google results are mostly feature requests/discussion.

void (*helpers[10])(float *, float*) __attribute__((sysv_abi));

__attribute__((ms_abi))
void outer(float *p) {
    helpers[0](p, p+10);
    helpers[1](p, p+10);
    helpers[2](p+20, p+30);
}

编译如下 on Godbolt with clang 8.0 -O3 -march=skylake.(Godbolt 目标 Linux 上的 gcc/clang,但我在函数和函数指针上都使用了显式 ms_abisysv_abi,因此代码生成不依赖于以下事实默认值为 sysv_abi.显然,您希望使用 Windows gcc 或 clang 构建您的函数,因此对其他函数的调用将使用正确的调用约定.以及有用的目标文件格式等)

compiles as follows on Godbolt with clang 8.0 -O3 -march=skylake. (gcc/clang on Godbolt target Linux, but I used explicit ms_abi and sysv_abi on both the function and function-pointers so the code gen doesn't depend on the fact that the default is sysv_abi. Obviously you'd want to build your function with a Windows gcc or clang so calls to other functions would use the right calling convention. And a useful object-file format, etc.)

请注意,gcc/clang 为 outer() 发出代码,该代码期望 RCX (Windows x64) 中的传入指针 arg,但将其传递给 RDI 和 RSI (x86-64 System V) 中的被调用者).

Notice that gcc/clang emit code for outer() that expects the incoming pointer arg in RCX (Windows x64), but pass it to the callees in RDI and RSI (x86-64 System V).

outer:                                  # @outer
        push    r14
        push    rsi
        push    rdi
        push    rbx
        sub     rsp, 168
        vmovaps xmmword ptr [rsp + 144], xmm15 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 128], xmm14 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 112], xmm13 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 96], xmm12 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 80], xmm11 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 64], xmm10 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 48], xmm9 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 32], xmm8 # 16-byte Spill
        vmovaps xmmword ptr [rsp + 16], xmm7 # 16-byte Spill
        vmovaps xmmword ptr [rsp], xmm6 # 16-byte Spill
        mov     rbx, rcx                            # save p 
        lea     r14, [rcx + 40]
        mov     rdi, rcx
        mov     rsi, r14
        call    qword ptr [rip + helpers]
        mov     rdi, rbx
        mov     rsi, r14
        call    qword ptr [rip + helpers+8]
        lea     rdi, [rbx + 80]
        lea     rsi, [rbx + 120]
        call    qword ptr [rip + helpers+16]
        vmovaps xmm6, xmmword ptr [rsp] # 16-byte Reload
        vmovaps xmm7, xmmword ptr [rsp + 16] # 16-byte Reload
        vmovaps xmm8, xmmword ptr [rsp + 32] # 16-byte Reload
        vmovaps xmm9, xmmword ptr [rsp + 48] # 16-byte Reload
        vmovaps xmm10, xmmword ptr [rsp + 64] # 16-byte Reload
        vmovaps xmm11, xmmword ptr [rsp + 80] # 16-byte Reload
        vmovaps xmm12, xmmword ptr [rsp + 96] # 16-byte Reload
        vmovaps xmm13, xmmword ptr [rsp + 112] # 16-byte Reload
        vmovaps xmm14, xmmword ptr [rsp + 128] # 16-byte Reload
        vmovaps xmm15, xmmword ptr [rsp + 144] # 16-byte Reload
        add     rsp, 168
        pop     rbx
        pop     rdi
        pop     rsi
        pop     r14
        ret

GCC 生成的代码基本相同.但是 Windows GCC 在 AVX 上有问题.

GCC makes basically the same code. But Windows GCC is buggy with AVX.

ICC19 生成类似的代码,但没有 xmm6..15 的保存/恢复.这是一个令人窒息的错误;如果任何被调用者确实像他们被允许的那样破坏这些注册,那么从这个函数返回将违反其调用约定.

ICC19 makes similar code, but without the save/restore of xmm6..15. This is a showstopper bug; if any of the callees do clobber those regs like they're allowed to, then returning from this function will violate its calling convention.

这使得 clang 成为您可以使用的唯一编译器.没关系;叮当非常好.

This leaves clang as the only compiler that you can use. That's fine; clang is very good.

如果您的被调用者不需要所有 YMM 寄存器,那么在外部函数中保存/恢复所有这些寄存器是多余的.但是现有的工具链没有中间立场.例如,您必须在 asm 中手写 outer 以利用知道您可能的被调用者都不会破坏 XMM15 的优势.

If your callees don't need all the YMM registers, saving/restoring all of them in the outer function is overkill. But there's no middle ground with existing toolchains; you'd have to hand-write outer in asm to take advantage of knowing that none of your possible callees ever clobber XMM15 for example.

请注意,从 outer() 内部调用其他 MS-ABI 函数完全没问题.GCC/clang 也会(除了 bug 外)为此发出正确的代码,如果被调用的函数选择不销毁 xmm6..15 也没关系.

Note that calling other MS-ABI functions from inside outer() is totally fine. GCC / clang will (barring bugs) emit correct code for that too, and it's fine if a called function chooses not to destroy xmm6..15.

这篇关于解决 Windows 调用约定保留 xmm 寄存器?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆