Integer describing number of floating point arguments in xmm registers not passed to rax


Question

I have a function which is declared as follows:

double foo(int ** buffer, int size, ...);

The function is part of the C++ implementation of a program.

I use the last (variadic) parameter to pass multiple double variables to the function.

The problem is that on Mac I do not receive a valid number in the rax register, whereas on Ubuntu it works as expected.

A simple example:

C++:

#include <iostream>
extern "C" double foo(int ** buffer, int buffer_size, ...);

int main() {
    int* buffer [] = {new int(2), new int(3), new int(4)};
    std::cout<< foo(buffer, 2, 1.0, 2.0, 3.0) << '\n';
    std::cout<< foo(buffer, 3, 2.0, 3.0) << '\n';
    std::cout<< foo(buffer, 3) << '\n';
}

NASM assembly:

global foo

section .text

foo:
    cvtsi2sd xmm0, rax
    ret

Mac output:

1.40468e+14
1.40736e+14
1.40736e+14

Ubuntu output:

3
2
0

The program is 64-bit.

Recommended answer

The x86-64 System V ABI says the FP register arg count is passed in AL, and that the upper bytes of RAX are allowed to contain garbage. (The same is true of any narrow integer or FP arg. But see also this Q&A about clang assuming zero- or sign-extension of narrow integer args to 32 bits. That only applies to function args proper, not to AL.)

Use movzx eax, al to zero-extend AL into RAX. (Writing EAX implicitly zero-extends into RAX, unlike writing an 8-bit or 16-bit register.)

If there's another integer register you can clobber, use movzx ecx, al so that mov-elimination on Intel CPUs can work, making the instruction zero latency and not needing an execution port. Intel's mov-elimination fails when the src and dst are parts of the same register.

There's also zero benefit to using a 64-bit source for the conversion to FP. cvtsi2sd xmm0, eax is one byte shorter (no REX prefix), and after zero-extending AL into EAX you know that the signed 2's complement interpretations of EAX and RAX that cvtsi2sd uses are identical.

On your Mac, clang/LLVM chose to leave garbage in the upper bytes of RAX. LLVM's optimizer is less careful about avoiding false dependencies than gcc's, so it will sometimes write partial registers. (Sometimes it does so even when that doesn't save code size, but in this case it does.)

From your results, we can conclude that you are using clang on the Mac and gcc or ICC on Ubuntu.

It's easier to look at the compiler-generated asm from a simplified example (new and std::cout::operator<< result in a lot of code).

extern "C" double foo(int, ...);
int main() {
    foo(123, 1.0, 2.0);
}

This compiles to the following asm on the Godbolt compiler explorer, with gcc and clang at -O3:

### clang7.0 -O3
.section .rodata
.LCPI0_0:
    .quad   4607182418800017408     # double 1
.LCPI0_1:
    .quad   4611686018427387904     # double 2

.text
main:                                   # @main
    push    rax                  # align the stack by 16 before a call
    movsd   xmm0, qword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero
    movsd   xmm1, qword ptr [rip + .LCPI0_1] # xmm1 = mem[0],zero
    mov     edi, 123
    mov     al, 2                # leave the rest of RAX unmodified
    call    foo
    xor     eax, eax             # return 0
    pop     rcx
    ret

GCC emits basically the same thing, but with mov eax, 2 in place of mov al, 2:

 ## gcc8.2 -O3
    ...
    mov     eax, 2               # AL = RAX = 2   FP args in regs
    mov     edi, 123
    call    foo
    ...

Using mov eax,2 instead of mov al,2 avoids a false dependency on the old value of RAX on CPUs that don't rename AL separately from the rest of RAX. (Only Intel P6-family and Sandybridge rename it separately; IvyBridge and later do not, and neither do any AMD CPUs, Pentium 4, or Silvermont.)

See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent" for more about how IvB and later differ from Core2 / Nehalem.
