What happens at assembly level when you have functions with large inputs

Question

While JavaScript doesn't compile directly to assembly, it should illustrate the general question: how would a high-level function look if it were implemented in assembly and its inputs were large? Say, for example, this case:

myfunc(1, 2, 3)

The variables there are small integers so they could be placed on individual registers. But say you have:

var a = 'some markdown readme...'
myfunc('my really long string', a, 'etc.')

Wondering how that would be done in assembly (at a high level).

It doesn't seem that the assembly call stack would be used to store these values, because they are large. Maybe it stores the memory address and the offset of it (but if it's dynamic...). Am interested to know how this works.

Recommended answer

Arrays (including strings) are passed by reference in most high-level languages. int foo(char*) just gets a pointer value as an arg, and a pointer is typically one machine word (i.e. it fits in a register). In good modern calling conventions, the first few integer/pointer args are typically passed in registers.

In C/C++, you can't pass a bare array by value. Given int arr[16]; func(arr);, the function func only gets a pointer (to the first element).
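As a rough illustration of the point above, here is a minimal C sketch. The function names set_first and text_length are made up for this example. It shows that a callee declared with an array parameter really works on the caller's storage, and that a long string reaches the callee as a single pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The parameter is written as an array, but C adjusts it to `int *`.
   Writing through it therefore changes the caller's array: nothing
   was copied. (Hypothetical helper, for illustration only.) */
void set_first(int arr[16], int value) {
    assert(arr != NULL);
    arr[0] = value;
}

/* A string argument is likewise just one pointer, however long the
   text is. The callee finds the end by scanning for the terminator. */
size_t text_length(const char *text) {
    assert(text != NULL);
    return strlen(text);
}
```

Calling set_first(a, 7) on a local int a[16] leaves a[0] equal to 7 in the caller, which would not happen if the array had been copied.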

In some other higher level languages, arrays might be more like C++ std::vector so the callee might be able to grow/shrink the array and find out its length without a separate arg. That would typically mean there's a "control block".
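A control block of that kind could look roughly like the following C sketch. The int_vec type and int_vec_push function are hypothetical and only loosely modeled on what a std::vector keeps internally; real language runtimes differ in layout and growth policy. The point is that the caller passes one pointer to the control block, and the callee can read the length and grow the storage through it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical control block: pointer to the elements plus bookkeeping. */
typedef struct {
    int   *data;  /* heap storage, may move when the array grows */
    size_t len;   /* number of elements in use */
    size_t cap;   /* number of elements allocated */
} int_vec;

/* Appends one value. Returns 0 on success, -1 if allocation fails.
   Only the pointer `v` is passed, so it fits in one register. */
int int_vec_push(int_vec *v, int value) {
    assert(v != NULL);
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 4;
        int *p = realloc(v->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        v->data = p;
        v->cap = new_cap;
    }
    v->data[v->len++] = value;
    return 0;
}
```

Starting from int_vec v = {0}, repeated calls to int_vec_push(&v, x) grow the array as needed; the caller releases the storage with free(v.data).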

In C and C++, you can pass structs by value, and then the calling-convention rules specify how they are passed.

x86-64 System V, for example, passes structs of 16 bytes or less packed into up to 2 integer registers. Larger structs are copied onto the stack, regardless of how large an array member they contain (see "What kind of C11 data type is an array according to the AMD64 ABI"). (So don't pass giant objects by value to non-inline functions!)

The Windows x64 calling convention passes large structs by hidden reference.
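To make "hidden reference" concrete, here is a hand-written C sketch of roughly what the compiler arranges behind the scenes. The names func_lowered and call_by_value are invented for this illustration, and the exact place where a real compiler puts the temporary is up to the compiler. The caller makes a temporary copy and passes its address; the callee may modify that copy freely.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int arr[4];
    long long a, b;
} bigobj;

/* Stand-in for the callee after lowering: it receives a pointer to a
   copy that it owns and may modify. */
int func_lowered(bigobj *arg) {
    arg->arr[3]++;
    return arg->arr[3];
}

/* Stand-in for what the caller does for a by-value call `func(obj)`:
   copy the object, then pass the address of the copy. */
int call_by_value(const bigobj *obj) {
    assert(obj != NULL);
    bigobj tmp = *obj;           /* caller-made temporary copy */
    return func_lowered(&tmp);   /* one pointer, fits in a register */
}
```

After call_by_value(&o), the original o is unchanged, which is what by-value semantics require even though only a pointer crossed the call boundary.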

Example:

typedef struct {
    // too big makes the asm output cluttered with loops or memcpy
    // int Big_McLargeHuge[1024*1024];
    int arr[4];
    long long a,b; //,c,d;
} bigobj;
// total 32 bytes with int=4, long long=8 bytes

int func(bigobj a);
int foo(bigobj a) {
    a.arr[3]++;
    return func(a);
}

You can try other architectures on Godbolt with their standard calling conventions, like ARM or AArch64. I picked x86-64 because I happened to know of an interesting difference in the two major calling conventions on that one platform for struct-passing.

x86-64 System V (gcc7.3 -O3): foo has a real by-value copy of its arg (done by its caller) that it can modify, so it does so and uses it as the arg for the tail-call. (If it can't tailcall, it would have to make yet another full copy. This example artificially makes System V look really good).

foo(bigobj):
    add     DWORD PTR [rsp+20], 1   # increment the struct member in the arg on the stack
    jmp     func(bigobj)            # tailcall func(a)

x86-64 Windows (MSVC CL19 /Ox): note that we address a.arr[3] via RCX, the first integer/pointer arg. So there is a hidden reference, but it's not a const-reference. This function was called by value, but it's modifying the data it got by reference. So the caller has to make a copy, or at least assume that a callee destroyed the arg it got a pointer to. (No copy required if the object is dead after that, but that's only possible for local struct objects, not for passing a pointer to a global or something).

$T1 = 32    ; offset of the tmp copy in this function's stack frame
foo PROC
    sub      rsp, 72              ; 00000048H     ; 32B of shadow space + 32B bigobj + 8 to align
    inc      DWORD PTR [rcx+12]
    movups   xmm0, XMMWORD PTR [rcx]              ; load modified `a`
    movups   xmm1, XMMWORD PTR [rcx+16]           ; apparently alignment wasn't required
    lea      rcx, QWORD PTR $T1[rsp]
    movaps   XMMWORD PTR $T1[rsp], xmm0
    movaps   XMMWORD PTR $T1[rsp+16], xmm1         ; store a copy
    call     int __cdecl func(struct bigobj)
    add      rsp, 72              ; 00000048H
    ret      0
foo ENDP

Making another copy of the object appears to be a missed optimization. I think this would be a valid implementation of foo for the same calling convention:

foo:
    add      DWORD PTR [rcx+12], 1       ; more efficient than INC because of the memory dst, on Intel CPUs
    jmp      func                        ; tailcall with pointer still in RCX

x86-64 clang for the SysV ABI also misses the optimization that gcc7.3 found, and does copy like MSVC.

So the ABI difference is less interesting than I thought; in both cases the callee "owns" the arg, even though for Windows it's not guaranteed to be on the stack. I guess this enables dynamic allocation for passing very large objects by value without a stack overflow, but that's kind of pointless. Just don't do it in the first place.

x86-64 System V passes small objects packed into registers. Clang finds a neat optimization if you comment out the long long members so you just have:

typedef struct {
    int arr[4];
    //    long long a,b; //,c,d;
} bigobj;

# clang6.0 -O3
foo(bigobj):                          # @foo(bigobj)
    movabs  rax, 4294967296    # 0x100000000 = 1ULL << 32
    add     rsi, rax
    jmp     func(bigobj)          # TAILCALL

(arr[0..1] is packed into RDI, and arr[2..3] is packed into RSI, the first 2 integer/pointer arg-passing registers in the x86-64 SysV ABI).

gcc unpacks arr[3] into a register by itself where it can increment it.

But clang, instead of unpacking and repacking, increments the high 32 bits of RSI by adding 1ULL<<32.
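The same trick can be written out in portable C. This is only a sketch of the arithmetic, with helper names (pack, low_half, high_half, increment_high) invented for the example; it assumes the little-endian layout used on x86-64, where the element at the lower address ends up in the low 32 bits of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 32-bit values into one 64-bit word:
   `lo` in bits 0..31, `hi` in bits 32..63. */
uint64_t pack(uint32_t lo, uint32_t hi) {
    return ((uint64_t)hi << 32) | lo;
}

uint32_t low_half(uint64_t w)  { return (uint32_t)w; }
uint32_t high_half(uint64_t w) { return (uint32_t)(w >> 32); }

/* Increment only the high half, without unpacking: adding 1 << 32
   cannot carry into the low 32 bits, and a carry out of bit 63 is
   simply discarded, so the high half wraps like a 32-bit integer. */
uint64_t increment_high(uint64_t w) {
    return w + (1ULL << 32);
}

/* Small self-check mirroring arr[2] = 3, arr[3] = 4 from the example. */
void demo(void) {
    uint64_t w = increment_high(pack(3, 4));
    assert(low_half(w) == 3);
    assert(high_half(w) == 5);
}
```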

MSVC still passes by hidden reference, and still copies the whole object.
