GCC inline assembly with stack operation
Question
I need inline assembly code with these properties:
- It contains a pair of push/pop operations (so the stack is balanced).
- It takes a variable in memory (not a register) as an input.
Like this:
__asm__ __volatile__ ("push %%eax\n\t"
// ... some operations that use ECX as a temporary
"mov %0, %%ecx\n\t"
// ... some other operation
"pop %%eax"
: : "m"(foo) : "ecx"); // ECX is overwritten, so it has to be declared as clobbered
// foo is my local variable, i.e. it lives on the stack
When I disassemble the compiled code, the compiler gives the memory address as something like 0xc(%esp), i.e. relative to ESP. Because I do a push before the mov, this fragment doesn't work: the push moves ESP down by 4 bytes, so 0xc(%esp) no longer refers to foo.
So how can I tell the compiler to address foo relative to EBP, with something like -8(%ebp), instead of relative to ESP?
P.S. You may suggest putting eax in the clobber list, but this is just sample code, and I'd rather not explain the full reason why I can't accept that solution.
Answer
Modifying ESP inside inline-asm should generally be avoided when you have any memory inputs / outputs, so you don't have to disable optimizations or force the compiler to make a stack-frame with EBP some other way. One major advantage is that you (or the compiler) can then use EBP as an extra free register; potentially a significant speedup if you're already having to spill/reload stuff. If you're writing inline asm, presumably this is a hotspot so it's worth spending the extra code-size to use ESP-relative addressing modes.
In x86-64 code there's an added obstacle to using push/pop safely: you can't tell the compiler you want to clobber the red zone (the 128 bytes below RSP). You can compile with -mno-red-zone, but there's no way to disable it from the C source. Otherwise you can end up clobbering the compiler's own data on the stack. No 32-bit x86 ABI has a red zone, though, so this only applies to x86-64 System V (or to non-x86 ISAs with a red zone).
You only need -fno-omit-frame-pointer for that function if you want to do asm-only things like using push as a stack data structure, so that the amount pushed varies. Or maybe if you're optimizing for code size.
You can always write a whole non-inline function in asm and put it in a separate file; then you have full control. But only do that if your function is large enough to be worth the call/ret overhead, e.g. if it includes a whole loop. Don't make the compiler call a short non-looping function inside a C inner loop, destroying all the call-clobbered registers and having to make sure globals are in sync.
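Here is a minimal sketch of that approach. It assumes x86-64 System V or 32-bit cdecl on a non-Windows target; other targets get a plain C fallback. The function name asm_add3 is hypothetical. To keep the example in one file, it uses a file-scope __asm__ statement instead of a separate .S file, but the assembly body would be the same either way. Inside a function written entirely in asm, push/pop is safe because you track the stack offsets yourself. The 32-bit version shows this: after saving EBX, every argument offset shifts by 4.

```c
/* Give the function an explicit assembler name so the label defined below
 * matches on both ELF (no underscore prefix) and Mach-O (underscore prefix). */
int asm_add3(int a, int b, int c) __asm__("asm_add3_impl");

#if defined(__x86_64__) && !defined(_WIN32)
/* x86-64 System V: a, b, c arrive in EDI, ESI, EDX; result goes in EAX. */
__asm__(
    ".text\n"
    ".globl asm_add3_impl\n"
    "asm_add3_impl:\n"
    "    leal (%rdi,%rsi), %eax\n"   /* eax = a + b */
    "    addl %edx, %eax\n"          /* eax += c */
    "    ret\n");
#elif defined(__i386__) && !defined(_WIN32)
/* 32-bit cdecl: arguments are on the stack above the return address. */
__asm__(
    ".text\n"
    ".globl asm_add3_impl\n"
    "asm_add3_impl:\n"
    "    pushl %ebx\n"               /* callee-saved; we own the stack here */
    "    movl 8(%esp), %eax\n"       /* a: offset includes the pushed EBX */
    "    movl 12(%esp), %ebx\n"      /* b */
    "    addl %ebx, %eax\n"
    "    addl 16(%esp), %eax\n"      /* c */
    "    popl %ebx\n"
    "    ret\n");
#else
/* Fallback for other targets: same behavior in plain C. */
int asm_add3(int a, int b, int c) { return a + b + c; }
#endif
```

On every branch, asm_add3(1, 2, 3) returns 6. Whether this pays off depends on the function being big enough, per the advice above.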
It seems you're using push/pop inside inline asm because you don't have enough registers and need to save/reload something. You don't need to use push/pop for save/restore. Instead, use dummy output operands with "=m" constraints to get the compiler to allocate stack space for you, and use mov to/from those slots. (Of course you're not limited to mov; it can be a win to use a memory source operand for an ALU instruction if you only need the value once or twice.)
This may be slightly worse for code-size, but is usually not worse for performance (and can be better). If that's not good enough, write the whole function (or the whole loop) in asm so you don't have to wrestle with the compiler.
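Applied to the snippet from the question, the push/pop pair becomes a store to, and a reload from, a dummy "=m" output. ESP never moves, so the compiler's ESP-relative address for foo stays valid. This is a minimal sketch assuming 32- or 64-bit x86 and GCC/Clang extended asm. The function name times_three and the imul instruction are hypothetical stand-ins for the question's "some operations". The sketch relies on the compiler not using EAX to form the address of a memory operand, which holds for ordinary locals like these. If you can take EAX as an operand or declare it clobbered, that is simpler.

```c
int times_three(int foo)
{
#if defined(__i386__) || defined(__x86_64__)
    int eax_save;   /* dummy "=m" output: a stack slot the compiler allocates */
    int result;
    __asm__ ("mov %%eax, %[save]\n\t"      /* replaces: push %eax */
             "mov %[foo], %%ecx\n\t"       /* %[foo] is still valid: ESP never moved */
             "imul $3, %%ecx, %%eax\n\t"   /* stand-in work that uses EAX as scratch */
             "mov %%eax, %[res]\n\t"
             "mov %[save], %%eax"          /* replaces: pop %eax */
             : [save] "=m" (eax_save), [res] "=m" (result)
             : [foo] "m" (foo)
             : "ecx", "cc");               /* ECX and flags are modified */
    return result;
#else
    return foo * 3;   /* non-x86 fallback with the same result */
#endif
}
```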
int foo(char *p, int a, int b) {
int t1,t2; // dummy output spill slots
int r1,r2; // dummy output tmp registers
int res;
asm ("# operands: %0 %1 %2 %3 %4 %5 %6 %7 %8\n\t"
"imull $123, %[b], %[res]\n\t"
"mov %[res], %[spill1]\n\t"
"mov %[a], %%ecx\n\t"
"mov %[b], %[tmp1]\n\t" // let the compiler allocate tmp regs, unless you need specific regs e.g. for a shift count
"mov %[spill1], %[res]\n\t"
: [res] "=&r" (res),
[tmp1] "=&r" (r1), [tmp2] "=&r" (r2), // early-clobber
[spill1] "=m" (t1), [spill2] "=&rm" (t2) // allow spilling to a register if there are spare regs
, [p] "+&r" (p)
, "+m" (*(char (*)[]) p) // dummy in/output instead of memory clobber
: [a] "rmi" (a), [b] "rm" (b) // a can be an immediate, but b can't
: "ecx"
);
return res;
// p unused in the rest of the function
// so it's really just an input to the asm,
// which the asm is allowed to destroy
}
This compiles to the following asm with gcc7.3 -O3 -m32 on the Godbolt compiler explorer. Note the asm comment showing what the compiler picked for all the template operands: it picked 12(%esp) for %[spill1] and %edi for %[spill2]. The second choice happened because I used "=&rm" for that operand, so the compiler saved/restored %edi outside the asm and gave it to us for that dummy operand.
foo(char*, int, int):
pushl %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $16, %esp
movl 36(%esp), %edx
movl %edx, %ebp
#APP
# 19 "/tmp/compiler-explorer-compiler118120-55-w92ge8.v797i/example.cpp" 1
# operands: %eax %ebx %esi 12(%esp) %edi %ebp (%edx) 40(%esp) 44(%esp)
imull $123, 44(%esp), %eax
mov %eax, 12(%esp)
mov 40(%esp), %ecx
mov 44(%esp), %ebx
mov 12(%esp), %eax
# 0 "" 2
#NO_APP
addl $16, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
Hmm, the dummy memory operand that tells the compiler which memory we modify seems to have resulted in dedicating a register to it. I guess that's because the p operand is early-clobber, so the dummy operand can't be addressed through the same register. You could probably risk leaving off the early-clobber if you're confident none of the other inputs will use the same register as p (i.e. that none of them has the same value).
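For comparison, here is a minimal sketch where the asm only reads p and never modifies it. The helper store_u32 is hypothetical and not part of the original answer. Because p is never modified, it can be a plain "r" input with no early-clobber. The compiler is then free to address the dummy "=m" operand (which describes exactly the bytes written, instead of a "memory" clobber) through the same register, so no extra register needs to be tied up. The sketch assumes 32- or 64-bit x86 and falls back to memcpy elsewhere.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store v into the 4 bytes starting at p. */
static void store_u32(unsigned char *p, uint32_t v)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("movl %[v], (%[p])"
             /* Dummy output naming exactly the 4 bytes written, so the
                compiler knows what changed without a "memory" clobber. */
             : "=m" (*(unsigned char (*)[4]) p)
             /* p is only read, so a plain input (no early-clobber) is enough. */
             : [p] "r" (p), [v] "r" (v));
#else
    memcpy(p, &v, sizeof v);   /* portable fallback for non-x86 targets */
#endif
}
```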