How to have GCC combine "move r10, r3; store r10" into a "store r3"?
Question
I'm working on Power9 and using the hardware random number generator instruction called DARN. I have the following inline assembly:
uint64_t val;
__asm__ __volatile__ (
"xor 3,3,3 \n" // r3 = 0
"addi 4,3,-1 \n" // r4 = -1, failure
"1: \n"
".byte 0xe6, 0x05, 0x61, 0x7c \n" // r3 = darn 3, 1
"cmpd 3,4 \n" // r3 == -1?
"beq 1b \n" // retry on failure
"mr %0,3 \n" // val = r3
: "=g" (val) : : "r3", "r4", "cc"
);
I had to add a mr %0,3 with "=g" (val) because I could not get GCC to produce the expected code with "=r3" (val). Also see Error: matching constraint not valid in output operand.
The disassembly shows:
(gdb) b darn.cpp : 36
(gdb) r v
...
Breakpoint 1, DARN::GenerateBlock (this=<optimized out>,
output=0x7fffffffd990 "\b", size=0x100) at darn.cpp:77
77 DARN64(output+i*8);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.ppc64le libgcc-4.8.5-28.el7_5.1.ppc64le libstdc++-4.8.5-28.el7_5.1.ppc64le
(gdb) disass
Dump of assembler code for function DARN::GenerateBlock(unsigned char*, unsigned long):
...
0x00000000102442b0 <+48>: addi r10,r8,-8
0x00000000102442b4 <+52>: rldicl r10,r10,61,3
0x00000000102442b8 <+56>: addi r10,r10,1
0x00000000102442bc <+60>: mtctr r10
=> 0x00000000102442c0 <+64>: xor r3,r3,r3
0x00000000102442c4 <+68>: addi r4,r3,-1
0x00000000102442c8 <+72>: darn r3,1
0x00000000102442cc <+76>: cmpd r3,r4
0x00000000102442d0 <+80>: beq 0x102442c8 <DARN::GenerateBlock(unsigned char*, unsigned long)+72>
0x00000000102442d4 <+84>: mr r10,r3
0x00000000102442d8 <+88>: stdu r10,8(r9)
Notice that GCC faithfully reproduces:
0x00000000102442d4 <+84>: mr r10,r3
0x00000000102442d8 <+88>: stdu r10,8(r9)
How do I get GCC to fold the two instructions into:
0x00000000102442d8 <+84>: stdu r3,8(r9)
Answer
GCC will never remove text that's part of the asm template; it doesn't even parse it other than substituting in for %operand. It's literally just a text substitution before the asm is sent to the assembler.
You have to leave out the mr from your inline asm template, and tell gcc that your output is in r3 (or use a memory-destination output operand, but don't do that). If your inline-asm template ever starts or ends with mov instructions, you're usually doing it wrong.
Use register uint64_t foo asm("r3"); to force "=r"(foo) to pick r3 on platforms that don't have specific-register constraints.
(Despite ISO C++17 removing the register keyword, this GNU extension still works with -std=c++17. You can also use register uint64_t foo __asm__("r3"); if you want to avoid the asm keyword. You probably still need to treat register as a reserved word in source that uses this extension; that's fine. ISO C++ removing it from the base language doesn't force implementations to not use it as part of an extension.)
Or better, don't hard-code a register number. Use an assembler that supports the DARN instruction. (But apparently it's so new that even up-to-date clang lacks it, and you'd only want this inline asm as a fallback for gcc too old to support the __builtin_darn() intrinsic.)
Using these constraints will let you remove the register setup, too: set foo=0 / bar=-1 in C before the inline asm statement, and use "+r"(foo).
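As a rough illustration of that last point, here is a minimal sketch in which the setup is done in C and the value is handed to the asm statement as a read-write "+r" operand. The function name pass_through is hypothetical, and the template is deliberately empty so the sketch builds on any GCC-compatible target; a real template would refer to the operand as %0.

```c
#include <stdint.h>

/* Hypothetical helper: the initial value is set up in C, not inside the asm.
   "+r" tells the compiler that the asm both reads and writes foo, and lets
   the compiler choose which register holds it. */
static uint64_t pass_through(uint64_t seed)
{
    uint64_t foo = seed;                      /* register setup done in C */
    __asm__ __volatile__ ("" : "+r" (foo));   /* empty template: value unchanged */
    return foo;
}
```

Because the template contains no instructions, the function returns its argument unchanged; the point is only that the constraints, not a mov inside the template, tell the compiler where the value lives.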
But note that darn's output register is write-only. There's no need to zero r3 first. I found a copy of IBM's POWER ISA instruction set manual that is new enough to include darn here: https://wiki.raptorcs.com/w/images/c/cb/PowerISA_public.v3.0B.pdf#page=96
In fact, you don't need to loop inside the asm at all; you can leave that to the C and only wrap the one asm instruction, like inline asm is designed for.
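uint64_t random_asm() {
    // Pin val to r3 so the "=r" output matches the register hard-coded in the .byte encoding.
    register uint64_t val asm("r3");
    do {
        //__asm__ __volatile__ ("darn 3, 1");
        // Bytes of "darn 3, 1" in big-endian order; on little-endian ppc64le use 0xe6, 0x05, 0x61, 0x7c.
        __asm__ __volatile__ (".byte 0x7c, 0x61, 0x05, 0xe6 # gcc asm operand = %0\n" : "=r" (val));
    } while(val == -1ULL);  // darn returns all-ones on failure
    return val;
}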
This compiles cleanly and is just as tight as your loop, with less setup. (Are you sure you even need to zero r3 before the asm instruction?)
This function can inline anywhere you want it to, allowing gcc to emit a store instruction that reads r3 directly.
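To make the inlining point concrete, here is a sketch of such a wrapper together with a caller that stores its results into a buffer, roughly in the spirit of the GenerateBlock loop from the question. Several things are assumptions for illustration: the names darn64 and fill_block are hypothetical, the instruction is emitted with .long 0x7c6105e6 (the word encoding of darn 3,1) so that the assembler picks the target's byte order, and the non-POWER branch is a non-random stand-in that exists only so the sketch builds on other machines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__powerpc64__)
/* One darn per asm statement; the retry loop stays in C. */
static inline uint64_t darn64(void)
{
    register uint64_t val __asm__("r3");   /* encoding below hard-codes r3 */
    do {
        /* darn 3,1 as a 32-bit word; .long follows the target endianness. */
        __asm__ __volatile__ (".long 0x7c6105e6" : "=r" (val));
    } while (val == UINT64_MAX);           /* all-ones signals failure */
    return val;
}
#else
/* Stand-in for other targets so the sketch compiles. NOT random. */
static inline uint64_t darn64(void)
{
    static uint64_t counter;
    return counter++;
}
#endif

/* Hypothetical caller: fill `size` bytes with values from darn64(). */
static void fill_block(unsigned char *output, size_t size)
{
    size_t i = 0;
    for (; i + 8 <= size; i += 8) {
        uint64_t v = darn64();
        memcpy(output + i, &v, 8);          /* typically becomes one store */
    }
    if (i < size) {                         /* trailing partial word */
        uint64_t v = darn64();
        memcpy(output + i, &v, size - i);
    }
}
```

With optimization enabled, the compiler is free to inline darn64 into the loop and store the output register directly, since no mr appears in the template.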
In practice, you'll want to use a retry counter, as advised in the manual: if the hardware RNG is broken, it might give you failure forever, so you should have a fallback to a PRNG. (Same for x86's rdrand.)
Deliver A Random Number (darn) - Programming Note
When the error value is obtained, software is expected to repeat the operation. If a non-error value has not been obtained after several attempts, a software random number generation method should be used. The recommended number of attempts may be implementation specific. In the absence of other guidance, ten attempts should be adequate.
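The retry-then-fallback pattern from the programming note could look roughly like the sketch below. Everything named here is hypothetical: hw_rng_fn is a callback that makes one attempt at the hardware instruction, random_with_retry is the wrapper, and splitmix64_next is just a placeholder software generator (not cryptographically secure; a real program would substitute a properly seeded CSPRNG). The two stub sources exist only to exercise the logic without POWER9 hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* darn reports failure by returning all-ones. */
#define DARN_ERROR UINT64_MAX

/* Hypothetical callback type: one attempt at reading the hardware RNG. */
typedef uint64_t (*hw_rng_fn)(void);

/* Placeholder software generator (one splitmix64 step). */
static uint64_t splitmix64_next(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Try the hardware source up to max_attempts times. Returns true and stores
   the hardware value in *out on success; otherwise stores a value from the
   software generator and returns false. */
static bool random_with_retry(hw_rng_fn hw, int max_attempts,
                              uint64_t *prng_state, uint64_t *out)
{
    for (int i = 0; i < max_attempts; i++) {
        uint64_t v = hw();
        if (v != DARN_ERROR) {
            *out = v;
            return true;
        }
    }
    *out = splitmix64_next(prng_state);
    return false;
}

/* Stand-in sources for demonstration only. */
static uint64_t stub_always_fails(void) { return DARN_ERROR; }

static int stub_calls = 0;
static uint64_t stub_fails_twice(void)
{
    return (++stub_calls <= 2) ? DARN_ERROR : 0x1234u;
}
```

A caller would pass a function that issues a single darn as the hw argument and use ten for max_attempts, following the note's guidance; the boolean result lets it record that the hardware source was unavailable.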
xor-zeroing is not efficient on most fixed-instruction-width ISAs, because a mov-immediate is just as short, so there's no need to detect and special-case an xor. (And thus CPU designs don't spend transistors on it.) Moreover, dependency rules for the PPC asm equivalent of C++11 std::memory_order_consume require it to carry a dependency on the input register, so it couldn't be dependency-breaking even if the designers wanted it to. xor-zeroing is only a thing on x86 and maybe a few other variable-width ISAs.
Use li r3, 0 like gcc does for int foo(){return 0;}: https://godbolt.org/z/-gHI4C