How to have GCC combine "move r10, r3; store r10" into a "store r3"?


Problem description

I'm working on Power9 and using the hardware random number generator instruction called DARN. I have the following inline assembly:

uint64_t val;
__asm__ __volatile__ (
    "xor 3,3,3                     \n"  // r3 = 0
    "addi 4,3,-1                   \n"  // r4 = -1, failure
    "1:                            \n"
    ".byte 0xe6, 0x05, 0x61, 0x7c  \n"  // r3 = darn 3, 1
    "cmpd 3,4                      \n"  // r3 == -1?
    "beq 1b                        \n"  // retry on failure
    "mr %0,3                       \n"  // val = r3
    : "=g" (val) : : "r3", "r4", "cc"
);

I had to add mr %0,3 and use "=g" (val) because I could not get GCC to produce the expected code with "=r3" (val). See also "Error: matching constraint not valid in output operand".

The disassembly shows:

(gdb) b darn.cpp : 36
(gdb) r v
...

Breakpoint 1, DARN::GenerateBlock (this=<optimized out>,
    output=0x7fffffffd990 "\b", size=0x100) at darn.cpp:77
77              DARN64(output+i*8);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.ppc64le libgcc-4.8.5-28.el7_5.1.ppc64le libstdc++-4.8.5-28.el7_5.1.ppc64le
(gdb) disass
Dump of assembler code for function DARN::GenerateBlock(unsigned char*, unsigned long):
   ...
   0x00000000102442b0 <+48>:    addi    r10,r8,-8
   0x00000000102442b4 <+52>:    rldicl  r10,r10,61,3
   0x00000000102442b8 <+56>:    addi    r10,r10,1
   0x00000000102442bc <+60>:    mtctr   r10
=> 0x00000000102442c0 <+64>:    xor     r3,r3,r3
   0x00000000102442c4 <+68>:    addi    r4,r3,-1
   0x00000000102442c8 <+72>:    darn    r3,1
   0x00000000102442cc <+76>:    cmpd    r3,r4
   0x00000000102442d0 <+80>:    beq     0x102442c8 <DARN::GenerateBlock(unsigned char*, unsigned long)+72>
   0x00000000102442d4 <+84>:    mr      r10,r3
   0x00000000102442d8 <+88>:    stdu    r10,8(r9)

Notice that GCC faithfully reproduces the final two instructions:

0x00000000102442d4 <+84>:    mr      r10,r3
0x00000000102442d8 <+88>:    stdu    r10,8(r9)

How do I get GCC to fold the two instructions into:

0x00000000102442d8 <+84>:    stdu    r3,8(r9)

Recommended answer

GCC will never remove text that's part of the asm template; it doesn't even parse the template beyond substituting for each %operand. It's literally just a text substitution before the asm is sent to the assembler.

You have to leave the mr out of your inline asm template and tell gcc that your output is in r3 (or use a memory-destination output operand, but don't do that). If your inline-asm template ever starts or ends with register-move instructions, you're usually doing it wrong.

On platforms that don't have specific-register constraints, use register uint64_t foo asm("r3"); to force "=r"(foo) to pick r3.

(Despite ISO C++17 removing the register keyword, this GNU extension still works with -std=c++17. You can also write register uint64_t foo __asm__("r3"); if you want to avoid the asm keyword. You probably still need to treat register as a reserved word in source that uses this extension; that's fine. ISO C++ removing it from the base language doesn't stop implementations from using it as part of an extension.)

Or better, don't hard-code a register number at all: use an assembler that supports the DARN instruction. (But apparently the instruction is so new that even up-to-date clang lacks it, and you'd only want this inline asm as a fallback for a gcc too old to support the __builtin_darn() intrinsic.)

Using these constraints also lets you remove the register setup from the template: assign foo=0 / bar=-1 in C before the inline asm statement, and pass the value in with "+r"(foo).

But note that darn's output register is write-only, so there's no need to zero r3 first. I found a copy of IBM's POWER ISA instruction set manual that is new enough to include darn here: https://wiki.raptorcs.com/w/images/c/cb/PowerISA_public.v3.0B.pdf#page=96

In fact, you don't need to loop inside the asm at all. You can leave the loop to the C code and wrap only the one asm instruction, which is what inline asm is designed for.

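#include <stdint.h>

uint64_t random_asm() {
  register uint64_t val asm("r3");   // pin val to r3, the register the encoded instruction writes
  do {
    //__asm__ __volatile__ ("darn 3, 1");
    // Hand-encoded "darn 3, 1" for assemblers that lack the mnemonic.
    // Bytes are in little-endian order, matching the question's ppc64le target.
    __asm__ __volatile__ (".byte 0xe6, 0x05, 0x61, 0x7c  # gcc asm operand = %0\n" : "=r" (val));
  } while (val == -1ULL);            // all ones is the error value: retry
  return val;
}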

This compiles cleanly, to a loop just as tight as yours but with less setup. (Are you sure you even need to zero r3 before the asm instruction?)

This function can inline anywhere you want it to, allowing gcc to emit a store instruction that reads r3 directly.
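As a rough illustration of that call pattern, here is a minimal sketch of a block-filling loop in the spirit of the question's GenerateBlock. The names next_word and generate_block are hypothetical, and next_word is only a counter-based placeholder so that the sketch builds on any target; on POWER9 its body would be the single-instruction darn wrapper shown above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical placeholder for the darn wrapper: returns 1, 2, 3, ...
   It is NOT random; it only stands in for the real instruction. */
static inline uint64_t next_word(void) {
    static uint64_t counter = 0;
    return ++counter;
}

/* Fill `size` bytes of `output` with 64-bit words from next_word().
   Because next_word() is small and inline, the compiler is free to
   store its result straight from whatever register holds it. */
void generate_block(unsigned char *output, size_t size) {
    size_t i = 0;
    for (; i + 8 <= size; i += 8) {
        uint64_t w = next_word();
        memcpy(output + i, &w, 8);          /* full 8-byte word */
    }
    if (i < size) {
        uint64_t w = next_word();
        memcpy(output + i, &w, size - i);   /* trailing partial word */
    }
}
```

Using memcpy for the store avoids alignment and aliasing problems; compilers typically lower a fixed 8-byte memcpy to a single store instruction.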

In practice, you'll want to use a retry counter, as advised in the manual: if the hardware RNG is broken, it might report failure forever, so you should have a fallback to a PRNG. (The same goes for x86's rdrand.)

Deliver A Random Number (darn) - Programming Note

When the error value is obtained, software is expected to repeat the operation. If a non-error value has not been obtained after several attempts, a software random number generation method should be used. The recommended number of attempts may be implementation specific. In the absence of other guidance, ten attempts should be adequate.
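The retry-then-fall-back policy from that note could be sketched roughly as follows. Everything here is an assumption made for illustration: the hw_rng_fn callback type, the function names, and the xorshift64 fallback (which is not cryptographically secure and only stands in for whatever software generator you would really use). The two stub_ functions simulate hardware so the sketch runs on any machine.

```c
#include <stdbool.h>
#include <stdint.h>

#define HW_RNG_ERROR        UINT64_MAX  /* darn signals failure with all ones */
#define HW_RNG_MAX_ATTEMPTS 10          /* the manual's default suggestion */

/* Hypothetical callback: one attempt at reading the hardware RNG. */
typedef uint64_t (*hw_rng_fn)(void);

/* Placeholder software generator: one xorshift64 step.
   The state must be nonzero. Not suitable for cryptographic use. */
uint64_t fallback_prng(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Try the hardware up to HW_RNG_MAX_ATTEMPTS times, then fall back.
   Returns true if *out came from the hardware source. */
bool random_with_fallback(hw_rng_fn hw, uint64_t *prng_state, uint64_t *out) {
    for (int attempt = 0; attempt < HW_RNG_MAX_ATTEMPTS; ++attempt) {
        uint64_t v = hw();
        if (v != HW_RNG_ERROR) {
            *out = v;
            return true;
        }
    }
    *out = fallback_prng(prng_state);
    return false;
}

/* Simulated hardware that never succeeds. */
uint64_t stub_always_fails(void) { return HW_RNG_ERROR; }

/* Simulated hardware that fails twice, then returns 42. */
uint64_t stub_ok_after_two(void) {
    static int calls = 0;
    return (++calls <= 2) ? HW_RNG_ERROR : 42;
}
```

On real hardware, the callback passed in would be a single-attempt wrapper around darn rather than one of the stubs, and the boolean result lets the caller log or count how often the fallback was taken.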


xor-zeroing is not efficient on most fixed-instruction-width ISAs, because a move-immediate is just as short, so there's no need to detect and special-case an xor. (And thus CPU designs don't spend transistors on it.) Moreover, the dependency rules for the PPC asm equivalent of C++11 std::memory_order_consume require xor to carry a dependency on its input register, so it couldn't be dependency-breaking even if the designers wanted it to be. xor-zeroing is only a thing on x86 and maybe a few other variable-width ISAs.

Use li r3, 0, like gcc does for int foo(){return 0;} https://godbolt.org/z/-gHI4C .
