How to specify clobbered bottom of the x87 FPU stack with extended gcc assembly?


Question


    In a codebase of ours I found this snippet for fast, towards-negative-infinity[1] rounding on x87:

    inline int my_int(double x)
    {
      int r;
    #ifdef _GCC_
      asm ("fldl %1\n"
           "fistpl %0\n"
           :"=m"(r)
           :"m"(x));
    #else
      // ...
    #endif
      return r;
    }
    

    I'm not extremely familiar with GCC extended assembly syntax, but from what I gather from the documentation:

    • r must be a memory location, where I'm writing back stuff;
    • x must be a memory location too, whence the data comes from.
    • there's no clobber specification, so the compiler can rest assured that at the end of the snippet the registers are as it left them.

    Now, to come to my question: it's true that in the end the FPU stack is balanced, but what if all the 8 locations were already in use and I'm overflowing it? How can the compiler know that it cannot trust ST(7) to be where it left it? Should some clobber be added?

    Edit: I tried to specify st(7) in the clobber list and it seems to affect the codegen; now I'll wait for some confirmation of this fact.


    As a side note: looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like

    __asm__ __volatile__ ("fistpl %0"
                          : "=m" (retval)
                          : "t" (x)
                          : "st");
    

    where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl); what is that "st" clobber? The docs seem to mention only t (i.e. the top of the stack).


    [1] Yes, it depends on the current rounding mode, which in our application should always be "towards negative infinity".

    Answer

    > looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like
    >
    >     __asm__ __volatile__ ("fistpl %0"
    >                           : "=m" (retval)
    >                           : "t" (x)
    >                           : "st");
    >
    > where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl)

    This is actually the correct way to represent the code you want as inline assembly.

    To get the best possible code generated, you want to make use of the inputs and outputs. Rather than hard-coding the necessary load/store instructions, let the compiler generate them. Not only does this introduce the possibility of eliding potentially unnecessary instructions, it also means that the compiler can better schedule these instructions when they are required (that is, it can interleave the instruction within a prior sequence of code, often minimizing its cost).

    > what is that "st" clobber? The docs seem to mention only t (i.e. the top of the stack).

    The "st" clobber refers to the st(0) register, i.e., the top of the x87 FPU stack. What Intel/MASM notation calls st(0), AT&T/GAS notation generally refers to as simply st. And, as per GCC's documentation for clobbers, the items in the clobber list are "either register names or the special clobbers" ("cc" (condition codes/flags) and "memory"). So this just means that the inline assembly clobbers (overwrites) the st(0) register. The reason why this clobber is necessary is that the fistpl instruction pops the top of the stack, thus clobbering the original contents of st(0).

    The only thing that concerns me regarding this code is the following paragraph from the documentation:

    > Clobber descriptions may not in any way overlap with an input or output operand. For example, you may not have an operand describing a register class with one member when listing that register in the clobber list. Variables declared to live in specific registers (see Explicit Register Variables) and used as asm input or output operands must have no part mentioned in the clobber description. In particular, there is no way to specify that input operands get modified without also specifying them as output operands.
    >
    > When the compiler selects which registers to use to represent input and output operands, it does not use any of the clobbered registers. As a result, clobbered registers are available for any use in the assembler code.

    As you already know, the t constraint means the top of the x87 FPU stack. The problem is that this is the same register as st, and the documentation says very clearly that a clobber may not name the same register as one of the input/output operands. Furthermore, since the documentation states that the compiler does not use any of the clobbered registers to represent input/output operands, this inline assembly appears to make an impossible request: load this value at the top of the x87 FPU stack without putting it in st!

    Now, I would assume that the authors of glibc know what they are doing and are more familiar with the compiler's implementation of inline assembly than you or I, so this code is probably legal and legitimate.

    Actually, it seems that the unusual case of the x87's stack-like registers forces an exception to the normal interactions between clobbers and operands. The official documentation says:

    > On x86 targets, there are several rules on the usage of stack-like registers in the operands of an asm. These rules apply only to the operands that are stack-like registers:
    >
    > 1. Given a set of input registers that die in an asm, it is necessary to know which are implicitly popped by the asm, and which must be explicitly popped by GCC.
    >
    >    An input register that is implicitly popped by the asm must be explicitly clobbered, unless it is constrained to match an output operand.

    That fits our case exactly.
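
Applied to the my_int function from the question, the operand-based form might look like the sketch below. The name my_int_operands, the architecture guard, and the lrint fallback for non-x86 builds are assumptions added for illustration; only the asm statement itself mirrors the glibc/MinGW snippet quoted above.

```c
#include <math.h>   /* lrint, used only by the non-x86 fallback */

/* Convert x to int using the current rounding mode. */
static inline int my_int_operands(double x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int r;
    __asm__ __volatile__ ("fistpl %0"
                          : "=m" (r)   /* fistpl stores directly to memory */
                          : "t" (x)    /* the compiler places x in st(0) */
                          : "st");     /* fistpl pops st(0), so clobber it */
    return r;
#else
    /* Assumed portable fallback; not part of the original snippet. */
    return (int)lrint(x);
#endif
}
```

Under the default round-to-nearest-even mode, my_int_operands(3.5) yields 4; with the rounding mode set toward negative infinity, as in the asker's application, it would yield 3.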

    Further confirmation is provided by an example appearing in the official documentation (bottom of the linked section):

    > This asm takes two inputs, which are popped by the fyl2xp1 opcode, and replaces them with one output. The st(1) clobber is necessary for the compiler to know that fyl2xp1 pops both inputs.
    >
    >     asm ("fyl2xp1" : "=t" (result) : "0" (x), "u" (y) : "st(1)");
    

    Here, the clobber st(1) names the same register as the input constraint u, which seems to violate the above-quoted documentation regarding clobbers. It is used and justified for precisely the same reason that "st" is the clobber in the lrint code you quoted: the instruction pops the input, just as fistpl does.
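
The documentation's one-line example can be wrapped into a complete function to see the three operand roles side by side. This is only a sketch: the name ylog2xp1, the use of double operands, the architecture guard, and the log2 fallback are assumptions made for illustration. Note that fyl2xp1 is only specified for |x| < 1 - sqrt(2)/2.

```c
#include <math.h>   /* log2, used only by the non-x86 fallback */

/* Returns y * log2(x + 1) for small x. */
static inline double ylog2xp1(double x, double y)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    double result;
    __asm__ ("fyl2xp1"
             : "=t" (result)   /* the result ends up in st(0) */
             : "0" (x),        /* x enters in st(0), matched to the output */
               "u" (y)         /* y enters in st(1) */
             : "st(1)");       /* y's slot is popped by the instruction */
    return result;
#else
    /* Assumed portable fallback; not part of the documentation example. */
    return y * log2(x + 1.0);
#endif
}
```

The x input needs no clobber because it is tied to the output through the "0" constraint; the y input has no matching output, so its register is listed as clobbered.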


    All of that said, and now that you know how to correctly write the code in inline assembly, I have to echo previous commenters who suggested that the best solution would be not to use inline assembly at all. Just call lrint, which not only has the exact semantics that you want, but can also be better optimized by the compiler under certain circumstances (e.g., transforming it into a single cvtsd2si instruction when the target architecture supports SSE).
