GCC code that seems to break inline assembly rules but an expert believes otherwise


Problem Description

I was in a discussion with an expert who allegedly has coding skills vastly superior to mine and who understands inline assembly far better than I ever could.

One of the claims is that as long as an operand appears as an input constraint, you don't need to list it as a clobber or specify that the register has been potentially modified by the inline assembly. The conversation came about when someone else was trying to get assistance on a memset implementation that was effectively coded this way:

void *memset(void *dest, int value, size_t count)
{
    asm volatile  ("cld; rep stosb" :: "D"(dest), "c"(count), "a"(value));
    return dest;
}

When I commented on the problem of clobbering registers without telling the compiler, the expert's response was:

"c"(count) already tells the compiler c is clobbered

I found an example in the expert's own operating system where they write similar code with the same design pattern. They use Intel syntax for their inline assembly. This hobby operating system code operates in a kernel (ring0) context. An example is this buffer swap function [1]:

void swap_vbufs(void) {
    asm volatile (
        "1: "
        "lodsd;"
        "cmp eax, dword ptr ds:[rbx];"
        "jne 2f;"
        "add rdi, 4;"
        "jmp 3f;"
        "2: "
        "stosd;"
        "3: "
        "add rbx, 4;"
        "dec rcx;"
        "jnz 1b;"
        :
        : "S" (antibuffer0),
          "D" (framebuffer),
          "b" (antibuffer1),
          "c" ((vbe_pitch / sizeof(uint32_t)) * vbe_height)
        : "rax"
    );

    return;
}

antibuffer0, antibuffer1, and framebuffer are all buffers in memory treated as arrays of uint32_t. framebuffer is actual video memory (MMIO) and antibuffer0, antibuffer1 are buffers allocated in memory.

The global variables are properly set up before this function is called. They are declared as:

volatile uint32_t *framebuffer;
volatile uint32_t *antibuffer0;
volatile uint32_t *antibuffer1;

int vbe_width = 1024;
int vbe_height = 768;
int vbe_pitch;


My Questions and Concerns about this Kind of Code

As an apparent neophyte to inline assembly with an apparently naive understanding of the subject, I'm wondering whether my apparently uneducated belief that this code is potentially very buggy is correct. I want to know if these concerns have any merit:

  1. RDI, RSI, RBX, and RCX are all modified by this code. RDI and RSI are incremented by LODSD and STOSD implicitly. The rest are modified explicitly with

        "add rbx, 4;"
        "dec rcx;"
    

    None of these registers are listed as input/output operands, nor are they listed as output operands. I believe these constraints need to be modified to inform the compiler that these registers may have been modified/clobbered. The only register that is listed as clobbered, which I believe is correct, is RAX. Is my understanding correct? My feeling is that RDI, RSI, RBX, and RCX should be input/output constraints (using the + modifier). Even if one tries to argue that the 64-bit System V ABI calling convention will save them (relying on such assumptions is, IMHO, a poor way to write code), RBX is a non-volatile register that this code changes.

  2. Since the addresses are passed via registers (and not memory constraints), I believe it is a potential bug that the compiler hasn't been told that memory that these pointers are pointing at has been read and/or modified. Is my understanding correct?

  3. RBX, and RCX are hard coded registers. Wouldn't it make sense to allow the compiler to choose these registers automatically via the constraints?

  4. If one assumes that inline assembly has to be used here (hypothetically) what would bug free GCC inline assembly code look like for this function? Is this function fine as is, and I just don't understand the basics of GCC's extended inline assembly like the expert does?


Footnotes

  • [1] The swap_vbufs function and associated variable declarations have been reproduced verbatim without the copyright holder's permission under fair use for purposes of commentary about a larger body of work.

Solution

You are correct on all counts: this code is full of lies to the compiler that could bite you, e.g. with different surrounding code, or different compiler versions / options (especially link-time optimization, which enables cross-file inlining).

swap_vbufs doesn't even look very efficient, I suspect gcc would do equal or better with a pure C version. https://gcc.gnu.org/wiki/DontUseInlineAsm. stosd is 3 uops on Intel, worse than a regular mov-store + add rdi,4. And making add rdi,4 unconditional would avoid the need for that else block which puts an extra jmp on the (hopefully) fast path where there's no MMIO store to video RAM because the buffers were equal.

(lodsd is only 2 uops on Haswell and newer so that's ok if you don't care about IvyBridge or older).

In kernel code I guess they're avoiding SSE2, even though it's baseline for x86-64, otherwise you'd probably want to use that. For a normal memory destination, you'd just memcpy with rep movsd or ERMSB rep movsb, but I guess the point here is to avoid MMIO stores when possible by checking against a cached copy of video RAM. Still, unconditional streaming stores with movnti might be efficient, unless video RAM is mapped UC (uncacheable) instead of WC.
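As a rough illustration of the "pure C version" mentioned above, here is a minimal sketch. The function name swap_vbufs_c and its parameters are hypothetical (the original takes its buffers from globals); it only assumes the same three dword buffers and a pixel count. Unlike the asm loop, it also handles a count of zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pure C equivalent of the asm loop: copy each dword of
 * `src` to the (MMIO) `fb` only when it differs from the cached copy
 * in `cache`. Names are illustrative, not taken from the original OS. */
void swap_vbufs_c(volatile uint32_t *fb,
                  const volatile uint32_t *src,
                  const volatile uint32_t *cache,
                  size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t px = src[i];      /* what lodsd loads into EAX */
        if (px != cache[i])        /* the cmp against [rbx] */
            fb[i] = px;            /* store only when the pixel changed */
    }
}
```

Because everything is visible to the compiler, there are no constraints to get wrong, and the optimizer is free to choose registers and addressing modes.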


It's easy to construct examples where this really does break in practice, by e.g. using the relevant C variable again after the inline asm statement in the same function. (Or in a parent function which inlined the asm).

An input you want to destroy usually has to be handled with a matching dummy output or a RMW output with a C tmp var, not just "r" or "a".

"r" or any specific-register constraint like "D" means this is a read-only input, and the compiler can expect to find the value undisturbed afterwards. There is no "input I want to destroy" constraint; you have to synthesize that with a dummy output or variable.

This all applies to other compilers (clang and ICC) that support GNU C inline asm syntax.

From the GCC manual: Extended asm Input Operands:

Do not modify the contents of input-only operands (except for inputs tied to outputs). The compiler assumes that on exit from the asm statement these operands contain the same values as they had before executing the statement. It is not possible to use clobbers to inform the compiler that the values in these inputs are changing.

(An rax clobber makes it an error to use "a" as an input; clobbers and operands can't overlap.)


Example for 1: register input operands

int plain_C(int in) {   return (in+1) + in;  }

// buggy: modifies an input read-only operand
int bad_asm(int in) {
    int out;
    asm ("inc %%edi;\n\t mov %%edi, %0" : "=a"(out) : [in]"D"(in) );
    return out + in;
}

Compiled on the Godbolt compiler explorer

Notice that gcc's addl uses edi for in, even though inline asm used that register as an input. (And thus breaks because this buggy inline asm modifies the register). It happens to hold in+1 in this case. I used gcc9.1, but this is not new behaviour.

## gcc9.1 -O3 -fverbose-asm
bad_asm(int):
        inc %edi;
         mov %edi, %eax         # out  (comment mentions out because I used %0)

        addl    %edi, %eax      # in, tmp86
        ret     

We fix that by telling the compiler that the same input register is also an output, so it can no longer count on that register still holding in afterwards. (Or by using auto tmp = in; asm("..." : "+r"(tmp));)

int safe_asm(int in) {
    int out;
    int dummy;
    asm ("inc %%edi;\n\t mov %%edi, %%eax"
     : "=a"(out),
       "=&D"(dummy)
     : [in]"1"(in)  // matching constraint, or "D" works.
    );
    return out + in;
}

# gcc9.1 again.
safe_asm(int):
        movl    %edi, %edx      # tmp89, in    compiler-generated save of in
          # start inline asm
        inc %edi;
         mov %edi, %eax
          # end inline asm
        addl    %edx, %eax      # in, tmp88
        ret

Obviously "lea 1(%%rdi), %0" would avoid the problems by not modifying the input in the first place, and so would mov/inc. This is an artificial example that intentionally destroys an input.
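The read-write temporary alternative mentioned earlier can be sketched as follows. The function name safe_rmw is made up for illustration, `__asm__` is used because that spelling also works under -std=c99/c11, and the non-x86 branch is only a fallback so the sketch builds elsewhere.

```c
/* Sketch: the asm increments a private copy declared as a read-write
 * operand, so `in` is never modified behind the compiler's back. */
int safe_rmw(int in)
{
    int tmp = in;                  /* copy the asm is allowed to destroy */
#if defined(__x86_64__) || defined(__i386__)
    /* "+r": tmp is both read and written; the compiler picks the register */
    __asm__ ("incl %0" : "+r"(tmp) : : "cc");
#else
    tmp += 1;                      /* fallback for non-x86 targets */
#endif
    return tmp + in;               /* `in` still holds its original value */
}
```

Letting the compiler choose the register with "+r" instead of hard-coding EDI also removes the need for a dummy output variable.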


If the function does not inline and doesn't use the input variable after the asm statement, you typically get away with lying to the compiler, as long as it's a call-clobbered register.

It's not rare to find people that have written unsafe code that happens to work in the context they're using it in. It's also not rare for them to be convinced that simply testing it in that context with one compiler version/options is sufficient to verify its safety or correctness.

But that's not how asm works; the compiler trusts you to accurately describe the asm's behaviour, and simply does text substitution on the template part.

It would be a crappy missed optimization if gcc assumed that asm statements always destroyed their inputs. In fact, the same constraints that inline asm uses are (I think) used in the internal machine-description files that teach gcc about an ISA. (So destroyed inputs would be terrible for code-gen).

The whole design of GNU C inline asm is based around wrapping a single instruction, that's why even early-clobber for outputs isn't the default. You have to do that manually if necessary, if writing multiple instructions or a loop inside inline asm.


"a potential bug that the compiler hasn't been told that memory that these pointers are pointing at has been read and/or modified."

That's also correct. A register input operand does not imply that the pointed-to memory is also an input operand. In a function that can't inline, this can't actually cause problems, but as soon as you enable link-time optimization, cross-file inlining and inter-procedural optimization become possible.

There's an existing unanswered question, "Informing clang that inline assembly reads a particular region of memory". This Godbolt link shows some of the ways you can reveal this problem, e.g.

   arr[2] = 1;
   asm(...);
   arr[2] = 0;

If gcc assumes arr[2] isn't an input to the asm, only the arr address itself, it will do dead-store elimination and remove the =1 assignment. (Or look at it as reordering the store with the asm statement, then collapsing 2 stores to the same location).

An array is good because it shows that even "m"(*arr) doesn't work for a pointer, only an actual array. That input operand would only tell the compiler that arr[0] is an input, still not arr[2]. That's a good thing if that's all your asm reads, because it doesn't block optimization of other parts.
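One way to describe a whole region behind a pointer as an input is a dummy "m" operand whose type is an array, obtained by casting the pointer to pointer-to-array. The sketch below assumes a fixed region of 4 ints; load_third and demo are hypothetical names, and the non-x86 branch is only a fallback.

```c
/* Reads p[2] through a register pointer. The dummy "m" operand tells
 * the compiler that all 4 ints behind `p` are inputs to the asm. */
int load_third(const int *p)
{
    int out;
#if defined(__x86_64__) || defined(__i386__)
    __asm__ ("movl 8(%1), %0"
             : "=r"(out)
             : "r"(p),
               "m"(*(const int (*)[4])p));   /* dummy memory input */
#else
    out = p[2];                              /* fallback for non-x86 targets */
#endif
    return out;
}

int demo(void)
{
    int arr[4] = {0, 0, 0, 0};
    arr[2] = 1;                  /* must not be dropped as a dead store */
    int seen = load_third(arr);  /* may be inlined; the "m" operand still protects arr[2] */
    arr[2] = 0;
    return seen + arr[2];
}
```

With only "r"(p) and no memory operand or "memory" clobber, the compiler would be entitled to remove the arr[2] = 1 store once load_third is inlined into demo.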

For that memset example, to properly declare that the pointed-to memory is an output operand, cast the pointer to a pointer-to-array and dereference it, to tell gcc that an entire range of memory is the operand: *(char (*)[count])pointer. (You can leave the [] empty to specify an arbitrary-length region of memory accessed via this pointer.)

// correct version written by @MichaelPetch.  
void *memset(void *dest, int value, size_t count)
{
  void *tmp = dest;
  asm ("rep stosb    # mem output is %2"
     : "+D"(tmp), "+c"(count),       // tell the compiler we modify the regs
       "=m"(*(char (*)[count])tmp)   // dummy memory output
     : "a"(value)                    // EAX actually is read-only
     : // no clobbers
  );
  return dest;
}

Including an asm comment using the dummy operand lets us see how the compiler allocated it. We can see the compiler picks (%rdi) with AT&T syntax, so it is willing to use a register that's also an input/output operand.

With an early-clobber on the output it might have wanted to use another register, but without that it doesn't cost us anything to gain correctness.

With a void function that doesn't return the pointer (or after inlining into a function that doesn't use the return value), it doesn't have to copy the pointer arg anywhere before letting rep stosb destroy it.
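Applying the same rules to the swap_vbufs loop from the question might look like the sketch below. This is not the original author's code: the name swap_vbufs_asm and the parameters are hypothetical, AT&T syntax is used so it builds without -masm=intel, the direction flag is assumed clear (as the x86-64 System V ABI requires at function boundaries), and a "memory" clobber stands in for precise memory operands.

```c
#include <stddef.h>
#include <stdint.h>

void swap_vbufs_asm(volatile uint32_t *fb,
                    const volatile uint32_t *src,
                    const volatile uint32_t *cache,
                    size_t count)
{
    if (count == 0)
        return;                      /* the loop below runs at least once */
#if defined(__x86_64__) && !defined(__ILP32__)
    __asm__ volatile (
        "1:\n\t"
        "lodsl\n\t"                  /* eax = *src++ (via RSI) */
        "cmpl (%[cache]), %%eax\n\t"
        "jne 2f\n\t"
        "add $4, %%rdi\n\t"          /* unchanged pixel: skip the store */
        "jmp 3f\n"
        "2:\n\t"
        "stosl\n"                    /* changed pixel: *fb++ = eax (via RDI) */
        "3:\n\t"
        "add $4, %[cache]\n\t"
        "dec %[count]\n\t"
        "jnz 1b"
        : "+S"(src), "+D"(fb),       /* modified by lodsl / stosl */
          [cache] "+r"(cache),       /* compiler chooses these registers */
          [count] "+r"(count)
        :
        : "rax", "memory", "cc");    /* EAX scratch, buffers read/written */
#else
    for (size_t i = 0; i < count; i++) {   /* fallback for other targets */
        uint32_t px = src[i];
        if (px != cache[i])
            fb[i] = px;
    }
#endif
}
```

Every register the loop changes is declared as a read-write operand on a local copy, only the registers that lodsl and stosl require are fixed, and the clobber list covers what is left: the scratch register, the flags, and the memory behind the pointers.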
