GCC inline assembly: what's wrong with the dynamically allocated register `r` in an input operand?

Problem description

While testing GCC inline assembly, I use the test function to display a character on the screen in the Bochs emulator. The code runs in 32-bit protected mode:

test() {
    char ch = 'B';
    __asm__ ("mov $0x10, %%ax\n\t" 
                "mov %%ax, %%es\n\t"
                "movl $0xb8000, %%ebx\n\t"
                "mov $0x04, %%ah\n\t" 
                "mov %0, %%al\n\t" 
                "mov %%ax, %%es: ((80 * 3 + 40) * 2)(%%ebx)\n\t" 
                ::"r"(ch):);
}

The red character that appears on the screen is not B. However, when I change the input constraint from r to c, so that the last line of the code above becomes ::"c"(ch):);, the character 'B' is displayed correctly.

What's the difference? I accessed the video memory through the data segment directly after the computer entered into protected mode.

I traced the assembly code and found that, when the r constraint is used, the instruction is assembled as mov al, al. At that point ax holds 0x0010, so al is 0x10, which explains the output. But why did the compiler choose the al register? Isn't it supposed to choose a register that hasn't been used before? Adding a clobber list solved the problem.

Solution

As @MichaelPetch commented, you can use 32-bit addresses to access whatever memory you want from C. The asm gcc emits assumes a flat memory space. For example, it assumes that it can copy esp to edi and use rep stos to zero some stack memory, which requires that %es has the same base as %ss.

I'd guess that the best solution is not to use any inline asm, but instead just use a global constant pointer into video memory, e.g.

// pointer is constant, but points to non-const memory
uint16_t *const vga_base = (uint16_t*)0xb8000;   // + whatever was in your segment

// offsets are scaled by 2.  Do some casting if you want the address math to treat offsets as byte offsets
void store_in_flat_memory(unsigned char c, uint32_t offset) {
  vga_base[offset] = 0x0400U | c;            // it matters that c is unsigned, so it zero-extends instead of sign-extending
}
    movzbl  4(%esp), %eax       # c, c
    movl    8(%esp), %edx       # offset, offset
    orb     $4, %ah   #, tmp95         # Super-weird, wtf gcc.  We get this even for -mtune=core2, where it causes a partial-register stall
    movw    %ax, 753664(%edx,%edx)  # tmp95, *_3   # the addressing mode scales the offset by two (sizeof(uint16_t)), by using it as base and index
    ret

Output from gcc 6.1 on Godbolt, with -O3 -m32.

Without the const, code like vga_base[10] = 0x4 << 8 | 'A'; would have to load the vga_base global and then offset from it. With the const, &vga_base[10] is a compile-time constant.
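The flat-memory approach can be tried out in user space by pointing the same code at an ordinary array instead of 0xb8000. The sketch below is only an illustration under that assumption: the names vga_cell, vga_put, fake_vram, VGA_COLS and VGA_ROWS are hypothetical and do not come from the question or the answer.

```c
#include <assert.h>
#include <stdint.h>

enum { VGA_COLS = 80, VGA_ROWS = 25 };

/* Hypothetical helper: builds one 16-bit text cell,
 * attribute in the high byte, character in the low byte. */
static inline uint16_t vga_cell(unsigned char ch, unsigned char attr) {
    return (uint16_t)((unsigned)attr << 8 | ch);
}

/* Hypothetical helper: stores one cell at (row, col).
 * On the real machine `base` would be (volatile uint16_t *)0xb8000;
 * taking it as a parameter lets the same code run against an array. */
static inline void vga_put(volatile uint16_t *base, unsigned row, unsigned col,
                           unsigned char ch, unsigned char attr) {
    assert(row < VGA_ROWS && col < VGA_COLS);
    base[row * VGA_COLS + col] = vga_cell(ch, attr);
}

/* Stand-in for video memory so the sketch runs outside protected-mode code. */
static uint16_t fake_vram[VGA_ROWS * VGA_COLS];
```

Calling vga_put(fake_vram, 3, 40, 'B', 0x04) stores 0x0442 (decimal 1090) at cell index 80 * 3 + 40. With the real base address, that cell sits at byte address 0xb8000 + (80 * 3 + 40) * 2 = 754224, which is the constant that shows up in the compiler output in this answer.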


If you really want a segment:

Since you can't leave %es modified, you need to save/restore it. This is another reason to avoid using it in the first place. If you really want a special segment for something, set up %fs or %gs once and leave them set, so it doesn't affect the normal operation of any instructions that don't use a segment override.

There is built-in syntax to use %fs or %gs without inline asm, for thread-local variables. You might be able to take advantage of it to avoid inline asm altogether.

If you're using a custom segment, you could make its base address non-zero, so you don't need to add 0xb8000 yourself. However, Intel CPUs optimize for the flat memory case, so address generation using a non-zero segment base is a couple of cycles slower, IIRC.

I did find a request for gcc to allow segment overrides without inline asm, and a question about adding segment support to gcc. Currently you can't do that.


Doing it manually in asm, with a dedicated segment

To look at the asm output, I put it on Godbolt with the -mx32 ABI, so args are passed in registers, but addresses don't need to be sign-extended to 64 bits. (I wanted to avoid the noise of loading args from the stack for -m32 code. The -m32 asm for protected mode will look similar.)

void store_in_special_segment(unsigned char c, uint32_t offset) {
    char *base = (char*)0xb8000;               // sizeof(char) = 1, so address math isn't scaled by anything

    // let the compiler do the address math at compile time, instead of forcing one 32bit constant into a register, and another into a disp32
    char *dst = base+offset;               // not a real address, because it's relative to a special segment.  We're using a C pointer so gcc can take advantage of whatever addressing mode it wants.
    uint16_t val = (uint32_t)c | 0x0400U;  // it matters that c is unsigned, so it zero-extends

    asm volatile ("movw  %[val], %%fs: %[dest]\n"
         : 
         : [val] "ri" (val),  // register or immediate
           [dest] "m" (*dst)
         : "memory"   // we write to something that isn't an output operand
    );
}
    movzbl  %dil, %edi        # dil is the low 8 of %edi (AMD64-only, but 32bit code prob. wouldn't put a char there in the first place)
    orw     $1024, %di        #, val   # gcc causes an LCP stall, even with -mtune=haswell, and with gcc 6.1
    movw  %di, %fs: 753664(%esi)    # val, *dst_2

void test_const_args(void) {
    uint32_t offset = (80 * 3 + 40) * 2;
    store_in_special_segment('B', offset);
}
    movw  $1090, %fs: 754224        #, MEM[(char *)754224B]

void test_const_offset(char ch) {
    uint32_t offset = (80 * 3 + 40) * 2;
    store_in_special_segment(ch, offset);
}
    movzbl  %dil, %edi  # ch, ch
    orw     $1024, %di        #, val
    movw  %di, %fs: 754224  # val, MEM[(char *)754224B]

void test_const_char(uint32_t offset) {
    store_in_special_segment('B', offset);
}
    movw  $1090, %fs: 753664(%edi)  #, *dst_4

So this code gets gcc to do an excellent job at using an addressing mode to do the address math, and do as much as possible at compile time.
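Both store functions above take the character as unsigned char, and the comments note that this matters. A minimal sketch of why, assuming a two's complement target where signed char is 8 bits: the function names cell_unsigned, cell_signed and check_cells are hypothetical and only serve the comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned parameter: the character zero-extends before the OR,
 * so the attribute byte 0x04 survives. */
static uint16_t cell_unsigned(unsigned char c) {
    return (uint16_t)(0x0400U | c);
}

/* Signed parameter: a negative value sign-extends before the OR
 * and fills the attribute byte with ones. */
static uint16_t cell_signed(signed char c) {
    return (uint16_t)(0x0400U | c);
}

/* Small self-check showing where the two versions diverge. */
static void check_cells(void) {
    assert(cell_unsigned('B') == cell_signed('B')); /* plain ASCII: identical */
    assert(cell_unsigned(0xB0) == 0x04B0);          /* attribute preserved */
    assert(cell_signed(-80) == 0xFFB0);             /* -80 is 0xB0 as a byte */
}
```

For plain ASCII the two agree, so a bug of this kind would only appear with characters at or above 0x80, such as the box-drawing characters of the VGA text font.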


Segment register

If you do want to modify a segment register for every store, keep in mind that it's slow: Agner Fog's insn tables stop including mov sr, r after Nehalem, but on Nehalem it's a 6 uop instruction that includes 3 load uops (from the GDT I assume). It has a throughput of one per 13 cycles. Reading a segment register is fine (e.g. push sr or mov r, sr). pop sr is even a bit slower.

I'm not even going to write code for this, because it's such a bad idea. Make sure you use clobber constraints to let the compiler know about every register you step on, or you will have hard-to-debug errors where surrounding code stops working.
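The clobber rule also explains the symptom in the question: the compiler does not parse the asm template, so without a clobber list it is free to place an "r" input in a register that the template overwrites, such as al. The sketch below is a user-space illustration of that rule, not the original kernel code: it builds the same attribute and character word in %ax but does not touch segment registers or video memory. The names make_cell and demo, the use of the "q" constraint (to get a register that has a byte form in 32-bit code), and the portable fallback for non-x86 targets are all choices made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Builds the 16-bit cell (attribute 0x04 in the high byte) in %ax,
 * then copies it to a compiler-chosen output register. */
static uint16_t make_cell(unsigned char ch) {
#if defined(__i386__) || defined(__x86_64__)
    uint16_t cell;
    __asm__ ("movb $0x04, %%ah\n\t"   /* attribute into the high byte */
             "movb %1, %%al\n\t"      /* character into the low byte */
             "movw %%ax, %0"          /* hand the result to the compiler */
             : "=r"(cell)
             : "q"(ch)                /* byte-addressable register */
             : "eax");                /* tells gcc the template overwrites eax */
    return cell;
#else
    /* Portable fallback so the sketch still builds on other targets. */
    return (uint16_t)(0x0400U | ch);
#endif
}

/* Usage: a red 'B' on black is 0x0442. */
static void demo(void) {
    assert(make_cell('B') == 0x0442);
}
```

Because eax is listed as clobbered, the compiler has to keep both operands out of it, so the input can no longer end up in al the way it did in the question.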

See the x86 tag wiki for GNU C inline asm info.
