MSVC Inline ASM to GCC


Problem description

I'm trying to support both the MSVC and GCC compilers while updating this code base to work on GCC, but I'm unsure exactly how GCC's inline ASM works. I'm not great at translating ASM to C, or else I would just use C instead of ASM.

SLONG Div16(signed long a, signed long b)
{
    signed long v;
#ifdef __GNUC__ // GCC doesnt work.
__asm() {
#else // MSVC
__asm {
#endif
        mov edx, a
        mov ebx, b          
        mov eax, edx           
        shl eax, 16          
        sar edx, 16            
        idiv ebx              
        mov v, eax              
    }
    return v;
}

signed long ROR13(signed long val)
{
    _asm{ 
        ror val, 13
    }
}

I assume ROR13 works something like (val << 13) | (val >> (32 - 13)) but the code doesn't produce the same output.

What is the proper way to translate this inline ASM to GCC, and/or what's the C translation of this code?

Solution

GCC uses a completely different syntax for inline assembly than MSVC does, so it's quite a bit of work to maintain both forms. It's not an especially good idea, either. There are many problems with inline assembly. People often use it because they think it'll make their code run faster, but it usually has quite the opposite effect. Unless you're an expert in both assembly language and the compiler's code-generation strategies, you are far better off letting the compiler's optimizer generate the code.

You do have to be a bit careful when writing the rotate in C, though. ROR rotates to the right, so the shifts go the opposite way from the expression in the question, and right-shifting a negative signed value is implementation-defined in C. If you care about portability, cast the value to the equivalent unsigned type:

#include <limits.h>   // for CHAR_BIT

signed long ROR13(signed long val)
{
    return ((unsigned long)val >> 13) |
           ((unsigned long)val << ((sizeof(val) * CHAR_BIT) - 13));
}

(See also Best practices for circular shift (rotate) operations in C++).

This will have the same semantics as your original code: ROR val, 13. In fact, MSVC will generate precisely that object code, as will GCC. (Clang, interestingly, will do ROL val, 19, which produces the same result, given the way that rotations work. ICC 17 generates an extended shift instead: SHLD val, val, 19. I'm not sure why; maybe that's faster than rotation on certain Intel processors, or maybe it's the same on Intel but slower on AMD.)
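
As a rough sketch of why the expression in the question gave different output, the two helpers below put a right rotate next to the left rotate that the question assumed. The names ror32 and rol32 and the use of uint32_t are assumptions made for illustration; they are not part of the original code.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by `count` bits (hypothetical helper name). */
static uint32_t ror32(uint32_t value, unsigned count)
{
    count &= 31;                       /* keep the shift count in 0..31 */
    return (value >> count) | (value << ((32 - count) & 31));
}

/* Rotate left: the direction the expression in the question performs. */
static uint32_t rol32(uint32_t value, unsigned count)
{
    count &= 31;                       /* keep the shift count in 0..31 */
    return (value << count) | (value >> ((32 - count) & 31));
}
```

Masking the count keeps both shifts below the type width, so a count of 0 is well defined. Rotating right by 13 and rotating left by 19 give the same result, which is consistent with Clang choosing ROL val, 19.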

To implement Div16 in pure C, you want:

signed long Div16(signed long a, signed long b)
{
    return ((long long)a << 16) / b;
}
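
To make the intent of Div16 a little more concrete: it looks like a 16.16 fixed-point division, where the dividend is scaled by 2^16 before dividing. The sketch below is one way to write that with fixed-width types. The name div16_c, the int32_t/int64_t types, and the multiplication by 65536 (used instead of a left shift so that negative dividends stay well defined) are choices made for this example, not something taken from the original code.

```c
#include <stdint.h>

/* 16.16 fixed-point division: returns (a / b) scaled by 2^16.
   The caller must ensure b != 0 and that the result fits in 32 bits. */
static int32_t div16_c(int32_t a, int32_t b)
{
    /* Multiply instead of shifting so that a negative `a` is well defined. */
    return (int32_t)(((int64_t)a * 65536) / b);
}
```

For example, dividing 3.0 (0x30000) by 2.0 (0x20000) yields 0x18000, which is 1.5 in 16.16 format.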

On a 64-bit architecture that can do native 64-bit division (and assuming long is still a 32-bit type, as it is on Windows), this will be transformed into:

movsxd  rax, a   # sign-extend from 32 to 64, if long wasn't already 64-bit
shl     rax, 16
cqo              # sign-extend rax into rdx:rax
movsxd  rcx, b
idiv    rcx      # or  idiv b  if the inputs were already 64-bit
ret

Unfortunately, on 32-bit x86, the code isn't nearly as good. Compilers emit a call to an internal library function that performs the extended 64-bit division, because they can't prove that a single 64b/32b => 32b idiv instruction won't fault. (It raises a #DE exception if the quotient doesn't fit in eax, rather than just truncating.)

In other words, transforming:

int32_t Divide(int64_t a, int32_t b)
{
    return (a / b);
}

into:

mov   eax, a_low
mov   edx, a_high
idiv  b                 # will fault if a/b is outside [-2^31, 2^31-1]
ret

is not a legal optimization: the compiler is not allowed to emit this code. The language standard says that a 64/32 division is promoted to a 64/64 division, which always produces a 64-bit result. That you later cast or coerce that 64-bit result to a 32-bit value is irrelevant to the semantics of the division operation itself. Faulting for some combinations of a and b would violate the as-if rule, unless the compiler can prove that those combinations of a and b are impossible. (For example, if b were known to be greater than 1<<16, this could be a legal optimization for code such as a = (int32_t)input; a <<= 16; but even though the single idiv would then produce the same behaviour as the C abstract machine for all inputs, gcc and clang currently don't do that optimization.)
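
The fault condition can be illustrated in plain C. The following is only a sketch: div16_quotient_fits is a hypothetical helper that computes the full 64-bit quotient and reports whether it would fit in a 32-bit register, which is the property the compiler would have to prove before it could use a single idiv.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when (a * 2^16) / b fits in 32 bits, i.e. when a single
   64b/32b idiv would be expected not to raise #DE for these inputs. */
static bool div16_quotient_fits(int32_t a, int32_t b)
{
    if (b == 0)
        return false;                  /* division by zero also raises #DE */
    int64_t quotient = ((int64_t)a * 65536) / b;
    return quotient >= INT32_MIN && quotient <= INT32_MAX;
}
```

With a = 1<<16 and b = 1, for instance, the quotient is 2^32 and does not fit, so the inline-assembly version of Div16 would fault on those inputs while the pure C version would silently truncate.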


There simply isn't a good way to override the rules imposed by the language standard and force the compiler to emit the desired object code. MSVC doesn't offer an intrinsic for it. (There is a Windows API function, MulDiv, but it's not fast; it just uses inline assembly for its own implementation, and it has a bug in a certain case (https://blogs.msdn.microsoft.com/oldnewthing/20120514-00/?p=7633/) that is now cemented by the need for backwards compatibility.) You essentially have no choice but to resort to assembly, either inline or linked in from an external module.

So, you get into ugliness. It looks like this:

signed long Div16(signed long a, signed long b)
{
#ifdef __GNUC__     // A GNU-style compiler (e.g., GCC, Clang, etc.)
    signed long quotient;
    signed long remainder;  // (unused, but necessary to signal clobbering)
    __asm__("idivl  %[divisor]"
           :          "=a"  (quotient),
                      "=d"  (remainder)
           :           "0"  ((unsigned long)a << 16),
                       "1"  (a >> 16),
             [divisor] "rm" (b)
           : 
           );
    return quotient;
#elif defined(_MSC_VER)  // A Microsoft-style compiler (i.e., MSVC)
    __asm
    {
        mov  eax, DWORD PTR [a]
        mov  edx, eax
        shl  eax, 16
        sar  edx, 16
        idiv DWORD PTR [b]
        // leave result in EAX, where it will be returned
    }
#else
    #error "Unsupported compiler"
#endif
}

This results in the desired output on both Microsoft and GNU-style compilers.

Well, mostly. For some reason, when you use the rm constraint, which gives the compiler the freedom to choose whether to treat the divisor as a memory operand or to load it into a register, Clang generates worse object code than if you just use r (which forces it to load the divisor into a register). This doesn't affect GCC or ICC. If you care about the quality of output on Clang, you'll probably just want to use r, since this will give equally good object code on all compilers.
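
For reference, here is a sketch of what the GNU-style branch might look like with the r constraint. Several details are assumptions added for this example rather than part of the answer's code: the name div16_asm, the fixed-width int32_t types (so that idivl sees 32-bit operands on both 32-bit and 64-bit x86), the explicit "cc" clobber, and the plain C fallback for other targets.

```c
#include <stdint.h>

/* 16.16 fixed-point division using a single idiv on x86 (GNU-style asm).
   Faults with #DE if b == 0 or the quotient does not fit in 32 bits. */
static int32_t div16_asm(int32_t a, int32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int32_t quotient;
    int32_t remainder;  /* unused, but tells the compiler EDX is overwritten */
    __asm__("idivl  %[divisor]"
            : "=a" (quotient),
              "=d" (remainder)
            : "0" ((int32_t)((uint32_t)a << 16)),  /* low half in EAX  */
              "1" (a >> 16),                       /* high half in EDX */
              [divisor] "r" (b)                    /* always a register */
            : "cc");
    return quotient;
#else
    /* Fallback for other targets: the pure C form. */
    return (int32_t)(((int64_t)a * 65536) / b);
#endif
}
```

Called as div16_asm(3 << 16, 2 << 16), this should return 0x18000, matching the pure C version for inputs whose quotient fits in 32 bits.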

Live Demo on Godbolt Compiler Explorer

(Note: GCC uses the SAL mnemonic in its output, instead of the SHL mnemonic. These are identical instructions—the difference only matters for right shifts—and all sane assembly programmers use SHL. I have no idea why GCC emits SAL, but you can just convert it mentally into SHL.)
