How can I instruct the MSVC compiler to use a 64bit/32bit division instead of the slower 128bit/64bit division?


Problem description

How can I tell the MSVC compiler to use the 64bit/32bit division operation to compute the result of the following function for the x86-64 target:

#include <stdint.h> 

uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
  if (a > b)
        return ((uint64_t)b<<32) / a;   // The cast is required: shifting a 32-bit b left by 32 is undefined
  else
        return uint32_t(-1);
}

I would like the code, when the if statement is true, to compile to use the 64bit/32bit division operation, e.g. something like this:

; Assume arguments on entry are: Dividend in EDX, Divisor in ECX
mov edx, edx  ;A dummy instruction to indicate that the dividend is already where it is supposed to be
xor eax,eax
div ecx   ; EAX = EDX:EAX / ECX

...however the x64 MSVC compiler insists on using the 128bit/64bit div instruction, such as:

mov     eax, edx
xor     edx, edx
shl     rax, 32                             ; Scale up the dividend
mov     ecx, ecx
div rcx   ;RAX = RDX:RAX / RCX

See: https://www.godbolt.org/z/VBK4R7 (see footnote 1)

According to the answer to this question, the 128bit/64bit div instruction is not faster than the 64bit/32bit div instruction.

This is a problem because it unnecessarily slows down my DSP algorithm which makes millions of these scaled divisions.

I tested this optimization by patching the executable to use the 64bit/32bit div instruction: performance improved by 28%, measured from two timestamps taken with the rdtsc instruction.

(Editor's note: presumably on some recent Intel CPU. AMD CPUs don't need this micro-optimization, as explained in the linked Q&A.)

Solution

No current compilers (gcc/clang/ICC/MSVC) will do this optimization from portable ISO C source, even if you let them prove that b < a so the quotient will fit in 32 bits. (For example with GNU C if(b>=a) __builtin_unreachable(); on Godbolt). This is a missed optimization; until that's fixed, you have to work around it with intrinsics or inline asm.
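The hint mentioned above can be written out as follows. This is only a sketch, assuming a GNU C compatible compiler (gcc or clang) for __builtin_unreachable; the name scaled_div_hinted is made up for illustration. It computes the right result, but as stated above, compilers still emit the wide division for it.

```c
#include <stdint.h>

/* Hypothetical helper; requires GNU C for __builtin_unreachable.
   The caller must guarantee b < a, otherwise behavior is undefined. */
static uint32_t scaled_div_hinted(uint32_t a, uint32_t b)
{
    if (b >= a)
        __builtin_unreachable();   /* promise to the compiler: b < a */

    /* Since b < a, (b << 32) / a is below 2^32 and fits in 32 bits. */
    return (uint32_t)(((uint64_t)b << 32) / a);
}
```

For example, scaled_div_hinted(5, 4) evaluates (4 << 32) / 5, which is 0xCCCCCCCC.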

(Or use a GPU or SIMD instead; if you have the same divisor for many elements see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly.)


_udiv64 is available starting in Visual Studio 2019 RTM.

In C mode (-TC) it's apparently always defined. In C++ mode, you need to #include <immintrin.h> (or intrin.h), as per the Microsoft docs.

https://godbolt.org/z/vVZ25L (Or on Godbolt.ms, because recent MSVC on the main Godbolt site is not working; see footnote 1.)

#include <stdint.h>
#include <immintrin.h>       // defines the prototype

// pre-condition: a > b else 64/32-bit division overflows
uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
    uint32_t remainder;
    uint64_t d = ((uint64_t) b) << 32;
    return _udiv64(d, a, &remainder);
}

int main() {
    uint32_t c = ScaledDiv(5, 4);
    return c;
}

_udiv64 will produce 64/32 div. The two shifts left and right are a missed optimization.

;; MSVC 19.20 -O2 -TC
a$ = 8
b$ = 16
ScaledDiv PROC                                      ; COMDAT
        mov     edx, edx
        shl     rdx, 32                             ; 00000020H
        mov     rax, rdx
        shr     rdx, 32                             ; 00000020H
        div     ecx
        ret     0
ScaledDiv ENDP

main    PROC                                            ; COMDAT
        xor     eax, eax
        mov     edx, 4
        mov     ecx, 5
        div     ecx
        ret     0
main    ENDP

So we can see that MSVC doesn't do constant-propagation through _udiv64, even though in this case it doesn't overflow and it could have compiled main to just mov eax, 0ccccccccH / ret.
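One way to keep the original function's overflow guard while still requesting the narrow division is a small wrapper. The sketch below is an assumption-laden illustration, not the answer's code: the names scaled_div_checked and SCALED_DIV_USE_UDIV64 are hypothetical, the _MSC_VER >= 1920 check is assumed to correspond to Visual Studio 2019, and the GNU inline-asm branch is a swapped-in alternative for gcc/clang on x86. On any other target it falls back to plain C division.

```c
#include <stdint.h>

/* Assumption: _MSC_VER 1920 and later means Visual Studio 2019 or newer. */
#if defined(_MSC_VER) && !defined(__clang__) && _MSC_VER >= 1920 && \
    (defined(_M_X64) || defined(_M_IX86))
#include <immintrin.h>              /* prototype of _udiv64 */
#define SCALED_DIV_USE_UDIV64 1     /* hypothetical macro for this sketch */
#endif

/* Hypothetical helper: returns (b << 32) / a when b < a, else UINT32_MAX.
   The guard ensures the quotient fits in 32 bits, so the 64/32-bit div
   cannot fault; it also rules out a == 0. */
static uint32_t scaled_div_checked(uint32_t a, uint32_t b)
{
    if (b >= a)
        return UINT32_MAX;
#if defined(SCALED_DIV_USE_UDIV64)
    unsigned int rem;
    return _udiv64((uint64_t)b << 32, a, &rem);
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    uint32_t quot, rem;
    /* EDX:EAX = b:0 is the 64-bit dividend; divl leaves the quotient in EAX. */
    __asm__("divl %[d]"
            : "=a"(quot), "=d"(rem)
            : "a"(0u), "d"(b), [d] "rm"(a)
            : "cc");
    return quot;
#else
    return (uint32_t)(((uint64_t)b << 32) / a);   /* portable fallback */
#endif
}
```

With this sketch, scaled_div_checked(5, 4) gives 0xCCCCCCCC, the same constant main could have been folded to, and scaled_div_checked(4, 5) returns UINT32_MAX instead of overflowing. Like _udiv64 itself, the inline-asm branch is opaque to the optimizer, so it also blocks constant propagation.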


UPDATE #2: https://godbolt.org/z/n3Dyp-
Added a solution for the Intel C++ Compiler. It is less efficient, and because it is inline asm it will defeat constant-propagation.

#include <stdio.h>
#include <stdint.h>

__declspec(regcall, naked) uint32_t ScaledDiv(uint32_t a, uint32_t b) 
{
    __asm mov edx, eax
    __asm xor eax, eax
    __asm div ecx
    __asm ret
    // implicit return of EAX is supported by MSVC, and hopefully ICC
    // even when inlining + optimizing
}

int main()
{
    uint32_t a = 3 , b = 4, c = ScaledDiv(a, b);
    printf( "(%u << 32) / %u = %u\n", a, b, c);
    uint32_t d = ((uint64_t)a << 32) / b;
    printf( "(%u << 32) / %u = %u\n", a, b, d);
    return c != d;
}


Footnote 1: The non-WINE MSVC compilers on Matt Godbolt's main site are temporarily(?) gone. Microsoft runs https://www.godbolt.ms/ to host the recent MSVC compilers on real Windows, and normally the main Godbolt.org site relays to that for MSVC.

It seems godbolt.ms will generate short links, but not expand them again! Full links are better anyway for their resistance to link-rot.
