Visual Studio 2010 - 2015 does not use ymm* registers for AVX optimization

Views: 115 | Published: 2020/5/21 21:21:32 | Tags: c++ visual-studio optimization visual-studio-2015 avx

Question

My laptop's CPU supports AVX (Advanced Vector Extensions) but not AVX2. AVX extends the 128-bit xmm* registers to 256-bit ymm* registers for floating-point arithmetic. However, in my tests no version of Visual Studio from 2010 to 2015 uses the ymm* registers under /arch:AVX optimization, although they do so under /arch:AVX2 optimization.

The following shows the disassembly of a simple for loop. The program was compiled with /arch:AVX in a release build, with all optimization options on.

    float a[10000], b[10000], c[10000];
    for (int x = 0; x < 10000; x++)
1000988F  xor         eax,eax  
10009891  mov         dword ptr [ebp-9C8Ch],ecx  
        c[x] = (a[x] + b[x])*b[x];
10009897  vmovups     xmm1,xmmword ptr c[eax]  
100098A0  vaddps      xmm0,xmm1,xmmword ptr c[eax]  
100098A9  vmulps      xmm0,xmm0,xmm1  
100098AD  vmovups     xmmword ptr c[eax],xmm0  
100098B6  vmovups     xmm1,xmmword ptr [ebp+eax-9C78h]  
100098BF  vaddps      xmm0,xmm1,xmmword ptr [ebp+eax-9C78h]  
100098C8  vmulps      xmm0,xmm0,xmm1  
100098CC  vmovups     xmmword ptr [ebp+eax-9C78h],xmm0  
100098D5  add         eax,20h  
100098D8  cmp         eax,9C40h  
100098DD  jl          ComputeTempo+67h (10009897h)  


    const int   winpts = (int)(window_size*sr+0.5);
100098DF  vxorps      xmm1,xmm1,xmm1  
100098E3  vcvtsi2ss   xmm1,xmm1,ecx

I have also verified that using the ymm* registers speeds up my program further without crashing. I did that with the intrinsics from immintrin.h, e.g. _mm256_mul_ps.
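
For reference, a hand-vectorized version of the loop above might look like the sketch below. It assumes an x86 target and the standard AVX intrinsics from immintrin.h; the function name mul_add_avx and the TARGET_AVX macro are illustrative and do not come from the original program. Running it still requires a CPU and an OS with AVX support.

```cpp
#include <immintrin.h>  // AVX intrinsics: __m256, _mm256_*_ps
#include <cassert>
#include <cstddef>

// GCC/Clang need AVX enabled for this function when -mavx is not given.
// MSVC accepts the intrinsics directly, so the macro expands to nothing.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Hypothetical helper: c[i] = (a[i] + b[i]) * b[i], eight floats per step.
TARGET_AVX
void mul_add_avx(const float* a, const float* b, float* c, std::size_t n) {
    assert(a && b && c);
    std::size_t i = 0;
    // Main loop: one 256-bit ymm register holds eight floats.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // unaligned load
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vr = _mm256_mul_ps(_mm256_add_ps(va, vb), vb);
        _mm256_storeu_ps(c + i, vr);         // unaligned store
    }
    // Scalar tail for the remaining n % 8 elements.
    for (; i < n; ++i) c[i] = (a[i] + b[i]) * b[i];
}
```

The unaligned load/store variants are used because nothing in the question guarantees 32-byte alignment of the arrays.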

Can any Microsoft compiler developer explain this? Or is this perhaps one of the reasons why Visual Studio generates slower code than the gcc/g++ compiler?

=============edited==============

The reason turns out to be a difference between running a 32-bit OS on a 32-bit machine and running a 32-bit OS on a 64-bit machine. In the latter case, some operating systems might not know that the ymm* registers exist and therefore do not preserve their upper halves properly during a context switch. So if ymm* registers are used on a 32-bit OS on a 64-bit machine and a context switch occurs, the upper halves may be silently corrupted when another program is also using ymm* registers. Visual Studio is somewhat conservative in this respect.
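
Whether the operating system preserves the ymm* state can be queried at run time. The sketch below assumes the conventional x86 check: CPUID leaf 1 reports AVX (ECX bit 28) and OSXSAVE (ECX bit 27), and bits 1 and 2 of XCR0 indicate that the OS saves the xmm and ymm state. The names os_supports_avx, cpuid_leaf1_ecx, read_xcr0 and X86_TARGET are illustrative and not part of the original post.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#define X86_TARGET 1
static std::uint32_t cpuid_leaf1_ecx() {
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    return static_cast<std::uint32_t>(regs[2]);
}
static std::uint64_t read_xcr0() { return _xgetbv(0); }
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      // __get_cpuid (GCC/Clang)
#define X86_TARGET 1
static std::uint32_t cpuid_leaf1_ecx() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return ecx;
}
static std::uint64_t read_xcr0() {
    std::uint32_t lo = 0, hi = 0;
    // XGETBV with ECX = 0 reads XCR0 into EDX:EAX.
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#endif

// Hypothetical helper: true only if the CPU has AVX and the OS has
// enabled saving and restoring of both xmm and ymm state.
bool os_supports_avx() {
#ifdef X86_TARGET
    const std::uint32_t ecx = cpuid_leaf1_ecx();
    const bool osxsave = ((ecx >> 27) & 1u) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx = ((ecx >> 28) & 1u) != 0;      // CPU implements AVX
    if (!osxsave || !avx) return false;            // XGETBV is unsafe without OSXSAVE
    return (read_xcr0() & 0x6u) == 0x6u;           // XCR0 bit 1 = xmm, bit 2 = ymm
#else
    return false;  // not an x86 target
#endif
}
```

A program that uses ymm* registers through intrinsics could call such a check once at startup and fall back to a 128-bit code path when it returns false.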

Recommended answer

I made a text file vec.cpp

//vec.cpp
void foo(float *a, float *b, float *c) {
    for (int i = 0; i < 10000; i++) c[i] = (a[i] + b[i])*b[i];
}

then opened a command prompt with the Visual Studio 2015 x86 x64 tools enabled and ran

cl /c /O2 /arch:AVX /FA vec.cpp

Looking at the file vec.asm, I see

$LL4@foo:
    vmovups ymm0, YMMWORD PTR [rax-32]
    lea rax, QWORD PTR [rax+64]
    vmovups ymm2, ymm0
    vaddps  ymm0, ymm0, YMMWORD PTR [rcx+rax-96]
    vmulps  ymm2, ymm0, ymm2
    vmovups YMMWORD PTR [r8+rax-96], ymm2
    vmovups ymm0, YMMWORD PTR [rax-64]
    vmovups ymm2, ymm0
    vaddps  ymm0, ymm0, YMMWORD PTR [rcx+rax-64]
    vmulps  ymm2, ymm0, ymm2
    vmovups YMMWORD PTR [r8+rax-64], ymm2
    sub rdx, 1
    jne SHORT $LL4@foo
    vzeroupper

The problem is that you are compiling in 32-bit mode. Using the same function as above but compiling in 32-bit mode, I get

$LL4@foo:
    lea eax, DWORD PTR [ebx+esi]
    lea ecx, DWORD PTR [ecx+32]
    lea esi, DWORD PTR [esi+32]
    vmovups xmm1, XMMWORD PTR [esi-48]
    vaddps  xmm0, xmm1, XMMWORD PTR [ecx-32]
    vmulps  xmm0, xmm0, xmm1
    vmovups XMMWORD PTR [edx+ecx-32], xmm0
    vmovups xmm1, XMMWORD PTR [esi-32]
    vaddps  xmm0, xmm1, XMMWORD PTR [eax]
    vmulps  xmm0, xmm0, xmm1
    vmovups XMMWORD PTR [eax+edx], xmm0
    sub edi, 1
    jne SHORT $LL4@foo
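
To confirm which mode a given build actually targets, a small compile-time check can help. The sketch below assumes only that pointer width distinguishes x86 from x64 builds and that the compiler predefines __AVX__ when AVX code generation is enabled (MSVC under /arch:AVX or higher, GCC/Clang under -mavx). The function names are illustrative.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: pointer width of the target,
// 32 for an x86 build and 64 for an x64 build.
constexpr std::size_t target_bits() { return sizeof(void*) * 8; }

// Hypothetical helper: reports whether this translation unit was
// compiled with AVX code generation enabled.
constexpr bool avx_enabled_at_compile_time() {
#if defined(__AVX__)
    return true;
#else
    return false;
#endif
}

// Prints a one-line summary of the build target.
void print_build_target() {
    std::printf("target: %u-bit, AVX codegen: %s\n",
                static_cast<unsigned>(target_bits()),
                avx_enabled_at_compile_time() ? "on" : "off");
}
```

Calling print_build_target() from both the 32-bit and the 64-bit build makes it easy to match each binary to the assembly listings above.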
