Visual Studio C++ compiler generates 3x slower code when changing completely unrelated code


Problem description


I have a nested for loop which generates the following assembly:

# branch target labels manually added for readability
002E20F8  mov         ebx,esi  
002E20FA  mov         dword ptr [ebp-10h],3B9ACA00h  
002E2101  sub         ebx,edi  
002E2103  add         ebx,7  
002E2106  shr         ebx,3  
002E2109  nop         dword ptr [eax]  
  outer_loop:
002E2110  xor         eax,eax  
002E2112  xor         ecx,ecx  
002E2114  cmp         edi,esi  
002E2116  mov         edx,ebx  
002E2118  cmova       edx,eax  
002E211B  mov         eax,edi  
002E211D  test        edx,edx 
002E211F  je          main+107h (02E2137h)  ;end_innerloop

  inner_loop:           
002E2121  movsd       xmm0,mmword ptr [eax] 
002E2125  inc         ecx                     ; inc/addsd swapped
002E2126  addsd       xmm0,mmword ptr [k]   
002E212B  add         eax,8  
002E212E  movsd       mmword ptr [k],xmm0  
002E2133  cmp         ecx,edx  
002E2135  jne         main+0F1h (02E2121h)  ;inner_loop
  end_innerloop:        
002E2137  sub         dword ptr [ebp-10h],1  
002E213B  jne         main+0E0h (02E2110h)   ;outer_loop

If I change a line of code before the nested for loop to simply declare an int and then print it out after the for loop, the compiler pulls the store/reload of k out of the loop.

The first version of the question described this as "generate the instructions in a slightly different order". (editor's note: perhaps I should leave this analysis / correction for the answer?)

003520F8  mov         ebx,esi  
003520FA  mov         dword ptr [ebp-10h],3B9ACA00h  
00352101  sub         ebx,edi  
00352103  add         ebx,7  
00352106  shr         ebx,3  
00352109  nop         dword ptr [eax]  
  outer_loop:
00352110  xor         eax,eax  
00352112  xor         ecx,ecx  
00352114  cmp         edi,esi  
00352116  mov         edx,ebx  
00352118  cmova       edx,eax  
0035211B  mov         eax,edi  
0035211D  test        edx,edx  
0035211F  je          main+107h (0352137h) ;end_innerloop

00352121  movsd       xmm0,mmword ptr [k]    ; load of k hoisted out of the loop.  Strangely not optimized to xorpd xmm0,xmm0

  inner_loop:
00352126  addsd       xmm0,mmword ptr [eax]
0035212A  inc         ecx  
0035212B  add         eax,8  
0035212E  cmp         ecx,edx  
00352130  jne         main+0F6h (0352126h)  ;inner_loop

00352132  movsd       mmword ptr [k],xmm0     ; movsd in different place.

  end_innerloop:
00352137  sub         dword ptr [ebp-10h],1  
0035213B  jne         main+0E0h (0352110h)  ;outer_loop

This second arrangement by the compiler is 3x faster. I'm slightly shocked by this. Does anyone know what is going on?

This was compiled with Visual Studio 2015.

Compiler flags (I can add more if requested):

Optimization: Maximize Speed /O2

The code:

#include <iostream>
#include <vector>
#include "Stopwatch.h"

static constexpr int N = 1000000000;

int main()
{
    std::vector<double> buffer;

    buffer.resize(10);

    for (auto& i : buffer)
    {
        i = 1e-100;
    }

    double k = 0;
    int h = 0; // removing this line and swapping the lines std::cout << "time = "... results in 3x slower code??!!

    Stopwatch watch;

    for (int i = 0; i < N; i++)
    {
        for (auto& j : buffer)
        {
            k += j;
        }
    }

    //std::cout << "time = " << watch.ElapsedMilliseconds() << " / " << k << std::endl;
    std::cout << "time = " << watch.ElapsedMilliseconds() << " / " << k << " / " << h << std::endl;

    std::cout << "Done...";
    std::getchar();

    return EXIT_SUCCESS;
}

Stopwatch class:

#pragma once

#include <chrono>

class Stopwatch
{
private:
    typedef std::chrono::high_resolution_clock clock;
    typedef std::chrono::microseconds microseconds;
    typedef std::chrono::milliseconds milliseconds;

    clock::time_point _start;

public:
    Stopwatch()
    {
        Restart();
    }

    void Restart()
    {
        _start = clock::now();
    }

    double ElapsedMilliseconds()
    {
        return ElapsedMicroseconds() * 1E-3;
    }

    double ElapsedSeconds()
    {
        return ElapsedMicroseconds() * 1E-6;
    }

    Stopwatch(const Stopwatch&) = delete;
    Stopwatch& operator=(const Stopwatch&) = delete;

private:
    double ElapsedMicroseconds()
    {
        return static_cast<double>(std::chrono::duration_cast<microseconds>(clock::now() - _start).count());
    }
};

Solution

After editing the question to fix your confusingly-placed line-breaks, and add branch-target labels in front of the addresses in the jcc instructions to figure out what the code was actually doing, it became clear that the loops were significantly different. The movsd isn't reordered within the loop; it's outside the loop.

Instead of leaving that stuff in the question and correcting it in the answer, I decided to edit the question and talk about it here. I figured the code blocks were long enough that future readers would just get bogged down trying to keep track of 4 versions of the code, and it wasn't something that would help people with the same question find it with a search engine.


The fast version keeps k in a register (xmm0), while the slow version reloads/stores it every iteration. This is often a sign that the compiler's alias analysis failed to prove that things couldn't overlap.

It's not the extra stores and loads themselves that hurt, it's the fact that it lengthens the loop-carried dependency chain by the store-forwarding latency from the store in one iteration to the load in the next iteration. Store-forwarding latency is something like 6 cycles on modern Intel CPUs, vs. 3 cycles for addsd (on Haswell for example). So that perfectly explains the factor of 3 speedup:

  • 9 cycles per iteration when the loop-carried dependency chain is addsd + store-forwarding
  • 3 cycles per iteration when the loop-carried dependency chain is just addsd

See http://agner.org/optimize/ for instruction tables and microarch details, and the other links in the x86 tag wiki.
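The two dependency chains can be sketched in code. This is my own compiler-agnostic illustration, not the questioner's program: the `volatile` forces the accumulator through memory on every iteration, mimicking the slow version's store/reload round-trip, while the plain `double` can stay in a register.

```cpp
#include <vector>

// Slow shape: every += goes load -> addsd -> store, so the next iteration's
// load waits on store-forwarding (~9 cycles per iteration on Haswell).
static double sum_through_memory(const std::vector<double>& buf, int reps)
{
    volatile double k = 0;          // volatile forces the memory round-trip
    for (int i = 0; i < reps; i++)
        for (double j : buf)
            k = k + j;
    return k;
}

// Fast shape: the accumulator lives in a register, so the loop-carried
// chain is just addsd (~3 cycles per iteration).
static double sum_in_register(const std::vector<double>& buf, int reps)
{
    double k = 0;
    for (int i = 0; i < reps; i++)
        for (double j : buf)
            k += j;
    return k;
}
```

Both functions compute the same sum; only the length of the loop-carried chain differs, which is exactly the 9-vs-3-cycle gap described above.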


IDK how MSVC didn't manage to prove that k doesn't overlap with anything, because it's a local whose address doesn't escape the function. (Its address isn't even taken). MSVC does a horrible job there. It should also just xorps xmm0,xmm0 to zero it ahead of the loop, instead of loading some zeroed memory. I don't even see where it zeros any memory; I guess that's not the asm for the whole function.

If you'd compiled with MSVC's equivalent of -ffast-math (/fp:fast), it could have vectorized the reduction (with addpd), hopefully with multiple accumulators. Although with such a tiny vector that you're looping over many times, the non-multiple-of-4 element count is moderately inconvenient. Still, loop overhead isn't the issue here; the loop-carried dependency chain dominates even when k is kept in a register, since your code only uses one accumulator. One addsd per 3 clocks leaves tons of time for other insns to run.
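To illustrate what "multiple accumulators" means, here is a hand-written sketch (my own names and structure, not anything MSVC emits): four independent chains let four adds be in flight at once instead of serializing on one. Note that it implicitly reassociates the FP additions, which is precisely why a compiler needs fast-math-style permission to do this on its own.

```cpp
#include <cstddef>
#include <vector>

static double sum_multi_acc(const std::vector<double>& buf)
{
    double a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    std::size_t i = 0;
    // Main loop: four independent dependency chains, one addsd each.
    for (; i + 4 <= buf.size(); i += 4) {
        a0 += buf[i];
        a1 += buf[i + 1];
        a2 += buf[i + 2];
        a3 += buf[i + 3];
    }
    // The awkward non-multiple-of-4 tail mentioned above.
    for (; i < buf.size(); i++)
        a0 += buf[i];
    // Pairwise combine, again reassociating.
    return (a0 + a1) + (a2 + a3);
}
```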

Ideally, allowing associative FP math reordering would get the compiler to optimize it to k = N * std::accumulate(...); like @Ped7g suggested, treating the sum over the array as a common subexpression.
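Written out by hand, that ideal optimization amounts to the following (a sketch assuming the same `N` and a buffer like the question's; the names are mine):

```cpp
#include <numeric>
#include <vector>

static constexpr int N = 1000000000;

// Sum the tiny buffer once, then scale by the outer trip count.
// A compiler may only do this itself when FP reassociation is permitted,
// because it changes rounding relative to the original loop nest.
static double sum_hoisted(const std::vector<double>& buffer)
{
    return N * std::accumulate(buffer.begin(), buffer.end(), 0.0);
}
```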


BTW, there are much better ways to initialize a vector:

Instead of resizing a vector (constructing new elements with the default constructor) and then writing new values, you should just do something like

std::vector<double> buffer(10, 1e-100);   // 10 elements set to 1e-100

That makes sure the asm doesn't waste time storing zeros before storing the value you want. I think resize can also take a value to copy into new elements, so you can still declare an empty vector and then resize it.
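Both options side by side, as a small sketch (my own helper names):

```cpp
#include <vector>

// Fill constructor: elements are constructed with the value directly,
// with no earlier pass storing zeros.
static std::vector<double> make_buffer_ctor()
{
    return std::vector<double>(10, 1e-100);
}

// resize also takes a fill value for the newly created elements.
static std::vector<double> make_buffer_resize()
{
    std::vector<double> buf;
    buf.resize(10, 1e-100);
    return buf;
}
```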
