Why is ONE basic arithmetic operation in for loop body executed SLOWER THAN TWO arithmetic operations?

Question

While I experimented with measuring the execution time of arithmetic operations, I came across very strange behavior. A code block containing a for loop with one arithmetic operation in the loop body was always executed slower than an identical code block with two arithmetic operations in the for loop body. Here is the code I ended up testing:

#include <iostream>
#include <chrono>

#define NUM_ITERATIONS 100000000

int main()
{
    // Block 1: one operation in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();

        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=31;}

        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    // Block 2: two operations in loop body
    {
        int64_t x = 0, y = 0;
        auto start = std::chrono::high_resolution_clock::now();

        for (long i = 0; i < NUM_ITERATIONS; i++) {x+=17; y-=37;}

        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end-start;
        std::cout << diff.count() << " seconds. x,y = " << x << "," << y << std::endl;
    }

    return 0;
}
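
For reference, a minimal way to build and run this at each optimization level (a sketch; the source file name and output name here are made up, and the flags match the levels described below):

g++ -O0 main.cpp -o bench && ./bench    # repeat with -O1, -O2, -O3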

I tested this with different levels of code optimization (-O0, -O1, -O2, -O3), with different online compilers (for example onlinegdb.com), on my work machine, on my home PC and laptop, on a Raspberry Pi, and on my colleague's computer. I rearranged these two code blocks, repeated them, changed constants, changed operations (+, -, <<, =, etc.), and changed integer types. But I always got a similar result: the block with one line in the loop is SLOWER than the block with two lines:

1.05681 seconds. x,y = 3100000000,0
0.90414 seconds. x,y = 1700000000,-3700000000

I checked the assembly output on https://godbolt.org/, and everything looked as I expected: the second block just had one more operation in the assembly output.

Three operations always behaved as expected: they are slower than one and faster than four. So why do two operations produce such an anomaly?

Let me repeat: I see this behaviour on all of my Windows and Unix machines, with the code not optimized. I looked at the assembly I execute (Visual Studio, Windows) and I see the instructions I want to test there. In any case, if the loop were optimized away, there would be nothing left in the code to ask about. I added the note about optimization to the question to avoid "do not measure unoptimized code" answers, because optimization is not what I am asking about. The question is really why my computers execute two operations faster than one, first of all in code where these operations are not optimized away. The difference in execution time is 5-25% in my tests (quite noticeable).

Answer

This effect only happens at -O0 (or with volatile), and is a result of the compiler keeping your variables in memory (not registers). You'd expect that to just introduce a fixed amount of extra latency into the loop-carried dependency chains through i, x, and y, but modern CPUs are not that simple.
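
To make that concrete, here is a sketch of the kind of inner loop gcc typically emits for Block 1 at -O0 (registers and stack offsets are illustrative): every operation becomes a load + ALU op + store through the stack, so each iteration depends on the previous iteration's stores:

        jmp     .L2
.L3:
        add     QWORD PTR [rbp-8], 31        # x += 31: load x, add, store back
        add     QWORD PTR [rbp-24], 1        # i++: a separate store/reload chain
.L2:
        cmp     QWORD PTR [rbp-24], 99999999 # i < 100000000
        jle     .L3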

On Intel Sandybridge-family CPUs, store-forwarding latency is lower when the load uop runs some time after the store whose data it's reloading, not right away. So an empty loop with the loop counter in memory is the worst case. I don't understand what CPU design choices could lead to that micro-architectural quirk, but it's a real thing.

This is basically a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, at least for Intel Sandybridge-family CPUs.

This is one of the major reasons why you shouldn't benchmark at -O0: the bottlenecks are different than in realistically optimized code. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why compilers make such terrible asm on purpose.

Micro-benchmarking is hard; you can only measure something properly if you can get compilers to emit realistically optimized asm loops for the thing you're trying to measure. (And even then you're only measuring throughput or latency, not both; those are separate things for single operations on out-of-order pipelined CPUs: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)
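
As a rough illustration of that latency-vs-throughput distinction (a sketch assuming a GNU-C-compatible compiler on x86-64; the function names are made up), one serial dependency chain measures latency, while independent accumulators measure throughput:

#include <cstdint>

uint64_t add_latency(uint64_t x, long n) {
    for (long i = 0; i < n; i++) {
        x += 1;                       // each add depends on the previous one
        asm volatile("" : "+r"(x));   // keep x in a register, defeat folding
    }
    return x;                         // time/n ~ add latency (1 cycle)
}

uint64_t add_throughput(uint64_t a, uint64_t b, uint64_t c, uint64_t d, long n) {
    for (long i = 0; i < n; i++) {
        a += 1; b += 1; c += 1; d += 1;   // four independent dependency chains
        asm volatile("" : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
    }
    return a + b + c + d;             // time/n limited by add throughput
}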

See @rcgldr's answer for measurement + explanation of what would happen with loops that keep variables in registers.

With clang, benchmark::DoNotOptimize(x1 += 31) also de-optimizes into keeping x in memory, but with GCC it just stays in a register. Unfortunately @SashaKnorre's answer used clang on QuickBench, not gcc, to get results similar to your -O0 asm. It does show the cost of lots of short NOPs being hidden by the bottleneck through memory, and a slight speedup when those NOPs delay the reload for the next iteration just long enough for store-forwarding to hit the lower-latency good case. (QuickBench, I think, runs on Intel Xeon server CPUs, with the same microarchitecture inside each CPU core as desktop versions of the same generation.)
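
A minimal sketch of that kind of experiment (assuming clang on QuickBench, where DoNotOptimize forces x1 through memory; the function name and NOP count are illustrative):

#include <benchmark/benchmark.h>
#include <cstdint>

static void StoreReloadWithNops(benchmark::State& state) {
  uint64_t x1 = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(x1 += 31);  // with clang: store + reload of x1
    asm volatile("nop\nnop\nnop");       // delay the next iteration's reload
                                         // so store-forwarding hits its faster case
  }
}
BENCHMARK(StoreReloadWithNops);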

Presumably all the x86 machines you tested on had Intel CPUs from the last 10 years, or else there's a similar effect on AMD. It's plausible there's a similar effect on whichever ARM CPU your RPi uses, if your measurements really were meaningful there. Otherwise, maybe another case of seeing what you expected (confirmation bias), especially if you tested with optimization enabled there.

I tested this with different levels of code optimization (-O0,-O1,-O2,-O3) [...] But I always got similar result

I added that optimizations notice in the question to avoid "do not measure not optimized code" answers because optimizations is not what I ask about.

(later from comments) About optimizations: yes, I reproduced that with different optimization levels, but as the loops were optimized away, the execution time was too fast to say for sure.

So actually you didn't reproduce this effect for -O1 or higher, you just saw what you wanted to see (confirmation bias) and mostly made up the claim that the effect was the same. If you'd accurately reported your data (measurable effect at -O0, empty timed region at -O1 and higher), I could have answered right away.

See Idiomatic way of performance evaluation? - if your times don't increase linearly with increasing repeat count, you aren't measuring what you think you're measuring. Also, startup effects (like cold caches, soft page faults, lazy dynamic linking, and dynamic CPU frequency) can easily lead to the first empty timed region being slower than the second.
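
A sketch of that linearity check, reusing the question's own timing pattern (the empty asm statement keeps x in a register and stops the loop from being optimized away; a GNU-C-compatible compiler is assumed):

#include <chrono>
#include <cstdint>
#include <initializer_list>
#include <iostream>

int main() {
    for (long n : {100000000L, 200000000L, 400000000L}) {
        int64_t x = 0;
        auto start = std::chrono::high_resolution_clock::now();
        for (long i = 0; i < n; i++) {
            x += 31;
            asm volatile("" : "+r"(x));  // defeat optimization without memory traffic
        }
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> diff = end - start;
        // real measurements should scale ~linearly with n
        std::cout << n << " iterations: " << diff.count() << " s\n";
    }
}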

I assume you only swapped the loops around when testing at -O0, otherwise you would have ruled out there being any effect at -O1 or higher with that test code.

As you can see on Godbolt, gcc fully removes the loop with optimization enabled. Sometimes GCC leaves empty loops alone, like maybe it thinks the delay was intentional, but here it doesn't even loop at all. Time doesn't scale with anything, and both timed regions look the same like this:

orig_main:
   ...
        call    std::chrono::_V2::system_clock::now()       # demangled C++ symbol name
        mov     rbp, rax                                    # save the return value = start
        call    std::chrono::_V2::system_clock::now()
        # end in RAX

So the only instruction in the timed region is saving start to a call-preserved register. You're measuring literally nothing about your source code.

With Google Benchmark, we can get asm that doesn't optimize the work away, but also doesn't store/reload to introduce new bottlenecks:

#include <benchmark/benchmark.h>
#include <cstdint>

static void TargetFunc(benchmark::State& state) {
  uint64_t x2 = 0, y2 = 0;
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    benchmark::DoNotOptimize(x2 += 31);
    benchmark::DoNotOptimize(y2 += 31);
  }
}
// Register the function as a benchmark
BENCHMARK(TargetFunc);
// Needed when building standalone (QuickBench supplies its own main)
BENCHMARK_MAIN();

# just the main loop, from gcc10.1 -O3 
.L7:                         # do{
        add     rax, 31        # x2 += 31
        add     rdx, 31        # y2 += 31
        sub     rbx, 1
        jne     .L7          # }while(--count != 0)

I assume benchmark::DoNotOptimize is something like asm volatile("" : "+rm"(x) ) (GNU C inline asm) to make the compiler materialize x in a register or memory, and to assume the lvalue has been modified by that empty asm statement. (i.e. forget anything it knew about the value, blocking constant-propagation, CSE, and whatever.) That would explain why clang stores/reloads to memory while GCC picks a register: this is a longstanding missed-optimization bug with clang's inline asm support. It likes to pick memory when given the choice, which you can sometimes work around with multi-alternative constraints like "+r,m". But not here; I had to just drop the memory alternative; we don't want the compiler to spill/reload to memory anyway.

For GNU C compatible compilers, we can use asm volatile manually with only "+r" register constraints to get clang to make good scalar asm (Godbolt), like GCC. We get an essentially identical inner loop, with 3 add instructions, the last one being an add rbx, -1 / jnz that can macro-fuse.

static void TargetFunc(benchmark::State& state) {
  uint64_t x2 = 0, y2 = 0;
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    x2 += 16;
    y2 += 17;
    // empty asm with register-only constraints: forces x2/y2 to live in
    // registers and makes the compiler assume they were modified
    asm volatile("" : "+r"(x2), "+r"(y2));
  }
}
BENCHMARK(TargetFunc);

All of these should run at 1 clock cycle per iteration on modern Intel and AMD CPUs, again see @rcgldr's answer.

Of course this also disables auto-vectorization with SIMD, which compilers would do in many real use cases. Or if you used the result at all outside the loop, it might optimize the repeated increment into a single multiply.
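
For example (a sketch; the function name is made up), once the sum is actually used, gcc and clang at -O2 typically fold the entire loop into a constant, so there is no loop left to time:

#include <cstdint>

uint64_t sum() {
    uint64_t x = 0;
    for (long i = 0; i < 100000000; i++) x += 31;
    return x;    // gcc/clang -O2: typically just  mov eax, 3100000000  /  ret
}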

You can't measure the cost of the + operator in C++ - it can compile very differently depending on context / surrounding code, even before considering loop-invariant code motion that hoists work out of loops. For example, x + (y<<2) + 4 can compile to a single LEA instruction on x86.
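
For instance (a sketch assuming the x86-64 System V calling convention, with x in rdi and y in rsi):

long f(long x, long y) { return x + (y << 2) + 4; }

# typical gcc/clang -O2 output:
#         lea     rax, [rdi+rsi*4+4]
#         ret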

The question is actually why my computers execute two operations faster than one, first of all in code where these operations are not optimized away

TL:DR: it's not the operations, it's the loop-carried dependency chain through memory that stops the CPU from running the loop at 1 clock cycle per iteration, doing all 3 adds in parallel on separate execution ports.

Note that the loop counter increment is just as much of an operation as what you're doing with x (and sometimes y).
