空循环比C中的非空循环慢 [英] Empty loop is slower than a non-empty one in C

查看:99
本文介绍了空循环比C中的非空循环慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当试图知道要执行多长时间的C代码时,我注意到了这个奇怪的事情:

While trying to know how long a line of C code used to execute, I noticed this weird thing :

int main (char argc, char * argv[]) {
    time_t begin, end;
    uint64_t i;
    double total_time, free_time;
    int A = 1;
    int B = 1;

    begin = clock();
    for (i = 0; i<(1<<31)-1; i++);
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    begin = clock();
    for (i = 0; i<(1<<31)-1; i++) {
        A += B%2;
    }
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    return(0);
}

执行时显示:

5.873425
4.826874

为什么空循环比第二个空循环使用更多的时间?当然,我已经尝试了多种变体,但是每次空循环要比在其中包含一条指令的循环花费更多的时间.

Why does the empty loop use more time than the second which has an instruction within ? Of course I've tried many variants but everytime, an empty loop takes more time than one with one single instruction within.

请注意,我已经尝试过交换循环顺序并添加一些热身代码,但这根本没有改变我的问题.

Note that I have tried swapping loops order and adding some warmup code and it didn't change my problem at all.

我将代码块与GNU gcc编译器,Linux ubuntu 14.04一起用作IDE,并具有2.3GHz的四核Intel i5(我已经尝试在单核上运行programm,这不会改变结果). /p>

I'm using codeblocks as IDE with GNU gcc compiler, linux ubuntu 14.04 and have a quadcore intel i5 at 2.3GHz (I've tried running the programm on a single core, this doesn't change the result).

推荐答案

事实是现代处理器很复杂.所有执行的指令将以复杂而有趣的方式相互交互.感谢那个其他人"发布代码.

The fact is that modern processors are complicated. All the instructions executed will interact with each other in complicated and interesting ways. Thanks for "that other guy" for posting the code.

OP和另一个家伙"显然都发现短循环需要11个周期,而长循环则需要9个周期.对于长循环,即使有很多操作,也需要9个周期的时间.对于短循环,一定是由于它太短而引起的停顿,仅添加nop会使循环足够长,从而避免停顿.

Both OP and "that other guy" apparently found that the short loop takes 11 cycles, while the long one takes 9 cycles. For the long loop, 9 cycles is plenty of time even though there are lot of operations. For the short loop, there must be some stall caused by it being so short, and just adding a nop makes the loop long enough to avoid the stall.

如果我们看一下代码,将会发生一件事:

One thing that happens if we look at the code:

0x00000000004005af <+50>:    addq   $0x1,-0x20(%rbp)
0x00000000004005b4 <+55>:    cmpq   $0x7fffffff,-0x20(%rbp)
0x00000000004005bc <+63>:    jb     0x4005af <main+50>

我们读取i并将其写回(addq).我们立即再次读取它,并进行比较(cmpq).然后我们循环.但是循环使用分支预测.因此,在执行addq时,处理器并不真正确定是否允许写入i(因为分支预测可能是错误的).

We read i and write it back (addq). We read it immediately again, and compare it (cmpq). And then we loop. But the loop uses branch prediction. So at the time when the addq is executed, the processor isn't really sure it is allowed to write to i (because branch prediction could be wrong).

然后我们将其与i进行比较.处理器将尝试避免从内存中读取i,因为读取需要很长时间.取而代之的是,一些硬件会记住我们只是通过添加i来写入i,而cmpq指令不是读取i,而是从存储指令中获取数据.不幸的是,我们不确定在这一点上是否确实写入了i!因此,这可能会在这里造成停滞.

Then we compare to i. The processor will try to avoid reading i from memory, because reading it takes a long time. Instead some bit of hardware will remember that we just wrote to i by adding to it, and instead of reading i, the cmpq instruction gets the data from the store instruction. Unfortunately, we are not sure at this point if the write to i actually happened or not! So that could introduce a stall here.

这里的问题是条件跳转,导致条件存储的addq和不确定从何处获取数据的cmpq都非常接近.他们异常靠近.可能是因为它们之间的距离太近,因此处理器目前无法确定是从存储指令中提取i还是从内存中读取它.并从内存中读取数据,这很慢,因为它必须等待存储完成.而仅添加一个nop可以给处理器足够的时间.

The problem here is that the conditional jump, the addq which leads to a conditional store, and the cmpq which isn't sure where to get the data from, are all very very close together. They are unusually close together. It could be that they are so close together, the processor cannot figure out at this time whether to take i from the store instruction or to read it from memory. And reads it from memory, which is slower because it has to wait for the store to finish. And adding just one nop gives the processor enough time.

通常,您认为有RAM,并且有缓存.在现代的Intel处理器上,读取内存可以从(最慢到最快)读取:

Usually you think that there is RAM, and there is cache. On a modern Intel processor, reading memory can read from (slowest to fastest):

  1. 内存(RAM)
  2. L3缓存(可选)
  3. 二级缓存
  4. L1缓存
  5. 尚未写入L1缓存的先前存储指令.

因此,处理器在短而慢的循环中内部执行的操作:

So what the processor does internally in the short, slow loop:

  1. 从一级缓存中读取i
  2. 将1加到i
  3. i写入L1缓存
  4. 等待,直到将i写入L1缓存
  5. 从一级缓存中读取i
  6. 比较i与INT_MAX
  7. 如果小于则转到(1).
  1. Read i from L1 cache
  2. Add 1 to i
  3. Write i to L1 cache
  4. Wait until i is written to L1 cache
  5. Read i from L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less.

在长而快速的循环中,处理器会执行以下操作:

In the long, fast, loop the processor does:

  1. 很多东西
  2. 从L1缓存中读取i
  3. 将1加到i
  4. 执行存储"指令,将i写入L1缓存
  5. 直接从存储"指令中读取i,而无需触摸L1缓存
  6. 比较i与INT_MAX
  7. 如果小于则转到(1).
  1. Lots of stuff
  2. Read i from L1 cache
  3. Add 1 to i
  4. Do a "store" instruction that will write i to L1 cache
  5. Read i directly from the "store" instruction without touching L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less.

这篇关于空循环比C中的非空循环慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆